Update uabrc_check_hw_context_switch_rate error handling

Mike Hanby requested to merge update-context-switching-code-error-handling into main

The underlying issue: when the Prometheus server was not responding, nodes would drain with the reason NHC: Script timed out while executing "uabrc_check_hw_context_switch_rate 300000 5m". This change is intended to address that.

  • Moved uabrc_hw.nhc to the scripts/ directory for better organization
  • Updated uabrc_hw.nhc to better handle situations where the Prometheus server is not healthy/responsive
  • Updated nhc.conf to pass two new arguments to uabrc_check_hw_context_switch_rate

The Prometheus health check, implemented in the new uabrc_hw_prom_srv_health() function, queries the http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy endpoint and returns its result to uabrc_check_hw_context_switch_rate as PROMETHEUS_IS_HEALTHY. If PROMETHEUS_IS_HEALTHY is non-zero (-ne 0), uabrc_check_hw_context_switch_rate hands control back to NHC by calling nhcmain_finish, which exits the loop without draining the node.

TODO: What happens when the server responds to the following request with a message other than "Prometheus Server is Healthy."?

$ curl http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy
Prometheus Server is Healthy.
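
The health check flow described above could be sketched roughly as follows. This is a minimal illustration, not the code in uabrc_hw.nhc: the function names prom_srv_health_sketch and check_context_switch_rate_sketch, the 5-second default timeout, and the use of "return 0" in place of the nhcmain_finish call are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names and timeout are assumptions, not the
# actual contents of uabrc_hw.nhc.

# Exit status 0 if the /-/healthy endpoint answers successfully within
# the timeout; non-zero otherwise (connection failure, timeout, HTTP error).
prom_srv_health_sketch() {
    local srv="$1" port="$2" timeout="${3:-5}"
    # --fail       : exit non-zero on an HTTP error status (>= 400)
    # --max-time   : bound the whole request so the check cannot hang
    curl --silent --fail --output /dev/null \
         --connect-timeout "$timeout" --max-time "$timeout" \
         "http://${srv}:${port}/-/healthy"
}

# Caller side: skip the context switch check when Prometheus is unavailable.
check_context_switch_rate_sketch() {
    local threshold="$1" window="$2" srv="$3" port="$4"
    if ! prom_srv_health_sketch "$srv" "$port"; then
        echo "Prometheus at ${srv}:${port} is not healthy; skipping check" >&2
        # Stand-in for handing control back to NHC without draining the node.
        return 0
    fi
    # ... query Prometheus for the rate over $window and compare it
    # against $threshold here ...
    return 0
}
```

Relying on curl's exit status rather than on the response body is one way to avoid depending on the exact wording of the healthy message, which may be relevant to the TODO above. The explicit timeouts keep an unresponsive server from stalling the check until NHC's own script timeout fires.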

The updated script also takes two new arguments, PROMETHEUS_SRV and PROMETHEUS_PORT, which must be passed via nhc.conf:

   * || uabrc_check_hw_context_switch_rate 300000 5m grafana.ops.rc.uab.edu 9090
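
Inside the check, the four positional arguments from the nhc.conf line above might be read and validated along these lines. This is only a sketch: the function name parse_args_sketch and the variable names THRESHOLD and WINDOW are hypothetical, and the real script may handle its arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of argument handling; THRESHOLD and WINDOW are
# assumed names for the first two arguments (e.g. 300000 and 5m).

parse_args_sketch() {
    if [ "$#" -ne 4 ]; then
        echo "usage: THRESHOLD WINDOW PROMETHEUS_SRV PROMETHEUS_PORT" >&2
        return 2
    fi
    THRESHOLD="$1"
    WINDOW="$2"
    PROMETHEUS_SRV="$3"
    PROMETHEUS_PORT="$4"
    # Reject an empty port or one containing non-digit characters.
    case "$PROMETHEUS_PORT" in
        ''|*[!0-9]*)
            echo "invalid port: $PROMETHEUS_PORT" >&2
            return 2
            ;;
    esac
    return 0
}
```

Failing early on a missing or malformed server or port makes a misconfigured nhc.conf entry visible right away, rather than surfacing later as a failed request.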
