Update uabrc_check_hw_context_switch_rate error handling
The underlying issue: nodes would drain with the reason NHC: Script timed out while executing "uabrc_check_hw_context_switch_rate 300000 5m"
when the Prometheus server was not responding. This change is intended to fix that.
- Moved uabrc_hw.nhc to the scripts/ directory for better organization
- Updated uabrc_hw.nhc to better handle situations where the Prometheus server is not healthy or responsive
- Updated nhc.conf to add two new arguments passed to uabrc_check_hw_context_switch_rate
The Prometheus health check is done in the new uabrc_hw_prom_srv_health() function, which queries the http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy endpoint. The function returns its result as PROMETHEUS_IS_HEALTHY back to uabrc_check_hw_context_switch_rate; a failed check yields a non-zero value. If PROMETHEUS_IS_HEALTHY is non-zero (-ne 0), the function exits back to NHC with a call to nhcmain_finish, which exits the loop without draining the node.
TODO: What happens when the server responds to the following with a message other than "Prometheus Server is Healthy."?
$ curl http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy
Prometheus Server is Healthy.
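
For reviewers, here is a minimal sketch of the control flow described above. It is not the actual code in uabrc_hw.nhc: the function names ending in _sketch, the 5-second timeout, and the log message are all assumptions made for illustration. One possible answer to the TODO is to rely on curl's exit status with --fail (which keys off the HTTP status code) rather than matching the response body, so a differently worded body would not matter. The sketch uses a plain return where the real check calls nhcmain_finish, since that function only exists inside NHC.

```shell
# Hypothetical sketch, not the real uabrc_hw.nhc implementation.

# Succeeds (exit 0) only if /-/healthy answers with a non-error HTTP status
# within the timeout. $1 = server, $2 = port, $3 = timeout in seconds
# (assumed default: 5).
prom_srv_health_sketch() {
    curl --silent --fail --max-time "${3:-5}" --output /dev/null \
        "http://$1:$2/-/healthy"
}

# $1 = threshold, $2 = window, $3 = Prometheus server, $4 = Prometheus port
check_context_switch_rate_sketch() {
    local threshold="$1" window="$2" srv="$3" port="$4"
    local PROMETHEUS_IS_HEALTHY=0

    # Capture the health check's exit status without aborting under `set -e`.
    prom_srv_health_sketch "$srv" "$port" || PROMETHEUS_IS_HEALTHY=$?

    if [ "$PROMETHEUS_IS_HEALTHY" -ne 0 ]; then
        # Stand-in for the nhcmain_finish call: stop here and report success
        # so the node is not drained just because monitoring is down.
        echo "Prometheus at ${srv}:${port} is not healthy; skipping check" >&2
        return 0
    fi

    # The real context-switch-rate query would run here.
    echo "would query ${srv}:${port} for threshold=${threshold} window=${window}"
}
```

The --max-time bound is what keeps the check from hanging until NHC's own script timeout fires, which was the original cause of the drain.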
The updated script also takes two new arguments, PROMETHEUS_SRV and PROMETHEUS_PORT, that need to be passed via nhc.conf:
* || uabrc_check_hw_context_switch_rate 300000 5m grafana.ops.rc.uab.edu 9090
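
Because an nhc.conf line written for the old two-argument form would now leave the server and port empty, a guard on the arguments may be worth considering. The sketch below is an assumption about how that could look, not what the script currently does; parse_check_args_sketch and its error messages are hypothetical names.

```shell
# Hypothetical sketch: validate the four positional arguments and print the
# health-check URL they imply. $1 and $2 are the existing threshold and
# window; $3 and $4 are the new PROMETHEUS_SRV and PROMETHEUS_PORT.
parse_check_args_sketch() {
    if [ "$#" -ne 4 ]; then
        echo "expected: <threshold> <window> <prometheus_srv> <prometheus_port>" >&2
        return 2
    fi

    local PROMETHEUS_SRV="$3" PROMETHEUS_PORT="$4"

    # Reject an empty or non-numeric port before building the URL.
    case "$PROMETHEUS_PORT" in
        ''|*[!0-9]*)
            echo "invalid Prometheus port: ${PROMETHEUS_PORT}" >&2
            return 2
            ;;
    esac

    echo "http://${PROMETHEUS_SRV}:${PROMETHEUS_PORT}/-/healthy"
}
```

Called with the arguments from the nhc.conf line above, this prints http://grafana.ops.rc.uab.edu:9090/-/healthy; called with only the old two arguments, it returns 2 instead of attempting a request against an empty host.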