I’ve written (with some help from gemini.google and the check_mk chatbot) a plugin to add some slurm grid checks on my slurm controller node. The output in check_mk_agent looks like:
<<<slurm>>>
0 slurm_node_blade-04-01 - OK - blade-04-01 is idle
0 slurm_node_blade-04-02 - OK - blade-04-02 is idle
0 slurm_node_blade-04-03 - OK - blade-04-03 is idle
<snipped out 89 other nodes>
0 slurm_node_states idle_nodes=91|mixed_nodes=0|allocated_nodes=0|down_nodes=0|other_nodes=0 - OK - All nodes are idle, mixed or allocated.
0 slurm_slurmctld_service - OK - slurmctld on gridboss is active.
That’s working nicely. The next step is to get check_mk to see the section and inventory the services, and that’s where I’m getting stuck. The Python code that I’m getting out of gemini and the chatbot seems okay, but running a new inventory on the client never shows any slurm services.
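For reference, here is a minimal sketch of what the server side has to do with that section, written in plain Python so it runs outside a site. The names `SlurmEntry`, `parse_slurm`, `discover_slurm` and `check_slurm` are made up for illustration, and the plain tuples are stand-ins: an actual plugin would yield `Service` and `Result` objects from `cmk.agent_based.v2` and register the functions through module-level objects whose names start with `agent_section_` and `check_plugin_`. It also assumes the agent lines arrive split on whitespace.

```python
from typing import Dict, Iterator, List, NamedTuple, Tuple


class SlurmEntry(NamedTuple):
    """One parsed line of the <<<slurm>>> section (hypothetical structure)."""
    status: int
    metrics: Dict[str, float]
    summary: str


def parse_slurm(string_table: List[List[str]]) -> Dict[str, SlurmEntry]:
    """Turn whitespace-split agent lines into a dict keyed by service name."""
    parsed: Dict[str, SlurmEntry] = {}
    for row in string_table:
        # Expect at least: status, name, perfdata (or "-")
        if len(row) < 3 or not row[0].isdigit():
            continue
        status, name, perf = int(row[0]), row[1], row[2]
        metrics: Dict[str, float] = {}
        if perf != "-":
            for pair in perf.split("|"):
                key, _, value = pair.partition("=")
                try:
                    metrics[key] = float(value)
                except ValueError:
                    continue  # skip anything that is not name=number
        text = row[3:]
        # Drop separator dashes in front of the status text
        while text and text[0] == "-":
            text = text[1:]
        parsed[name] = SlurmEntry(status, metrics, " ".join(text))
    return parsed


def discover_slurm(section: Dict[str, SlurmEntry]) -> Iterator[str]:
    """Yield one item per parsed line; an empty section yields nothing."""
    for name in sorted(section):
        yield name


def check_slurm(item: str, section: Dict[str, SlurmEntry]) -> Iterator[Tuple[int, str]]:
    """Yield (status, summary) for the item, or nothing if it is missing."""
    entry = section.get(item)
    if entry is None:
        return
    yield entry.status, entry.summary
```

If the parse step returns an empty dict, discovery has nothing to yield, so feeding a few real agent lines through the parse function by hand is a reasonable first check.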
The eventual goal is to have all the nodes monitored for their state within the slurm grid, keep an eye on the slurmctld service, and provide some perf data about the node states in a color-coded stacked area graph.
My first run at this was with a local check that worked, but I couldn’t get the graph part of it working: I always ended up with separate graphs for the five metrics. I decided to try working up a plugin to see if that would yield different results, and hit this roadblock.
Gonna revert to the local checks for now and hope somebody can tell me where I’m going wrong.
The current version that I’ve gotten to is:
/opt/omd/sites/cmk7309/local/lib/python3/cmk_addons/plugins/slurm/agent_based/slurm.py
slurm.py (1.7 KB)
The site is 2.4.0p11.cre running on an Alma Linux 9.6 system.