How can I deploy an agent plugin (nvidia GPU)?

We’re using Checkmk v2.4.0p12 Enterprise Edition.

Problem: I can’t figure out how to deploy an agent plugin.

I downloaded the nvidia_smi plugin from the Checkmk Exchange and enabled it. It now shows up as:

SAL@lnzcheckmk01:[~/nvidia-smi-plugin]\> mkp list
Name                Version Title                     Author          Req. Version Until Version Files State
------------------- ------- ------------------------- --------------- ------------ ------------- ----- -------------------------------
nvidia_smi          1.0.4   nvidia_smi                Paul à Brassard 2.4.0p5      None          4     Enabled (active on this site)


I’m now looking at the agent rules as a starting point, and I assume I’m in the correct place. What next?

E.g. I see a predefined NVIDIA GPU monitoring rule for Windows under Agent plug-ins.
How can I get the same for Linux and deploy it to specific nodes?

I’m looking for a menu item like “create agent plugin rule”.

Thanks a lot!

The extension likely does not have a bakery plugin and therefore no rule in the Agents section.

You will have to deploy the agent plugin yourself. It should be in $OMD_ROOT/local/share/check_mk/agents/plugins/.

There is no menu to create a new agent plugin rule. You would need to implement it.

Alright, thanks. The installer script provided by the developer has already put the needed plugin files in the proper places on the Checkmk server, so I already see this:

SAL@lnzcheckmk01:[~/nvidia-smi-plugin]\> ls $OMD_ROOT/local/share/check_mk/agents/plugins/
etxsvr-check*  nvidia-smi*

The docs describe the manual installation as:

# Agent-based plugin
cp agent_based/nvidia_smi.py /omd/sites/SITENAME/local/lib/check_mk/base/plugins/agent_based/

# Agent plugin
cp agent/nvidia-smi /omd/sites/SITENAME/local/share/check_mk/agents/plugins/
chmod +x /omd/sites/SITENAME/local/share/check_mk/agents/plugins/nvidia-smi

# Documentation
cp checkman/nvidia-smi /omd/sites/SITENAME/local/share/check_mk/checkman/

# Web metrics
cp web/plugins/metrics/nvidia-smi.py /omd/sites/SITENAME/local/share/check_mk/web/plugins/metrics/
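The `chmod +x` step above matters because the agent only runs plugin files that are executable. A minimal sketch for checking that, assuming you point it at the plugin directory yourself; the helper name `non_executable_plugins` is made up for illustration and is not part of Checkmk:

```python
import stat
from pathlib import Path


def non_executable_plugins(plugin_dir: Path) -> list[str]:
    """Return the names of regular files in plugin_dir without the owner execute bit."""
    missing = []
    for entry in sorted(plugin_dir.iterdir()):
        # Only regular files are plugins; subdirectories are skipped here.
        if entry.is_file() and not entry.stat().st_mode & stat.S_IXUSR:
            missing.append(entry.name)
    return missing
```

On the node you could call it as `non_executable_plugins(Path("/usr/lib/check_mk_agent/plugins"))`; an empty list means every plugin file there has the execute bit set.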

but what next? :slight_smile:

E.g. I already copied the plugin file manually to a node and also ran a service discovery, but nothing shows up in the UI.

root@lnzsdfml04:/usr/lib/check_mk_agent/plugins# ls
3600  mk_inventory  nvidia_smi

root@lnzsdfml04:/usr/lib/check_mk_agent/plugins# check_mk_agent |grep -i nvidia
...
<<<nvidia_smi>>>
0 NVIDIA-A100-80GB-PCIe 0 0 0.0 0 0 32 0 300.0 0 92.00.68.00.01
1 NVIDIA-A100-80GB-PCIe 0 0 0.69 0 0 39 0 300.0 1 92.00.68.00.01
2 NVIDIA-A100-80GB-PCIe 0 0 0.0 0 0 30 0 300.0 0 92.00.68.00.01
3 NVIDIA-A100-80GB-PCIe 0 0 0.0 0 0 31 0 300.0 0 92.00.68.00.01
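The agent output above can also be checked programmatically, which helps to confirm that the section header is exactly `nvidia_smi`, since that is the name the check plugin on the server has to match. A rough sketch, assuming the output was captured as text first; `split_sections` is a hypothetical helper, not a Checkmk function:

```python
def split_sections(agent_output: str) -> dict[str, list[str]]:
    """Group the lines of raw agent output by their <<<section>>> header."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in agent_output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            # Drop header options such as ":sep(44)" and keep only the name.
            current = line[3:-3].split(":", 1)[0]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


sample = """<<<nvidia_smi>>>
0 NVIDIA-A100-80GB-PCIe 0 0 0.0 0 0 32 0 300.0 0 92.00.68.00.01
1 NVIDIA-A100-80GB-PCIe 0 0 0.69 0 0 39 0 300.0 1 92.00.68.00.01
"""
gpu_lines = split_sections(sample).get("nvidia_smi", [])
```

If `gpu_lines` is non-empty, the agent side is doing its job and the problem is more likely on the server side (section parsing or check plugin).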

According to the developer’s docs (paul / nvidia-smi-plugin · GitLab), and I assume he already did this for himself, deployment works like this:

## Agent Deployment

After installation, the plugin will be available in the CheckMK Agent Bakery:

1. Go to Setup → Agents → Windows, Linux, Solaris, AIX
2. Create or edit an agent configuration
3. The nvidia-smi plugin should appear in available plugins
4. Enable it for hosts with NVIDIA GPUs
5. Bake and deploy agents

I need to know how to accomplish step 2.

I haven’t installed the mkp and don’t use the bakery, but you might be able to achieve your goal this way.


search for nvidia

I think the mkp you installed replaces the nvidia_smi.py plugin supplied by Checkmk. This means you can use Checkmk’s Bakery rules, and since it’s a Python script, it also runs on Linux. Don’t be distracted by the “(windows)” in the rule name.

Have you read the troubleshooting section at “paul / nvidia-smi-plugin · GitLab”?

Yes, I’m reading into that, thanks.

I’ve spotted a crash… not good :astonished:

SAL@lnzcheckmk01:[~/var/check_mk/crashes/section/2e6b13c4-a5d3-11f0-a18c-005056bad0c5]\> less crash.info

Found the issue. Although the published docs state that the plugin is v2 compatible (Checkmk 2.4 or higher), it is not.

E.g. the published nvidia_smi.py still imports from the v1 API:

from .agent_based_api.v1.type_defs import (
    CheckResult,
    DiscoveryResult,
)

from .agent_based_api.v1 import (
    register,
    render,
    Result,
    Metric,
    State,
    Service,
)

I had to rewrite this to be v2 compatible:

from cmk.agent_based.v2 import (
    AgentSection,
    CheckPlugin,
    CheckResult,
    DiscoveryResult,
    Metric,
    Result,
    Service,
    State,
    StringTable,
)

Now it works, and the monitoring is visible in the UI.
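For anyone doing the same port: besides the imports, the v2 API no longer has `register`, so a `register.agent_section(...)` call becomes a module-level `AgentSection` object whose variable name starts with `agent_section_`. A minimal sketch of that part, not the plugin author’s actual code. Only the first two columns (GPU index and model name) are interpreted, because the meaning of the remaining columns is an assumption I don’t want to make here; the try/except is only there so the snippet also runs outside a Checkmk site:

```python
try:
    # Available only inside a Checkmk site that ships the v2 API.
    from cmk.agent_based.v2 import AgentSection, StringTable
except ImportError:
    AgentSection = None
    StringTable = list[list[str]]  # same shape as the real type alias


def parse_nvidia_smi(string_table: StringTable) -> dict[str, dict]:
    """Index the rows by GPU number; the remaining columns are kept unparsed."""
    gpus = {}
    for row in string_table:
        if len(row) < 2:
            continue  # skip incomplete lines instead of crashing the section
        gpus[row[0]] = {"name": row[1], "raw": row[2:]}
    return gpus


if AgentSection is not None:
    # v2 style: an object found by its variable name prefix.
    agent_section_nvidia_smi = AgentSection(
        name="nvidia_smi",
        parse_function=parse_nvidia_smi,
    )
```

The check itself would follow the same pattern with a `check_plugin_...` variable holding a `CheckPlugin` object. Note that the parse function’s parameter has to be named `string_table`.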