Monitoring NVIDIA SMI from a Linux host

Hi all, is anyone aware of a plugin to monitor nvidia-smi from a Linux host? I know of the following, but they are quite old and do not work on 2.2.*

Checkmk Exchange

Same problem here… Can't get it to work with 2.2…

How can we set up GPU monitoring on Red Hat systems?

Hello @Error404m ,
And welcome to the Community!

I think this seems more like a general question that would be better answered in a separate topic or in a more similar topic created earlier.

Please use the forum search to take a look at how the question has been approached by the other users – it is likely this has been discussed before.

If there is no answer, I would suggest creating a topic in the general category with some more details on what you would like to achieve. That way more experienced community members might be able to help you out :slight_smile:

Hi Sara,

I think the topic is still very relevant and is described perfectly well in this thread.
We have the same problem: how to monitor NVIDIA GPU cards on Linux.
You may have heard about the recent hype around AI: all AI machines need NVIDIA GPUs, and almost all of them run Linux.
Since there is already a plugin for NVIDIA on Windows, maybe Checkmk could create a Linux version as well.

Thanks

Why not use the Windows script, ported to bash, on Linux?
The command “nvidia-smi” is exactly the same on Linux.

Yes, good idea, but who creates the plugin for that?

Why should someone create a new plugin?
The check already exists inside CMK.
You only need to produce the same output as on Windows and it will work.

Are you saying the normal CMK agent on Linux provides the GPU values using nvidia-smi?

Not the agent: the script, which you would port from Windows PowerShell to Linux bash, provides the data. Rewrite the script in bash and put it in the "/usr/lib/check_mk_agent/plugins/" folder. Then you have the check results in your monitoring.
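
For anyone who wants to script that step, here is a minimal sketch of installing a plugin file. The function name `install_plugin` is made up for illustration, and the plugin directory is passed in as an argument because it can differ between installations, so check where your agent expects plugins before using it.

```shell
#!/bin/bash
# Sketch only: copy a plugin script into the agent's plugin directory
# and make it executable. install_plugin is a hypothetical helper name.
install_plugin() {
    local src="$1"         # path to the plugin script
    local plugin_dir="$2"  # the agent's plugin directory on your system

    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    [ -d "$plugin_dir" ] || { echo "no such directory: $plugin_dir" >&2; return 1; }

    # install(1) copies the file and sets its mode in one step
    install -m 0755 "$src" "$plugin_dir/$(basename "$src")"
}
```

Called as `install_plugin ./nvidia_smi <plugin directory>` (as root), this leaves an executable copy in the plugin directory, which the agent then runs on its next invocation.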

Porting a script is really out of my scope!
Either Checkmk is interested in this topic or somebody else is, but my tasks are very different from becoming a programmer of CMK plugins!
And even if I find someone who would port it, one would then need to make a plugin out of it that is integrated into the current CMK version. All of this is far out of my range.

Just port this agent plugin from PowerShell to Linux bash.

That's all you have to do.

As I said before, the plugin is already inside CMK. No need to program anything.

This is a one-shot ChatGPT translation of the script :smiley: You only need to fix some paths.

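#!/bin/bash
# Linux port of the Windows nvidia_smi agent plugin. Adjust the paths to your system.

# Section header expected by the built-in nvidia_smi check; sep(9) sets the separator to a tab.
echo "<<<nvidia_smi:sep(9)>>>"

# MK_CONFDIR is normally exported by the agent; /etc/check_mk is the usual Linux default.
MK_CONFDIR="${MK_CONFDIR:-/etc/check_mk}"
# Assumed file name: an optional config file in shell syntax that sets nvidia_smi_path.
# (The .ps1 config of the Windows plugin cannot be sourced by bash.)
CONFIG_FILE="${MK_CONFDIR}/nvidia_smi.cfg"
DEFAULT_NVIDIA_SMI_PATH="/usr/local/cuda/bin/nvidia-smi"

if [[ -f "$CONFIG_FILE" ]]; then
    source "$CONFIG_FILE"
fi

# 1. Path from the config file, if set and executable
if [[ -x "$nvidia_smi_path" ]]; then
    "$nvidia_smi_path" -q -x
    exit
fi

# 2. Default path
if [[ -x "$DEFAULT_NVIDIA_SMI_PATH" ]]; then
    "$DEFAULT_NVIDIA_SMI_PATH" -q -x
    exit
fi

# 3. Whatever is found on the system PATH
if command -v nvidia-smi &>/dev/null; then
    nvidia-smi -q -x
    exit
fi

echo "ERROR: nvidia-smi was not found in:"
echo "- $nvidia_smi_path (configured path)"
echo "- $DEFAULT_NVIDIA_SMI_PATH (default path)"
echo "- system PATH"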

Just wanted to add to what Andreas said.
As a test, you can put the following in a test.sh (marked executable) under /usr/lib/check_mk_agent/plugins/:

#!/bin/bash
echo "<<<nvidia_smi:sep(9)>>>"
# If nvidia-smi is not in a standard path, adjust the call below
nvidia-smi -q -x

Then you may see the following services in Checkmk out of the box like in my test environment:

If this works for you then please share your feedback.
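
If the services do not show up, it can help to check whether the section actually appears in the agent output. Below is a minimal sketch for that. `extract_section` is a made-up helper name, and it only assumes that agent sections start with a `<<<name>>>` or `<<<name:options>>>` header line, as in the plugins in this thread.

```shell
#!/bin/bash
# Sketch only: print the body of one agent section read from stdin.
# extract_section is a hypothetical helper, not part of Checkmk.
extract_section() {
    awk -v name="$1" '
        /^<<<.*>>>$/ {
            # Strip the <<< >>> markers and any ":option" suffix from the header
            header = substr($0, 4, length($0) - 6)
            split(header, parts, ":")
            in_section = (parts[1] == name)
            next
        }
        in_section { print }
    '
}
```

On the monitored host, something like `check_mk_agent | extract_section nvidia_smi | head` should then print the first lines of the nvidia-smi XML. Empty output suggests the plugin is not being run or nvidia-smi is not found.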

We already use a plugin much like what is recommended in this thread. It also includes a little hack, because NVIDIA changed some XML elements in the nvidia-smi output:

#!/bin/bash
#
# Make Werk 14723 available to Linux
 
inpath() {
     # replace "if type [somecmd]" idiom
     # 'command -v' tends to be more robust vs 'which' and 'type' based tests
     command -v "${1:?No command to test}" >/dev/null 2>&1
}
 
section_nvidia_smi() {
     if inpath nvidia-smi; then
         echo '<<<nvidia_smi:sep(9)>>>'
         nvidia-smi -q -x
     fi
}
 
# Line for the agent
#[ -z "${MK_SKIP_NVIDIA}" ] && _log_section_time section_nvidia_smi
 
section_nvidia_smi \
        | sed \
        -e 's/power_readings>/power_readings>\n<power_management>Supported<\/power_management>/' \
        -e 's/gpu_power_readings/power_readings/'

That pipe through sed makes the XML output parsable by the current python code.
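
To see what the two sed expressions do, here is a small sketch that runs them over a hand-written fragment. The sample XML is a heavily shortened, hypothetical stand-in for real nvidia-smi output, and `fix_power_readings` is just a wrapper name chosen for this example. It relies on GNU sed, which interprets `\n` in the replacement as a newline.

```shell
#!/bin/bash
# Sketch only: the same two sed expressions as in the plugin above,
# wrapped in a function so they can be tried on sample input.
fix_power_readings() {
    sed \
        -e 's/power_readings>/power_readings>\n<power_management>Supported<\/power_management>/' \
        -e 's/gpu_power_readings/power_readings/'
}

# Hypothetical, shortened sample of the newer XML layout
sample='<gpu id="00000000:01:00.0">
<gpu_power_readings>
<power_draw>25.00 W</power_draw>
</gpu_power_readings>
</gpu>'

printf '%s\n' "$sample" | fix_power_readings
```

The result renames `gpu_power_readings` back to `power_readings` and adds a `<power_management>Supported</power_management>` line after the opening tag. Note that the first expression also matches the closing tag, so a second copy of that line ends up after `</power_readings>`.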

There are two pull requests:

  1. Add this code (without the sed hack) to the Linux agent: Add feature: auto-discovery for nvidia-smi on linux by mayrstefan · Pull Request #680 · Checkmk/checkmk · GitHub
  2. Address changed XML structure of current nvidia-smi versions: Fix parsing of current nvidia_smi section by mayrstefan · Pull Request #681 · Checkmk/checkmk · GitHub

The next release will include Werk #16652: NVIDIA Graphics Card: Fix parsing error on new data format.