Plugin (extension package MKP) for NVIDIA GPU nodes

Hi!

we’re running v2.0.0p28.cee

For the first time I'm trying to implement a plugin (MKP) for NVIDIA GPU nodes.
For those who are interested, this is the guide I'm following from TU Kaiserslautern: High Performance Computing with "Elwetritsch" at the University of Kaiserslautern-Landau

But whatever I try when following this guide, I can't get the numbers into WATO (or the GUI).

I assume I'm putting the parts in the wrong place.

What I did so far:

I uploaded the MKP file. Then:

OMD[SAL]:~$ mkp list
nvidia_smi

On the target node I put this in place:
/usr/lib/check_mk_agent/plugins/nvidia_smi

On the Checkmk server I did the same (here I'm unsure: is this necessary?):
/usr/lib/check_mk_agent/plugins/nvidia_smi

My test on the target node reveals:

user@salllgpuc01:[~]\> check_mk_agent |grep -i nvidia
....
<<<nvidia_smi>>>
0 NVIDIAA100-PCIE-40GB N/A 0 0   28 41.22 250.00
1 NVIDIAA100-PCIE-40GB N/A 0 0   25 36.97 250.00
2 NVIDIAA100-PCIE-40GB N/A 0 0   25 39.14 250.00
3 NVIDIAA100-PCIE-40GB N/A 0 0   26 43.86 250.00
OMD[SAL]:~$ cmk -II salllgpuc01
OMD[SAL]:~$ cmk -O salllgpuc01
OMD[SAL]:~$ cmk -R salllgpuc01
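
For context: the agent-side plugin is essentially a small script that calls nvidia-smi and prints the values under a <<<nvidia_smi>>> section header. A stripped-down sketch of such a script (not the actual one from the guide, and with the queried fields only guessed from the output above) looks roughly like this:

#!/usr/bin/env python3
# Minimal illustrative agent plugin that emits a <<<nvidia_smi>>> section.
# NOTE: the queried GPU fields below are an assumption; the real plugin
# from the guide/MKP may use different columns and ordering.
import subprocess
import sys

QUERY = "index,name,utilization.gpu,memory.used,temperature.gpu,power.draw,power.limit"

def main() -> int:
    try:
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # No nvidia-smi or no GPU: print nothing, the section is simply absent.
        return 0
    sys.stdout.write("<<<nvidia_smi>>>\n")
    for line in out.splitlines():
        # nvidia-smi separates fields with ", "; the check splits on whitespace,
        # so strip the commas and remove blanks inside the GPU name.
        fields = [field.strip().replace(" ", "") for field in line.split(",")]
        sys.stdout.write(" ".join(fields) + "\n")
    return 0

if __name__ == "__main__":
    sys.exit(main())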

And now I'm uncertain what else needs to be done.

Do I need to bake a new agent and deploy it?

When I bake a new agent, the change is not reflected in it: I bake the agent and download it locally, but it is the same baked agent as before I uploaded the MKP file.

Thanks for any help.

Hi,
did you also do this part?

Also, I assume you made the plugin executable, since you can see it in the agent output, so the plugin works fine on the host.

Best regards,
JF

Yes, I put the mentioned check script on the Checkmk server here:

OMD[SAL]:~/share/check_mk/checks$ ls -l
-rw-r--r-- 1 root root  2997 Aug 30 18:07 nvidia
-rw-r--r-- 1 root root  8048 Sep 12 12:23 nvidia_smi

The other file, named nvidia, already existed.

Then I ran cmk -II and cmk -O.
I also restarted Checkmk completely.
But nothing changed in WATO.

In WATO, the connection test also shows the new nvidia data:

(screenshot)

Please don't use ~/share/check_mk/checks/ for this.

The "~/local/share/check_mk/checks/" folder is way better for your own checks. The first folder is not update-safe.
I had a look at the check from the article, and it will not work with current 2.0/2.1 CMK versions; it is written in a very old style.

You can test whether this check has any syntax problems with "cmk --debug -vvII hostname".
If that runs without any problems, then even this very old check should work.
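
Just to illustrate what "very old style" means: a plugin written for the current Check API lives in ~/local/lib/check_mk/base/plugins/agent_based/ and registers itself roughly like the sketch below. Everything in it (field positions, metric names, summary text) is invented for the example; this is not the code from the MKP or the article:

# Illustrative new-style check plugin for the nvidia_smi section.
# Would be placed as ~/local/lib/check_mk/base/plugins/agent_based/nvidia_smi.py
# ASSUMPTION: each agent line ends with temperature, power draw, power limit.
from .agent_based_api.v1 import Metric, Result, Service, State, register
from .agent_based_api.v1.type_defs import CheckResult, DiscoveryResult, StringTable


def parse_nvidia_smi(string_table: StringTable) -> dict:
    section = {}
    for line in string_table:
        if len(line) < 5:
            continue  # skip unexpected lines
        try:
            temperature, power_draw, power_limit = (float(v) for v in line[-3:])
        except ValueError:
            continue
        # item name: "<index> <GPU name>"
        section[f"{line[0]} {line[1]}"] = {
            "temperature": temperature,
            "power_draw": power_draw,
            "power_limit": power_limit,
        }
    return section


register.agent_section(name="nvidia_smi", parse_function=parse_nvidia_smi)


def discover_nvidia_smi(section: dict) -> DiscoveryResult:
    for item in section:
        yield Service(item=item)


def check_nvidia_smi(item: str, section: dict) -> CheckResult:
    gpu = section.get(item)
    if gpu is None:
        return
    yield Result(
        state=State.OK,
        summary=f"Temperature: {gpu['temperature']:.0f}°C, "
                f"Power: {gpu['power_draw']:.1f}/{gpu['power_limit']:.1f} W",
    )
    yield Metric("temp", gpu["temperature"])
    yield Metric("power_usage", gpu["power_draw"], boundaries=(0, gpu["power_limit"]))


register.check_plugin(
    name="nvidia_smi",
    service_name="NVIDIA GPU %s",
    discovery_function=discover_nvidia_smi,
    check_function=check_nvidia_smi,
)

Be aware that you cannot keep an old-style file in checks/ and a new-style plugin with the same check name at the same time; CMK then refuses to load the checks with a "Legacy check plugin still exists" error.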

OK, thanks. I moved it away, now to:

OMD[SAL]:~/local/share/check_mk/checks$ tree
.
└── nvidia_smi

OMD[SAL]:~/local/share/check_mk/checks$ pwd
/omd/sites/SAL/local/share/check_mk/checks

The file looks like:

OMD[SAL]:~/local/share/check_mk/checks$ less nvidia_smi
#!/usr/bin/env python

def nvidia_smi_parse(info):

    data = {}
    for i, line in enumerate(info):
        if len(line) != 4:
            continue # Skip unexpected lines
        pool_name, pm_type, metric, value = line
        item = '%s [%s]' % (pool_name, pm_type)
        if item not in data:
            data[item] = {}

        data[item][metric] = int(value)

    return data


def inventory_nvidia_smi(info):
...

and

OMD[SAL]:~/local/share/check_mk/checks$ ls -l nvidia_smi
-rwxr-xr-x 1 SAL SAL 4015 Sep 12 11:46 nvidia_smi*

Now the debug run:

OMD[SAL]:~/local/share/check_mk/checks$ cmk --debug -vvII salllgpuc01
Discovering services and host labels on: salllgpuc01
salllgpuc01:
+ FETCHING DATA
  Source: SourceType.HOST/FetcherType.TCP
[cpu_tracking] Start [7f8f513fb550]
[TCPFetcher] Fetch with cache settings: DefaultAgentFileCache(base_path=PosixPath('/omd/sites/SAL/tmp/check_mk/cache/salllgpuc01'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
Using data from cache file /omd/sites/SAL/tmp/check_mk/cache/salllgpuc01
Got 246023 bytes data from cache
[TCPFetcher] Use cached data
Closing TCP connection to 10.141.112.10:6556
[cpu_tracking] Stop [7f8f513fb550 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
  Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7f8f513fb880]
[PiggybackFetcher] Fetch with cache settings: NoCache(base_path=PosixPath('/omd/sites/SAL/tmp/check_mk/data_source_cache/piggyback/salllgpuc01'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
[PiggybackFetcher] Execute data source
No piggyback files for 'salllgpuc01'. Skip processing.
No piggyback files for '10.141.112.10'. Skip processing.
[cpu_tracking] Stop [7f8f513fb880 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
  Source: SourceType.HOST/FetcherType.TCP
Loading autochecks from /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Trying to acquire lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Got lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Releasing lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Released lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Stored persisted sections: lnx_packages, lnx_distro, lnx_cpuinfo, dmidecode, lnx_uname, lnx_video, lnx_ip_r, lnx_sysctl, lnx_block_devices
Using persisted section SectionName('lnx_packages')
Using persisted section SectionName('lnx_distro')
Using persisted section SectionName('lnx_cpuinfo')
Using persisted section SectionName('dmidecode')
Using persisted section SectionName('lnx_uname')
Using persisted section SectionName('lnx_video')
Using persisted section SectionName('lnx_ip_r')
Using persisted section SectionName('lnx_sysctl')
Using persisted section SectionName('lnx_block_devices')
  -> Add sections: ['check_mk', 'chrony', 'cifsmounts', 'cpu', 'df', 'diskstat', 'dmidecode', 'ipmi', 'ipmi_discrete', 'job', 'kernel', 'lnx_block_devices', 'lnx_cpuinfo', 'lnx_distro', 'lnx_if', 'lnx_ip_r', 'lnx_packages', 'lnx_sysctl', 'lnx_thermal', 'lnx_uname', 'lnx_video', 'local', 'md', 'mem', 'mounts', 'nfsmounts', 'nvidia_smi', 'postfix_mailq', 'postfix_mailq_status', 'ps_lnx', 'systemd_units', 'tcp_conn_stats', 'uptime', 'vbox_guest']
  Source: SourceType.HOST/FetcherType.PIGGYBACK
No persisted sections loaded
  -> Add sections: []
Received no piggyback data
+ EXECUTING HOST LABEL DISCOVERY
Trying host label discovery with: check_mk, chrony, cifsmounts, cpu, df, diskstat, dmidecode, ipmi, ipmi_discrete, job, kernel, lnx_block_devices, lnx_cpuinfo, lnx_distro, lnx_if, lnx_ip_r, lnx_packages, lnx_sysctl, lnx_thermal, lnx_uname, lnx_video, local, md, mem, mounts, nfsmounts, nvidia_smi, postfix_mailq, postfix_mailq_status, ps_lnx, systemd_units, tcp_conn_stats, uptime, vbox_guest
  cmk/os_family: linux (check_mk)
+ PERFORM HOST LABEL DISCOVERY
Trying to acquire lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
Got lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
Releasing lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
Released lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
+ EXECUTING DISCOVERY PLUGINS (34)
  Trying discovery with: ps, domino_tasks, lnx_if, check_mk_only_from, cpu_threads, chrony, job, mem_win, mounts, kernel_performance, k8s_stats_network, uptime, diskstat, cifsmounts, kernel_util, local, systemd_units_services_summary, ipmi, systemd_units_services, nvidia_smi, mem_vmalloc, kernel, lnx_thermal, df, tcp_conn_stats, docker_container_status_uptime, vbox_guest, postfix_mailq, postfix_mailq_status, check_mk_agent_update, mem_linux, md, nfsmounts, cpu_loads
Trying to acquire lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Got lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Releasing lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Released lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
  1 check_mk_agent_update
  1 chrony
  1 cpu_loads
  1 cpu_threads
  3 df
  1 diskstat
  1 ipmi
  1 kernel_performance
  1 kernel_util
  4 lnx_if
  2 lnx_thermal
  1 mem_linux
  4 mounts
125 nfsmounts
  1 postfix_mailq
  1 postfix_mailq_status
  1 systemd_units_services_summary
  1 tcp_conn_stats
  1 uptime
SUCCESS - Found 152 services, 1 host labels

I can see nvidia_smi, but that's it. This is after cmk -O and cmk -II salllgpuc01.

Meanwhile I managed to get NVIDIA GPU data displayed in WATO.
What I did was remove the NVIDIA GPU MKP from user antonzhelyazkov and replace it with the MKP from smraju.
This worked out of the box (after refreshing the services on the node).

There is a crash report displayed, but that is another story I have to follow up on.

I installed the MKP from smraju too, but when I run check_mk --debug -vvII it shows an error.

OMD[monitor]:~$ check_mk --debug -vvll gb-01
Legacy check plugin still exists for check plugin nvidia_smi.power. Please remove legacy plugin.
Traceback (most recent call last):
  File "/omd/sites/monitor/lib/python3/cmk/base/config.py", line 2314, in _extract_check_plugins
    raise ValueError(
ValueError: Legacy check plugin still exists for check plugin nvidia_smi.power. Please remove legacy plugin.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/omd/sites/monitor/bin/check_mk", line 97, in <module>
    errors = config.load_all_agent_based_plugins(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/monitor/lib/python3/cmk/base/config.py", line 1681, in load_all_agent_based_plugins
    errors.extend(load_checks(get_check_api_context, filelist))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/monitor/lib/python3/cmk/base/config.py", line 1824, in load_checks
    ) + _extract_check_plugins(validate_creation_kwargs=did_compile)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/monitor/lib/python3/cmk/base/config.py", line 2332, in _extract_check_plugins
    raise MKGeneralException(exc) from exc
cmk.utils.exceptions.MKGeneralException: Legacy check plugin still exists for check plugin nvidia_smi.power. Please remove legacy plugin.
OMD[monitor]:~$

Please advise.


I have the same problem - is there a solution out there?
Using Check_MK version 2.2.0p12 CRE

We adapted the plugin to fix the bug.
I attached a file, nvidia_smi.txt.

Rename it to nvidia_smi,

copy it onto the target node to /usr/lib/check_mk_agent/plugins/ and check whether the failure disappears.

Good luck!

Br Fritz
nvidia_smi.txt (5.5 KB)

Thanks for your answer! Unfortunately I get this error on the server now:

OMD[monitoring]:~$ cmk --debug -vvII node03
Legacy check plugin still exists for check plugin nvidia_smi.power. Please remove legacy plugin.
Traceback (most recent call last):
  File "/omd/sites/gpu_monitoring/lib/python3/cmk/base/config.py", line 2299, in _extract_check_plugins
    raise ValueError(
ValueError: Legacy check plugin still exists for check plugin nvidia_smi.power. Please remove legacy plugin.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/omd/sites/gpu_monitoring/bin/cmk", line 97, in <module>
    errors = config.load_all_agent_based_plugins(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/gpu_monitoring/lib/python3/cmk/base/config.py", line 1666, in load_all_agent_based_plugins
    errors.extend(load_checks(get_check_api_context, filelist))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/gpu_monitoring/lib/python3/cmk/base/config.py", line 1809, in load_checks
    ) + _extract_check_plugins(validate_creation_kwargs=did_compile)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/gpu_monitoring/lib/python3/cmk/base/config.py", line 2317, in _extract_check_plugins
    raise MKGeneralException(exc) from exc
cmk.utils.exceptions.MKGeneralException: Legacy check plugin still exists for check plugin nvidia_smi.power. Please remove legacy plugin.

I think the script is working well and I receive the data from the client, but the line mentioning the legacy plugin nvidia_smi.power made me wonder if that could be the root of the problem. I did a brief search with grep on the server, but didn't find any files connected to the legacy plugin.
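
The message itself says a legacy check plugin for nvidia_smi.power still exists somewhere, i.e. an old-style file (for example under ~/local/share/check_mk/checks/, possibly left over from another MKP) next to the new-style plugin. If anyone wants to repeat the search, a quick sketch like this lists every file below the site's local/ directory that mentions nvidia_smi (run as the site user; the paths in the comments are just the usual locations):

#!/usr/bin/env python3
# Quick sketch: list every file in the site's local/ hierarchy that mentions
# nvidia_smi, to spot a leftover legacy check (e.g. local/share/check_mk/checks/)
# sitting next to the new-style plugin (local/lib/check_mk/base/plugins/agent_based/).
import os
from pathlib import Path

local = Path(os.environ["OMD_ROOT"]) / "local"

for path in sorted(local.rglob("*")):
    if not path.is_file():
        continue
    try:
        content = path.read_text(errors="ignore")
    except OSError:
        continue
    if "nvidia_smi" in path.name or "nvidia_smi" in content:
        print(path)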

For information, I use

  • Check_MK version 2.2.0p12 CRE (Raw Edition)
  • nvidia_smi 2.0 plugin from S.M. Raju, as suggested in the link above