Plugin (extension package MKP) for nvidia GPU nodes

Hi!

We’re running v2.0.0p28.cee.

For the first time, I’m trying to set up an MKP (plugin) for nvidia GPU nodes.
For those who are interested, this is the guide I’m following, from TU Kaiserslautern: High Performance Computing with "Elwetritsch" at the University of Kaiserslautern

But whatever I try when following this guide, I can’t get the numbers into WATO (or the GUI).

I assume I’m putting the parts in the wrong places.

What I have done so far:

I uploaded the mkp file. Then:

OMD[SAL]:~$ mkp list
nvidia_smi

On the target node I put this in place:
/usr/lib/check_mk_agent/plugins/nvidia_smi

On the checkmk server I did the same. (Here I’m unsure: is this necessary?)
/usr/lib/check_mk_agent/plugins/nvidia_smi

My test on the target node reveals:

user@salllgpuc01:[~]> check_mk_agent |grep -i nvidia
....
<<<nvidia_smi>>>
0 NVIDIAA100-PCIE-40GB N/A 0 0   28 41.22 250.00
1 NVIDIAA100-PCIE-40GB N/A 0 0   25 36.97 250.00
2 NVIDIAA100-PCIE-40GB N/A 0 0   25 39.14 250.00
3 NVIDIAA100-PCIE-40GB N/A 0 0   26 43.86 250.00

OMD[SAL]:~$ cmk -II salllgpuc01
OMD[SAL]:~$ cmk -O salllgpuc01
OMD[SAL]:~$ cmk -R salllgpuc01

And now I’m uncertain what else needs to be done.

Do I need to bake a new agent and deploy it?

When I bake a new agent, the change is not reflected in the file. That is, I bake the agent and download it locally, but the result is the same baked agent I had before I uploaded the mkp file.

Thanks for any help.

Hi,
did you also do this part?

Also, I assume you made the plugin executable, because you can see it in the agent output, so the plugin on the host works OK.

Best regards,
JF

Yes, I put the mentioned check script on the checkmk server here:

OMD[SAL]:~/share/check_mk/checks$ ls -l
-rw-r--r-- 1 root root  2997 Aug 30 18:07 nvidia
-rw-r--r-- 1 root root  8048 Sep 12 12:23 nvidia_smi

The other file, named nvidia, already existed.

Then I ran cmk -II and cmk -O.
I also restarted checkmk completely.
But nothing has changed in WATO.

In WATO, the connection test also shows the possible new nvidia data.

Please don’t use “~/share/check_mk/checks/”.

“~/local/share/check_mk/checks/” is much better for your own checks. The first folder is not update-safe.
I had a look at the check from the article, and it will not work with current 2.0/2.1 CMK versions. It is written in a very old style.

You can test whether this check has any syntax problems with “cmk --debug -vvII hostname”.
If this runs without any problems, then even this very old check should work.
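
As a rough offline complement to the cmk run, the syntax of a check file can also be tested with plain Python. This is only a minimal sketch: the helper name `syntax_problem` and the two sample strings are made up for illustration, and it only tells you whether the file parses as Python 3 (which Checkmk 2.0 uses). It says nothing about whether the check registers or discovers anything.

```python
import ast


def syntax_problem(source, filename="<check>"):
    """Return None if the source parses as Python 3, else a short description."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return "%s:%s: %s" % (filename, exc.lineno, exc.msg)
    return None


# Hypothetical snippets, not taken from the real check file.
good = "def inventory_example(info):\n    return [(None, None)]\n"
bad = "print 'hello'\n"  # Python 2 print statement, invalid in Python 3

print(syntax_problem(good))
print(syntax_problem(bad))
```

To test a real file, read its contents and pass them in, for example `syntax_problem(open(path).read(), path)`.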

OK, thanks. I moved it, and it is now here:

OMD[SAL]:~/local/share/check_mk/checks$ tree
.
└── nvidia_smi

OMD[SAL]:~/local/share/check_mk/checks$ pwd
/omd/sites/SAL/local/share/check_mk/checks

The file looks like this:

OMD[SAL]:~/local/share/check_mk/checks$ less nvidia_smi
#!/usr/bin/env python

def nvidia_smi_parse(info):

    data = {}
    for i, line in enumerate(info):
        if len(line) != 4:
            continue # Skip unexpected lines
        pool_name, pm_type, metric, value = line
        item = '%s [%s]' % (pool_name, pm_type)
        if item not in data:
            data[item] = {}

        data[item][metric] = int(value)

    return data


def inventory_nvidia_smi(info):
...

and

OMD[SAL]:~/local/share/check_mk/checks$ ls -l nvidia_smi
-rwxr-xr-x 1 SAL SAL 4015 Sep 12 11:46 nvidia_smi*

Now the debug run:

OMD[SAL]:~/local/share/check_mk/checks$ cmk --debug -vvII salllgpuc01
Discovering services and host labels on: salllgpuc01
salllgpuc01:
+ FETCHING DATA
  Source: SourceType.HOST/FetcherType.TCP
[cpu_tracking] Start [7f8f513fb550]
[TCPFetcher] Fetch with cache settings: DefaultAgentFileCache(base_path=PosixPath('/omd/sites/SAL/tmp/check_mk/cache/salllgpuc01'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
Using data from cache file /omd/sites/SAL/tmp/check_mk/cache/salllgpuc01
Got 246023 bytes data from cache
[TCPFetcher] Use cached data
Closing TCP connection to 10.141.112.10:6556
[cpu_tracking] Stop [7f8f513fb550 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
  Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7f8f513fb880]
[PiggybackFetcher] Fetch with cache settings: NoCache(base_path=PosixPath('/omd/sites/SAL/tmp/check_mk/data_source_cache/piggyback/salllgpuc01'), max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
[PiggybackFetcher] Execute data source
No piggyback files for 'salllgpuc01'. Skip processing.
No piggyback files for '10.141.112.10'. Skip processing.
[cpu_tracking] Stop [7f8f513fb880 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
  Source: SourceType.HOST/FetcherType.TCP
Loading autochecks from /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Trying to acquire lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Got lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Releasing lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Released lock on /omd/sites/SAL/var/check_mk/persisted/salllgpuc01
Stored persisted sections: lnx_packages, lnx_distro, lnx_cpuinfo, dmidecode, lnx_uname, lnx_video, lnx_ip_r, lnx_sysctl, lnx_block_devices
Using persisted section SectionName('lnx_packages')
Using persisted section SectionName('lnx_distro')
Using persisted section SectionName('lnx_cpuinfo')
Using persisted section SectionName('dmidecode')
Using persisted section SectionName('lnx_uname')
Using persisted section SectionName('lnx_video')
Using persisted section SectionName('lnx_ip_r')
Using persisted section SectionName('lnx_sysctl')
Using persisted section SectionName('lnx_block_devices')
  -> Add sections: ['check_mk', 'chrony', 'cifsmounts', 'cpu', 'df', 'diskstat', 'dmidecode', 'ipmi', 'ipmi_discrete', 'job', 'kernel', 'lnx_block_devices', 'lnx_cpuinfo', 'lnx_distro', 'lnx_if', 'lnx_ip_r', 'lnx_packages', 'lnx_sysctl', 'lnx_thermal', 'lnx_uname', 'lnx_video', 'local', 'md', 'mem', 'mounts', 'nfsmounts', 'nvidia_smi', 'postfix_mailq', 'postfix_mailq_status', 'ps_lnx', 'systemd_units', 'tcp_conn_stats', 'uptime', 'vbox_guest']
  Source: SourceType.HOST/FetcherType.PIGGYBACK
No persisted sections loaded
  -> Add sections: []
Received no piggyback data
+ EXECUTING HOST LABEL DISCOVERY
Trying host label discovery with: check_mk, chrony, cifsmounts, cpu, df, diskstat, dmidecode, ipmi, ipmi_discrete, job, kernel, lnx_block_devices, lnx_cpuinfo, lnx_distro, lnx_if, lnx_ip_r, lnx_packages, lnx_sysctl, lnx_thermal, lnx_uname, lnx_video, local, md, mem, mounts, nfsmounts, nvidia_smi, postfix_mailq, postfix_mailq_status, ps_lnx, systemd_units, tcp_conn_stats, uptime, vbox_guest
  cmk/os_family: linux (check_mk)
+ PERFORM HOST LABEL DISCOVERY
Trying to acquire lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
Got lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
Releasing lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
Released lock on /omd/sites/SAL/var/check_mk/discovered_host_labels/salllgpuc01.mk
+ EXECUTING DISCOVERY PLUGINS (34)
  Trying discovery with: ps, domino_tasks, lnx_if, check_mk_only_from, cpu_threads, chrony, job, mem_win, mounts, kernel_performance, k8s_stats_network, uptime, diskstat, cifsmounts, kernel_util, local, systemd_units_services_summary, ipmi, systemd_units_services, nvidia_smi, mem_vmalloc, kernel, lnx_thermal, df, tcp_conn_stats, docker_container_status_uptime, vbox_guest, postfix_mailq, postfix_mailq_status, check_mk_agent_update, mem_linux, md, nfsmounts, cpu_loads
Trying to acquire lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Got lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Releasing lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
Released lock on /omd/sites/SAL/var/check_mk/autochecks/salllgpuc01.mk
  1 check_mk_agent_update
  1 chrony
  1 cpu_loads
  1 cpu_threads
  3 df
  1 diskstat
  1 ipmi
  1 kernel_performance
  1 kernel_util
  4 lnx_if
  2 lnx_thermal
  1 mem_linux
  4 mounts
125 nfsmounts
  1 postfix_mailq
  1 postfix_mailq_status
  1 systemd_units_services_summary
  1 tcp_conn_stats
  1 uptime
SUCCESS - Found 152 services, 1 host labels
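
One way to read this output is to compare the plugins listed after "Trying discovery with:" against the plugins that appear in the result list with a service count. The sketch below does that for a shortened excerpt of the log; the function name `plugins_without_services` is made up for illustration, and the line formats are assumed to be exactly as printed above.

```python
import re


def plugins_without_services(log_text):
    """Return plugins that discovery tried but that produced no service."""
    tried = set()
    found = set()
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Trying discovery with:"):
            names = stripped.split(":", 1)[1]
            tried.update(n.strip() for n in names.split(",") if n.strip())
            continue
        # Result lines look like "3 df": a service count, then the plugin name.
        match = re.fullmatch(r"(\d+)\s+(\w+)", stripped)
        if match and int(match.group(1)) > 0:
            found.add(match.group(2))
    return sorted(tried - found)


# Shortened excerpt of the discovery output, not the full log.
sample_log = """\
  Trying discovery with: ps, lnx_if, chrony, nvidia_smi, df, uptime
  1 chrony
  3 df
  4 lnx_if
  1 uptime
"""

print(plugins_without_services(sample_log))
```

In the full log, nvidia_smi is among the plugins that were tried but is missing from the result list, so the check file is loaded but discovers no services.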

I can see nvidia_smi in the output, but that’s it. This is after cmk -O and cmk -II salllgpuc01.
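
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the parse function in the check file above skips every line that does not have exactly 4 fields, while the agent lines have 8 whitespace-separated fields. The code below copies that function and feeds it the agent output from the first post. It assumes the section lines are split on whitespace (the Checkmk default) and that the elided rest of the check file relies on this parse function, which the listing does not show.

```python
def nvidia_smi_parse(info):
    # Copy of the parse function from the check file shown above.
    data = {}
    for i, line in enumerate(info):
        if len(line) != 4:
            continue  # Skip unexpected lines
        pool_name, pm_type, metric, value = line
        item = '%s [%s]' % (pool_name, pm_type)
        if item not in data:
            data[item] = {}
        data[item][metric] = int(value)
    return data


# Agent section content as posted above.
agent_section = """\
0 NVIDIAA100-PCIE-40GB N/A 0 0   28 41.22 250.00
1 NVIDIAA100-PCIE-40GB N/A 0 0   25 36.97 250.00
2 NVIDIAA100-PCIE-40GB N/A 0 0   25 39.14 250.00
3 NVIDIAA100-PCIE-40GB N/A 0 0   26 43.86 250.00
"""

# Assumption: each line is split on whitespace before it reaches the check.
info = [line.split() for line in agent_section.splitlines()]
field_counts = sorted({len(line) for line in info})
parsed = nvidia_smi_parse(info)

print(field_counts)  # [8]
print(parsed)        # {}
```

If the rest of the check builds its inventory from this parsed data, an empty result would match what the debug run shows: nvidia_smi is tried, but no service is found.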

Meanwhile, I managed to get the nvidia GPU data displayed in WATO.
What I did was remove the nvidia GPU mkp file from user antonzhelyazkov and replace it with the mkp from smraju.
This worked out of the box (after refreshing the services on the node).

A crash report is displayed, but that is another story I need to follow up on.