After upgrade to 2.2.0p12: Parsing of section nvidia_smi failed - please submit a crash report!

Hello crowd.
Again, after some googling, I need the help of the experts.

My CheckMK server is Ubuntu 22.04.
CheckMK version:

  • before the upgrade was 2.1.0p33 (Raw)
  • after the upgrade is 2.2.0p12 (Raw)

I am running a few Ubuntu 22.04 servers with Nvidia cards installed. CheckMK agent version was also upgraded from 2.1.0p33 (Raw) to 2.2.0p12 (Raw).

A few months ago, I installed the “nvidia_smi” plugin in /usr/lib/check_mk_agent/plugins and it all worked fine → until the upgrade.

After the upgrade to 2.2.0p12, I get the following error message on all servers with Nvidia cards:

“Parsing of section nvidia_smi failed - please submit a crash report! (Crash-ID: d4d3f7f8-6f20-11ee-9ab3-df4a933188d2)”

To be honest, I have no idea how to approach/fix this problem…

Hello,

I seem to have the same problem. I have a client on which I want to monitor the utilisation of the graphics card in order to detect possible sources of error… Unfortunately, I get the same error as the previous poster.

Greetings,
Marcus

At least a “brother in pain”. :slight_smile:

What really confuses me:
When I call the script directly with /usr/lib/check_mk_agent/plugins/nvidia_smi

I get this output (which, to me, looks pretty okay):

<<<nvidia_smi>>>
smi nvidia gpu_utilization 0
smi nvidia memory_used 0
smi nvidia temperature 39
smi nvidia graphics_clock 210
smi nvidia sm_clock 210
smi nvidia msm_clock 405
smi nvidia gpu_utilization 0
smi nvidia memory_used 0
smi nvidia temperature 41
smi nvidia graphics_clock 210
smi nvidia sm_clock 210
smi nvidia msm_clock 405
smi nvidia gpu_utilization 0
smi nvidia memory_used 0
smi nvidia temperature 40
smi nvidia graphics_clock 210
smi nvidia sm_clock 210
smi nvidia msm_clock 405

Yes,

everything looks good in my case too. That was one of the first things I looked at. Just some more info in my case, since I’m on Windows.

I currently have 2.2.0p16 Raw on it (agent and server) and the script that ships with that version…

Even if I enforce Nvidia Monitoring as a service, it just says that the service has no information, and above it says that it cannot parse the block.

So at first glance it seems to me to be an issue with the Check_MK server, not the agent.

A “.\cmk-agent-ctl.exe dump” gives good information in the nvidia_smi block:

> <<<>>>
> <<<nvidia_smi:sep(9)>>>
> <?xml version="1.0" ?>
> <!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
> <nvidia_smi_log>
>         <timestamp>Wed Dec 20 12:34:32 2023</timestamp>
>         <driver_version>5xx.xx</driver_version>
>         <cuda_version>1x.x</cuda_version>
>         <attached_gpus>1</attached_gpus>
>         <gpu id="00000000:01:00.0">
>                 <product_name>NVIDIA GeForce GTX 1xxx</product_name>
>                 <product_brand>GeForce</product_brand>
>                 <product_architecture>Turing</product_architecture>
>                 <display_mode>Enabled</display_mode>
>                 <display_active>Enabled</display_active>
>                 <persistence_mode>N/A</persistence_mode>
>                 <addressing_mode>N/A</addressing_mode>
>                 <mig_mode>
>                         <current_mig>N/A</current_mig>
>                         <pending_mig>N/A</pending_mig>
>                 </mig_mode>
>                 <mig_devices>
>                         None
>                 </mig_devices>
>                 <accounting_mode>Disabled</accounting_mode>
>                 <accounting_mode_buffer_size>4000</accounting_mode_buffer_size>
>                 <driver_model>
>                         <current_dm>WDDM</current_dm>
>                         <pending_dm>WDDM</pending_dm>
>                 </driver_model>
>                 <serial>N/A</serial>
>                 <uuid>GPU-c9exxx340b3</uuid>
>                 <minor_number>N/A</minor_number>
>                 <vbios_version>95</vbios_version>
>                 <multigpu_board>No</multigpu_board>
>                 <board_id>0x100</board_id>
>                 <board_part_number>N/A</board_part_number>
>                 <gpu_part_number>1Fxxx1</gpu_part_number>
>                 <gpu_fru_part_number>N/A</gpu_fru_part_number>
>                 <gpu_module_id>1</gpu_module_id>
>                 <inforom_version>
>                         <img_version>G001</img_version>
>                         <oem_object>1.1</oem_object>
>                         <ecc_object>N/A</ecc_object>
>                         <pwr_object>N/A</pwr_object>
>                 </inforom_version>
>                 <inforom_bbx_flush>
>                         <latest_timestamp>N/A</latest_timestamp>
>                         <latest_duration>N/A</latest_duration>
>                 </inforom_bbx_flush>
>                 <gpu_operation_mode>
>                         <current_gom>N/A</current_gom>
>                         <pending_gom>N/A</pending_gom>
>                 </gpu_operation_mode>
>                 <gsp_firmware_version>N/A</gsp_firmware_version>
>                 <c2c_mode>N/A</c2c_mode>
>                 <gpu_virtualization_mode>
>                         <virtualization_mode>None</virtualization_mode>
>                         <host_vgpu_mode>N/A</host_vgpu_mode>
>                 </gpu_virtualization_mode>
>                 <gpu_reset_status>
>                         <reset_required>No</reset_required>
>                         <drain_and_reset_recommended>N/A</drain_and_reset_recommended>
>                 </gpu_reset_status>
>                 <ibmnpu>
>                         <relaxed_ordering_mode>N/A</relaxed_ordering_mode>
>                 </ibmnpu>
>                 <pci>
>                         <pci_bus>01</pci_bus>
>                         <pci_device>00</pci_device>
>                         <pci_domain>0000</pci_domain>
>                         <pci_device_id>1DE</pci_device_id>
>                         <pci_bus_id>0000.0</pci_bus_id>
>                         <pci_sub_system_id>1xxxDE</pci_sub_system_id>
>                         <pci_gpu_link_info>
>                                 <pcie_gen>
>                                         <max_link_gen>3</max_link_gen>
>                                         <current_link_gen>3</current_link_gen>
>                                         <device_current_link_gen>3</device_current_link_gen>
>                                         <max_device_link_gen>3</max_device_link_gen>
>                                         <max_host_link_gen>5</max_host_link_gen>
>                                 </pcie_gen>
>                                 <link_widths>
>                                         <max_link_width>16x</max_link_width>
>                                         <current_link_width>16x</current_link_width>
>                                 </link_widths>
>                         </pci_gpu_link_info>
>                         <pci_bridge_chip>
>                                 <bridge_chip_type>N/A</bridge_chip_type>
>                                 <bridge_chip_fw>N/A</bridge_chip_fw>
>                         </pci_bridge_chip>
>                         <replay_counter>0</replay_counter>
>                         <replay_rollover_counter>0</replay_rollover_counter>
>                         <tx_util>0 KB/s</tx_util>
>                         <rx_util>0 KB/s</rx_util>
>                         <atomic_caps_inbound>N/A</atomic_caps_inbound>
>                         <atomic_caps_outbound>N/A</atomic_caps_outbound>
>                 </pci>
>                 <fan_speed>20 %</fan_speed>
>                 <performance_state>P2</performance_state>
>                 <clocks_event_reasons>
>                         <clocks_event_reason_gpu_idle>Active</clocks_event_reason_gpu_idle>
>                         <clocks_event_reason_applications_clocks_setting>Not Active</clocks_event_reason_applications_clocks_setting>
>                         <clocks_event_reason_sw_power_cap>Not Active</clocks_event_reason_sw_power_cap>
>                         <clocks_event_reason_hw_slowdown>Not Active</clocks_event_reason_hw_slowdown>
>                         <clocks_event_reason_hw_thermal_slowdown>Not Active</clocks_event_reason_hw_thermal_slowdown>
>                         <clocks_event_reason_hw_power_brake_slowdown>Not Active</clocks_event_reason_hw_power_brake_slowdown>
>                         <clocks_event_reason_sync_boost>Not Active</clocks_event_reason_sync_boost>
>                         <clocks_event_reason_sw_thermal_slowdown>Not Active</clocks_event_reason_sw_thermal_slowdown>
>                         <clocks_event_reason_display_clocks_setting>Not Active</clocks_event_reason_display_clocks_setting>
>                 </clocks_event_reasons>
>                 <fb_memory_usage>
>                         <total>4096 MiB</total>
>                         <reserved>139 MiB</reserved>
>                         <used>268 MiB</used>
>                         <free>3688 MiB</free>
>                 </fb_memory_usage>
>                 <bar1_memory_usage>
>                         <total>256 MiB</total>
>                         <used>2 MiB</used>
>                         <free>254 MiB</free>
>                 </bar1_memory_usage>
>                 <cc_protected_memory_usage>
>                         <total>N/A</total>
>                         <used>N/A</used>
>                         <free>N/A</free>
>                 </cc_protected_memory_usage>
>                 <compute_mode>Default</compute_mode>
>                 <utilization>
>                         <gpu_util>0 %</gpu_util>
>                         <memory_util>10 %</memory_util>
>                         <encoder_util>0 %</encoder_util>
>                         <decoder_util>0 %</decoder_util>
>                         <jpeg_util>0 %</jpeg_util>
>                         <ofa_util>0 %</ofa_util>
>                 </utilization>
>                 <encoder_stats>
>                         <session_count>0</session_count>
>                         <average_fps>0</average_fps>
>                         <average_latency>0</average_latency>
>                 </encoder_stats>
>                 <fbc_stats>
>                         <session_count>0</session_count>
>                         <average_fps>0</average_fps>
>                         <average_latency>0</average_latency>
>                 </fbc_stats>
>                 <ecc_mode>
>                         <current_ecc>N/A</current_ecc>
>                         <pending_ecc>N/A</pending_ecc>
>                 </ecc_mode>
>                 <ecc_errors>
>                         <volatile>
>                                 <sram_correctable>N/A</sram_correctable>
>                                 <sram_uncorrectable>N/A</sram_uncorrectable>
>                                 <dram_correctable>N/A</dram_correctable>
>                                 <dram_uncorrectable>N/A</dram_uncorrectable>
>                         </volatile>
>                         <aggregate>
>                                 <sram_correctable>N/A</sram_correctable>
>                                 <sram_uncorrectable>N/A</sram_uncorrectable>
>                                 <dram_correctable>N/A</dram_correctable>
>                                 <dram_uncorrectable>N/A</dram_uncorrectable>
>                         </aggregate>
>                 </ecc_errors>
>                 <retired_pages>
>                         <multiple_single_bit_retirement>
>                                 <retired_count>N/A</retired_count>
>                                 <retired_pagelist>N/A</retired_pagelist>
>                         </multiple_single_bit_retirement>
>                         <double_bit_retirement>
>                                 <retired_count>N/A</retired_count>
>                                 <retired_pagelist>N/A</retired_pagelist>
>                         </double_bit_retirement>
>                         <pending_blacklist>N/A</pending_blacklist>
>                         <pending_retirement>N/A</pending_retirement>
>                 </retired_pages>
>                 <remapped_rows>N/A</remapped_rows>
>                 <temperature>
>                         <gpu_temp>29 C</gpu_temp>
>                         <gpu_temp_tlimit>N/A</gpu_temp_tlimit>
>                         <gpu_temp_max_threshold>101 C</gpu_temp_max_threshold>
>                         <gpu_temp_slow_threshold>98 C</gpu_temp_slow_threshold>
>                         <gpu_temp_max_gpu_threshold>95 C</gpu_temp_max_gpu_threshold>
>                         <gpu_target_temperature>83 C</gpu_target_temperature>
>                         <memory_temp>N/A</memory_temp>
>                         <gpu_temp_max_mem_threshold>N/A</gpu_temp_max_mem_threshold>
>                 </temperature>
>                 <supported_gpu_target_temp>
>                         <gpu_target_temp_min>65 C</gpu_target_temp_min>
>                         <gpu_target_temp_max>93 C</gpu_target_temp_max>
>                 </supported_gpu_target_temp>
>                 <gpu_power_readings>
>                         <power_state>P2</power_state>
>                         <power_draw>N/A</power_draw>
>                         <current_power_limit>75.00 W</current_power_limit>
>                         <requested_power_limit>75.00 W</requested_power_limit>
>                         <default_power_limit>75.00 W</default_power_limit>
>                         <min_power_limit>45.00 W</min_power_limit>
>                         <max_power_limit>75.00 W</max_power_limit>
>                 </gpu_power_readings>
>                 <gpu_memory_power_readings>
>                         <power_draw>N/A</power_draw>
>                 </gpu_memory_power_readings>
>                 <module_power_readings>
>                         <power_state>P2</power_state>
>                         <power_draw>N/A</power_draw>
>                         <current_power_limit>N/A</current_power_limit>
>                         <requested_power_limit>N/A</requested_power_limit>
>                         <default_power_limit>N/A</default_power_limit>
>                         <min_power_limit>N/A</min_power_limit>
>                         <max_power_limit>N/A</max_power_limit>
>                 </module_power_readings>
>                 <clocks>
>                         <graphics_clock>300 MHz</graphics_clock>
>                         <sm_clock>300 MHz</sm_clock>
>                         <mem_clock>5750 MHz</mem_clock>
>                         <video_clock>540 MHz</video_clock>
>                 </clocks>
>                 <applications_clocks>
>                         <graphics_clock>N/A</graphics_clock>
>                         <mem_clock>N/A</mem_clock>
>                 </applications_clocks>
>                 <default_applications_clocks>
>                         <graphics_clock>N/A</graphics_clock>
>                         <mem_clock>N/A</mem_clock>
>                 </default_applications_clocks>
>                 <deferred_clocks>
>                         <mem_clock>N/A</mem_clock>
>                 </deferred_clocks>
>                 <max_clocks>
>                         <graphics_clock>2100 MHz</graphics_clock>
>                         <sm_clock>2100 MHz</sm_clock>
>                         <mem_clock>6001 MHz</mem_clock>
>                         <video_clock>1950 MHz</video_clock>
>                 </max_clocks>
>                 <max_customer_boost_clocks>
>                         <graphics_clock>N/A</graphics_clock>
>                 </max_customer_boost_clocks>
>                 <clock_policy>
>                         <auto_boost>N/A</auto_boost>
>                         <auto_boost_default>N/A</auto_boost_default>
>                 </clock_policy>
>                 <voltage>
>                         <graphics_volt>N/A</graphics_volt>
>                 </voltage>
>                 <fabric>
>                         <state>N/A</state>
>                         <status>N/A</status>
>                 </fabric>
>                 <supported_clocks>
>                         <supported_mem_clock>
>                                 <value>6001 MHz</value>
>                                 <supported_graphics_clock>2100 MHz</supported_graphics_clock>
>                                 ...
>                                 <supported_graphics_clock>315 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>300 MHz</supported_graphics_clock>
>                         </supported_mem_clock>
>                         <supported_mem_clock>
>                                 <value>5751 MHz</value>
>                                 <supported_graphics_clock>2100 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>2085 MHz</supported_graphics_clock>
>                                 ...
>                                 <supported_graphics_clock>315 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>300 MHz</supported_graphics_clock>
>                         </supported_mem_clock>
>                         <supported_mem_clock>
>                                 <value>5001 MHz</value>
>                                 <supported_graphics_clock>2100 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>2085 MHz</supported_graphics_clock>
>                                 ...
>                                 <supported_graphics_clock>315 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>300 MHz</supported_graphics_clock>
>                         </supported_mem_clock>
>                         <supported_mem_clock>
>                                 <value>810 MHz</value>
>                                 <supported_graphics_clock>2100 MHz</supported_graphics_clock>
>                                 ...
>                                 <supported_graphics_clock>315 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>300 MHz</supported_graphics_clock>
>                         </supported_mem_clock>
>                         <supported_mem_clock>
>                                 <value>405 MHz</value>
>                                 <supported_graphics_clock>645 MHz</supported_graphics_clock>
>                                 ...
>                                 <supported_graphics_clock>315 MHz</supported_graphics_clock>
>                                 <supported_graphics_clock>300 MHz</supported_graphics_clock>
>                         </supported_mem_clock>
>                 </supported_clocks>
>                 <processes>
>                         <process_info>
>                                 <gpu_instance_id>N/A</gpu_instance_id>
>                                 <compute_instance_id>N/A</compute_instance_id>
>                                 <pid>2068</pid>
>                                 <type>C+G</type>
>                                 <process_name>C:\Windows\System32\LogonUI.exe</process_name>
>                                 <used_memory>N/A</used_memory>
>                         </process_info>
>                         <process_info>
>                                 <gpu_instance_id>N/A</gpu_instance_id>
>                                 <compute_instance_id>N/A</compute_instance_id>
>                                 <pid>2184</pid>
>                                 <type>C+G</type>
>                                 <process_name>C:\Windows\System32\dwm.exe</process_name>
>                                 <used_memory>N/A</used_memory>
>                         </process_info>
>                 </processes>
>                 <accounted_processes>
>                 </accounted_processes>
>         </gpu>
> 
> </nvidia_smi_log>

But I just found something; it must have been this change:

It was in version 2.2.0p11, which fits with the fact that it no longer worked for you in p12. You must have skipped one or two patch versions; I jumped from p6 to p16.

It also says that a rediscovery must be carried out… I only created the host yesterday, so that doesn’t help in my case; maybe for you?

IMHO that was about the PCI bus ID not being shown in the output. I am pretty sure that is/was not the problem… But thanks anyway.

I just ran into the same issue recently after having set up a new server with the latest Nvidia drivers. After some investigation, the issue seems to stem from changed XML output of the nvidia-smi tool, because in our case only the newer servers with the latest Nvidia drivers (535.129.03) are affected by this Checkmk parsing/crash issue.

In fact, after some deeper investigation, I was able to fix the issue locally by applying two modifications:

  1. On the CheckMK server: modify /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py:
--- /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py.orig 2023-10-16 00:58:49.000000000 +0200
+++ /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py	2024-01-11 10:21:04.515439934 +0100
@@ -160,7 +160,7 @@
                         get_text_from_element(gpu.find("power_readings/power_state")),
                     ),
                     power_management=PowerManagement(
-                        get_text_from_element(gpu.find("power_readings/power_management"))
+                        get_text_from_element(gpu.find("power_readings/power_management")) if get_text_from_element(gpu.find("power_readings/power_management")) is not None else "Supported"
                     ),
                     power_draw=get_float_from_element(gpu.find("power_readings/power_draw"), "W"),
                     power_limit=get_float_from_element(gpu.find("power_readings/power_limit"), "W"),
  2. On the CheckMK client: modify /usr/lib/check_mk_agent/plugins/nvidia_smi.sh to look like this:
#!/bin/sh
echo "<<<nvidia_smi:sep(9)>>>"
/usr/bin/nvidia-smi -q -x | sed 's/gpu_power_readings/power_readings/'

So the issue seems to be twofold: 1. the nvidia-smi output was changed so that the whole <power_management> tag is missing/not available under the power readings XML branch, and 2. the XML branch previously called <power_readings> is now named <gpu_power_readings>. Therefore, the output of nvidia-smi -q -x no longer matches the expectations of the nvidia_smi.py plugin in the latest CheckMK versions. However, the above modifications should solve these issues; at least they did here.
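A quick way to check whether a particular host is affected (an illustrative one-liner, assuming nvidia-smi is at /usr/bin/nvidia-smi):

# A non-zero count means the driver already emits the renamed
# <gpu_power_readings> element, i.e. the sed rewrite above is needed.
/usr/bin/nvidia-smi -q -x | grep -c '<gpu_power_readings>'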

Hey - thanks for the info!
I will try asap.

Just out of curiosity - which CheckMK version are you running ?

Always the latest one :wink: Thus, 2.2.0p17…

Well - at the time of writing that would be 2.2.0p18 :wink:
But thank you anyway!

Unfortunately I don’t have a /usr/lib/check_mk_agent/plugins/nvidia_smi.sh file.
But I do have a /usr/lib/check_mk_agent/plugins/nvidia_smi file (size 4142 bytes).

And in the /usr/lib/check_mk_agent/plugins/nvidia_smi file, there is no section like this:

echo "<<<nvidia_smi:sep(9)>>>"
/usr/bin/nvidia-smi -q -x | sed 's/gpu_power_readings/power_readings/'

Not even close…

Can you please do me a favor and post the full content of your /usr/lib/check_mk_agent/plugins/nvidia_smi.sh file?

That is already the complete content of the nvidia_smi.sh script; for reference, it is repeated below. So simply compare what your nvidia_smi script is doing and, if it calls nvidia-smi, add the sed call after it…
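(These are the same three lines as in the post above; the nvidia-smi path is assumed to be /usr/bin/nvidia-smi.)

#!/bin/sh
# Complete nvidia_smi.sh agent plugin: print the section header, then
# rewrite the renamed XML element back to the name the parser expects.
echo "<<<nvidia_smi:sep(9)>>>"
/usr/bin/nvidia-smi -q -x | sed 's/gpu_power_readings/power_readings/'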

Hmmm…

Thank you - but I don’t get it…

I copied the /opt/omd/sites/<mysite>/local/share/check_mk/agents/plugins/nvidia_smi file from the CheckMK server to /usr/lib/check_mk_agent/plugins/nvidia_smi on the CheckMK client.

Isn’t that what we are supposed to do?

And that file has the following content:

#!/usr/bin/python3
# -*- encoding: utf-8; py-indent-offset: 4 -*-
# +------------------------------------------------------------------+
# |             ____ _               _        __  __ _  __           |
# |            / ___| |__   ___  ___| | __   |  \/  | |/ /           |
# |           | |   | '_ \ / _ \/ __| |/ /   | |\/| | ' /            |
# |           | |___| | | |  __/ (__|   <    | |  | | . \            |
# |            \____|_| |_|\___|\___|_|\_\___|_|  |_|_|\_\           |
# |                                                                  |
# | Copyright Mathias Kettner 2012             mk@mathias-kettner.de |
# +------------------------------------------------------------------+
#
# This file is part of Check_MK.
# The official homepage is at http://mathias-kettner.de/check_mk.
#
# check_mk is free software;  you can redistribute it and/or modify it
# under the  terms of the  GNU General Public License  as published by
# the Free Software Foundation in version 2.  check_mk is  distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY;  with-
# out even the implied warranty of  MERCHANTABILITY  or  FITNESS FOR A
# PARTICULAR PURPOSE. See the  GNU General Public License for more de-
# ails.  You should have  received  a copy of the  GNU  General Public
# License along with GNU Make; see the file  COPYING.  If  not,  write
# to the Free Software Foundation, Inc., 51 Franklin St,  Fifth Floor,
# Boston, MA 02110-1301 USA.

#######################################
# Check developed by
#######################################
# Dr. Markus Hillenbrand
# University of Kaiserslautern, Germany
# hillenbr@rhrk.uni-kl.de
#######################################
#######################################
# Script modified by S M Raju
#######################################

# the inventory functions

def inventory_nvidia_smi_fan(info):
    inventory = []
    for line in info:
        if line[2] != 'N/A':
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_gpuutil(info):
    inventory = []
    for line in info:
        if line[3] != 'N/A':
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_memutil(info):
    inventory = []
    for line in info:
        if line[4] != 'N/A':
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_errors1(info):
    inventory = []
    for line in info:
        if line[5] != 'N/A':
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory
def inventory_nvidia_smi_errors2(info):
    inventory = []
    for line in info:
        if line[6] != 'N/A':
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_temp(info):
    inventory = []
    for line in info:
        if line[7] != 'N/A':
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_power(info):
    inventory = []
    for line in info:
        if line[8] != 'N/A' and line[9] != "N/A":
           inventory.append( ("GPU"+line[0], "", None) )
    return inventory

# the check functions

def check_nvidia_smi_fan(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           value = int(line[2])
           perfdata = [('fan', value, 90, 95, 0, 100 )]
           if value > 95:
              return (2, "CRITICAL - %s fan speed is %d%%" % (line[1], value), perfdata)
           elif value > 90:
              return (1, "WARNING - %s fan speed is %d%%" % (line[1], value), perfdata)
           else:
              return (0, "OK - %s fan speed is %d%%" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_gpuutil(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           value = int(line[3])
           perfdata = [('gpuutil', value, 100, 100, 0, 100 )]
           return (0, "OK - %s utilization is %s%%" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_memutil(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           value = int(line[4])
           perfdata = [('memutil', value, 100, 100, 0, 100 )]
           if value > 95:
              return (2, "CRITICAL - %s memory utilization is %d%%" % (line[1], value), perfdata)
           elif value > 90:
              return (1, "WARNING - %s memory utilization is %d%%" % (line[1], value), perfdata)
           else:
              return (0, "OK - %s memory utilization is %d%%" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors1(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           value = int(line[5])
           if value > 500:
              return (2, "CRITICAL - %s single bit error counter is %d" % (line[1], value))
           if value > 100:
              return (1, "WARNING - %s single bit error counter is %d" % (line[1], value))
           else:
              return (0, "OK - %s single bit error counter is %d" % (line[1], value))
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)
def check_nvidia_smi_errors2(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           value = int(line[6])
           if value > 500:
              return (2, "CRITICAL - %s double bit error counter is %d" % (line[1], value))
           if value > 100:
              return (1, "WARNING - %s double bit error counter is %d" % (line[1], value))
           else:
              return (0, "OK - %s double bit error counter is %d" % (line[1], value))
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_temp(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           value = int(line[7])
           perfdata = [('temp', value, 80, 90, 0, 95 )]
           if value > 90:
              return (2, "CRITICAL - %s temperature is %d°C" % (line[1], value), perfdata)
           elif value > 80:
              return (1, "WARNING - %s temperature is %d°C" % (line[1], value), perfdata)
           else:
              return (0, "OK - %s temperature is %d°C" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_power(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
           draw = float(line[8])
           limit = float(line[9])
           value = draw * 100.0 / limit
           perfdata = [('power', draw, limit * 0.8, limit * 0.9, 0, limit )]
           if value > 90:
              return (2, "CRITICAL - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
           elif value > 80:
              return (1, "WARNING - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
           else:
              return (0, "OK - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

# declare the check to Check_MK

check_info['nvidia_smi.fan'] = {
    "check_function" :      check_nvidia_smi_fan,
    "inventory_function" :  inventory_nvidia_smi_fan,
    "service_description" :  "%s fan speed",
    "has_perfdata" :        True,
    "group" :               "nvidia_smi"
}
check_info['nvidia_smi.gpuutil'] = {
    "check_function" :      check_nvidia_smi_gpuutil,
    "inventory_function" :  inventory_nvidia_smi_gpuutil,
    "service_description" :  "%s utilization",
    "has_perfdata" :        True,
    "group" :               "nvidia_smi"
}
check_info['nvidia_smi.memutil'] = {
    "check_function" :      check_nvidia_smi_memutil,
    "inventory_function" :  inventory_nvidia_smi_memutil,
    "service_description" :  "%s memory",
    "has_perfdata" :        True,
    "group" :               "nvidia_smi"
}
check_info['nvidia_smi.temp'] = {
    "check_function" :      check_nvidia_smi_temp,
    "inventory_function" :  inventory_nvidia_smi_temp,
    "service_description" :  "%s temperature",
    "has_perfdata" :        True,
    "group" :               "nvidia_smi"
}
check_info['nvidia_smi.power'] = {
    "check_function" :      check_nvidia_smi_power,
    "inventory_function" :  inventory_nvidia_smi_power,
    "service_description" :  "%s power",
    "has_perfdata" :        True,
    "group" :               "nvidia_smi"
}

That’s of course not what you should do. You need two modifications/files as listed in my initial post: one modification to the Checkmk installation files on the server, and then the nvidia_smi.sh script on the client with only the lines that are posted. However, I wonder how you have currently set up NVIDIA GPU monitoring if you do not have a local file on the client at the moment.

Sorry - there was a typo in my last post (top lines). I fixed it.
Of course I had a local file on the client.

This is just for informational purposes:

root@nvidia-server:/# nvidia-smi

Fri Jan 12 16:19:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:05:00.0 Off |                  Off |
|  0%   35C    P8              14W / 450W |   5434MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000               Off | 00000000:09:00.0 Off |                  Off |
| 41%   35C    P8              18W / 140W |   5209MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000               Off | 00000000:0A:00.0 Off |                  Off |
| 41%   34C    P8              17W / 140W |   6639MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    977458      C   python                                     5422MiB |
|    1   N/A  N/A    977459      C   python                                     5198MiB |
|    2   N/A  N/A      4222      C   python                                     6628MiB |
+---------------------------------------------------------------------------------------+

Just to save someone else some time: if you’re trying to get this to work on a Windows host instead of Linux, the client plugin .ps1 file (I added it to C:\ProgramData\checkmk\agent\plugins, downloaded from http://checkmk-hostname/monitoring/check_mk/agents/windows/plugins/) should look like this:

$CMK_VERSION = "2.2.0p16"

Write-Host "<<<nvidia_smi:sep(9)>>>"
& "C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q -x | %{$_ -replace "gpu_power_readings","power_readings"}

The main change was using the -replace operator instead of sed. Also, make sure you unblock the file after you download it: right-click it → Properties, then check the box to unblock and hit Apply. Rescan the services for the host and they should come right up.

Did anyone get this solved? I am trying to use the Nvidia checks for the first time. I’m at Checkmk version 2.2.0p22 and get this error at the inventory scan:

Starting job...
WARNING: Parsing of section nvidia_smi failed - please submit a crash report! (Crash-ID: 308e91a8-f175-11ee-97b0-19d5cc33223e)
Completed.

When I run the plugin itself (Windows machine, C:\ProgramData\checkmk\agent\plugins\nvidia_smi.ps1) or download the agent output, I get a lot of lines from the plugin like this; the whole nvidia-smi block is 1702 lines long:

<<<nvidia_smi:sep(9)>>>
<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
<nvidia_smi_log>
	<timestamp>Wed Apr  3 06:40:02 2024</timestamp>
	<driver_version>551.52</driver_version>
	<cuda_version>12.4</cuda_version>
	<attached_gpus>2</attached_gpus>
	<gpu id="00000000:15:00.0">
		<product_name>Quadro RTX 5000</product_name>
		<product_brand>Quadro RTX</product_brand>
		<product_architecture>Turing</product_architecture>
		<display_mode>Disabled</display_mode>
		<display_active>Disabled</display_active>
		<persistence_mode>N/A</persistence_mode>
		<addressing_mode>N/A</addressing_mode>
		<mig_mode>
			<current_mig>N/A</current_mig>
			<pending_mig>N/A</pending_mig>
		</mig_mode>
		<mig_devices>
			None
		</mig_devices>
		<accounting_mode>Disabled</accounting_mode>
		<accounting_mode_buffer_size>4000</accounting_mode_buffer_size>
		<driver_model>
			<current_dm>WDDM</current_dm>
			<pending_dm>WDDM</pending_dm>
		</driver_model>

Does anyone have a suggestion on how to fix this? Thank you!

The fix is in the post from @DoctorSchnell before your post.
The only important thing is replacing “gpu_power_readings” with “power_readings”.

Unfortunately it didn’t fix the issue for me. The agent output now has the correct block

<power_readings>
			<power_state>P8</power_state>
			<power_draw>18.32 W</power_draw>
			<current_power_limit>230.00 W</current_power_limit>
			<requested_power_limit>230.00 W</requested_power_limit>
			<default_power_limit>230.00 W</default_power_limit>
			<min_power_limit>125.00 W</min_power_limit>
			<max_power_limit>230.00 W</max_power_limit>
		</power_readings>

but the inventory check still reports the issue as described before.