After upgrade to 2.2.0p12: Parsing of section nvidia_smi failed - please submit a crash report!

I recently ran into the same issue after setting up a new server with the latest nvidia drivers. After some investigation, I think the issue stems from changed XML output of the nvidia-smi tool, because in our case only the newer server with the latest nvidia drivers (535.129.03) is affected by this CheckMK parsing/crash issue.

In fact, after some deeper investigation I could fix the issue locally here by applying two modifications:

  1. On the CheckMK server: modify /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py:
--- /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py.orig 2023-10-16 00:58:49.000000000 +0200
+++ /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py	2024-01-11 10:21:04.515439934 +0100
@@ -160,7 +160,7 @@
                         get_text_from_element(gpu.find("power_readings/power_state")),
                     ),
                     power_management=PowerManagement(
-                        get_text_from_element(gpu.find("power_readings/power_management"))
+                        get_text_from_element(gpu.find("power_readings/power_management")) if get_text_from_element(gpu.find("power_readings/power_management")) is not None else "Supported"
                     ),
                     power_draw=get_float_from_element(gpu.find("power_readings/power_draw"), "W"),
                     power_limit=get_float_from_element(gpu.find("power_readings/power_limit"), "W"),
  2. On the CheckMK client: modify /usr/lib/check_mk_agent/plugins/nvidia_smi.sh to look like:
#!/bin/sh
echo "<<<nvidia_smi:sep(9)>>>"
/usr/bin/nvidia-smi -q -x | sed 's/gpu_power_readings/power_readings/'

So the issue seems to be twofold: (1) the nvidia-smi output changed so that the whole <power_management> tag is missing from the power readings XML branch, and (2) the XML branch previously called <power_readings> is now called <gpu_power_readings>. As a result, the output of nvidia-smi -q -x no longer matches what the nvidia_smi.py plugin in the latest CheckMK versions expects. The above modifications should work around both problems. At least they did here.
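
For anyone who would rather handle both XML layouts on the parsing side instead of rewriting the agent output with sed, here is a minimal standalone sketch of the idea. It is not the actual CheckMK plugin code: the helper names find_power_branch and power_management are made up for illustration, the two XML samples are heavily trimmed stand-ins for real nvidia-smi -q -x output, and the "Supported" fallback simply mirrors the default I used in the patch above.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Trimmed, illustrative samples only; real `nvidia-smi -q -x` output has many more fields.
OLD_XML = """<nvidia_smi_log><gpu>
  <power_readings>
    <power_state>P8</power_state>
    <power_management>Supported</power_management>
    <power_draw>20.00 W</power_draw>
  </power_readings>
</gpu></nvidia_smi_log>"""

NEW_XML = """<nvidia_smi_log><gpu>
  <gpu_power_readings>
    <power_state>P8</power_state>
    <power_draw>20.00 W</power_draw>
  </gpu_power_readings>
</gpu></nvidia_smi_log>"""


def find_power_branch(gpu: ET.Element) -> Optional[ET.Element]:
    """Return the power readings branch under either the new or the old tag name."""
    branch = gpu.find("gpu_power_readings")
    # Compare against None explicitly: an Element without children is falsy.
    if branch is None:
        branch = gpu.find("power_readings")
    return branch


def power_management(gpu: ET.Element, default: str = "Supported") -> str:
    """Return the <power_management> text, or `default` if the tag is absent."""
    branch = find_power_branch(gpu)
    if branch is None:
        return default
    text = branch.findtext("power_management")
    return text if text is not None else default


if __name__ == "__main__":
    for label, xml in (("old", OLD_XML), ("new", NEW_XML)):
        gpu = ET.fromstring(xml).find("gpu")
        print(label, find_power_branch(gpu).tag, power_management(gpu))
```

Whether "Supported" is a sensible default when the tag is missing is a judgment call; it only keeps the parser from crashing and says nothing about what the driver actually reports.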