I ran into the same issue recently after setting up a new server with the latest NVIDIA drivers. After some investigation, I think the issue stems from a change in the XML output of the nvidia-smi tool: in our case only the newer server with the latest NVIDIA drivers (535.129.03) is affected by this CheckMK parsing/crash issue.
In fact, after some deeper investigation I could fix the issue locally here by applying two modifications:
- On the CheckMK server: modify
/omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py:
--- /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py.orig 2023-10-16 00:58:49.000000000 +0200
+++ /omd/sites/.../lib/python3/cmk/base/plugins/agent_based/nvidia_smi.py 2024-01-11 10:21:04.515439934 +0100
@@ -160,7 +160,7 @@
get_text_from_element(gpu.find("power_readings/power_state")),
),
power_management=PowerManagement(
- get_text_from_element(gpu.find("power_readings/power_management"))
+ get_text_from_element(gpu.find("power_readings/power_management")) if get_text_from_element(gpu.find("power_readings/power_management")) is not None else "Supported"
),
power_draw=get_float_from_element(gpu.find("power_readings/power_draw"), "W"),
power_limit=get_float_from_element(gpu.find("power_readings/power_limit"), "W"),
- On the CheckMK client: modify
/usr/lib/check_mk_agent/plugins/nvidia_smi.sh to look like:
#!/bin/sh
echo "<<<nvidia_smi:sep(9)>>>"
/usr/bin/nvidia-smi -q -x | sed 's/gpu_power_readings/power_readings/'
So the issue seems to be twofold: 1. In the nvidia-smi output, the whole <power_management> tag is now missing from the power readings XML branch. 2. The XML branch previously called <power_readings> is now called <gpu_power_readings>. As a result, the output of nvidia-smi -q -x no longer matches what the nvidia_smi.py plugin in the latest CheckMK versions expects. The two modifications above work around both changes; at least they did here.
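
For anyone who wants to see the two changes in isolation, here is a minimal standalone sketch of the same fallback logic using only Python's standard library. It is not the CheckMK plugin code: the helper names (find_power_readings, power_management, parse) and the two trimmed XML samples are made up for illustration, and the "Supported" default simply mirrors the workaround in the patch above, which may not be the right value for every GPU.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed samples (NOT real nvidia-smi output):
# OLD_XML uses the old branch name and has <power_management>,
# NEW_XML uses the renamed branch and lacks <power_management>.
OLD_XML = (
    "<nvidia_smi_log><gpu><power_readings>"
    "<power_state>P0</power_state>"
    "<power_management>Supported</power_management>"
    "</power_readings></gpu></nvidia_smi_log>"
)
NEW_XML = (
    "<nvidia_smi_log><gpu><gpu_power_readings>"
    "<power_state>P8</power_state>"
    "</gpu_power_readings></gpu></nvidia_smi_log>"
)


def find_power_readings(gpu):
    """Return the power readings branch under either the old or the new name."""
    for tag in ("power_readings", "gpu_power_readings"):
        branch = gpu.find(tag)
        if branch is not None:
            return branch
    return None


def power_management(gpu, default="Supported"):
    """Return the <power_management> text, or the default if the tag is missing.

    The default is an assumption copied from the workaround patch.
    """
    branch = find_power_readings(gpu)
    if branch is None:
        return default
    elem = branch.find("power_management")
    if elem is None or elem.text is None:
        return default
    return elem.text


def parse(xml_text):
    """Collect the power management value for every <gpu> element."""
    root = ET.fromstring(xml_text)
    return [power_management(gpu) for gpu in root.findall("gpu")]


if __name__ == "__main__":
    print(parse(OLD_XML))
    print(parse(NEW_XML))
```

Handling both branch names on the parsing side like this would make the sed rewrite on the client unnecessary, but that is only a sketch of the idea, not something I have tested inside CheckMK itself.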