Power monitoring with NVIDIA_GPU no longer works with 535.xx

Hello all,

I have been using the NVIDIA-GPU plugin from Smraju for about a month. It is a great plugin that monitors temperature, GPU, fan, power and memory. Since the update to Nvidia driver 535.54.03, monitoring of the power consumption unfortunately no longer works. The plugin would need an update.

It would probably be enough to adjust the agent plugin, though. Here is the error message:

PythonArgs: ['./nvidia-smi']
Traceback:
 Traceback (most recent call last):
   File "/usr/lib/check_mk_agent/plugins/./nvidia-smi", line 86, in <module>
     power_draw = float(gpu
 IndexError: list index out of range

In the code itself the whole thing looks like this:

    power_draw = float(gpu
                    .getElementsByTagName("power_readings")[0]
                    .getElementsByTagName("power_draw")[0]
                    .childNodes[0].data.split()[0])

    power_limit = float(gpu
                    .getElementsByTagName("power_readings")[0]
                    .getElementsByTagName("power_limit")[0]
                    .childNodes[0].data.split()[0])

I have temporarily removed these two readings from the plugin, so all other checks work normally.

Can someone help me with this? :roll_eyes: :grinning: :pray:

It seems that the XML output of the nvidia-smi tool has changed.
Old style (server not yet updated):

<power_readings>
        <power_state>P8</power_state>
        <power_management>Supported</power_management>
        <power_draw>18.44 W</power_draw>
        <power_limit>230.00 W</power_limit>
        <default_power_limit>230.00 W</default_power_limit>
        <enforced_power_limit>230.00 W</enforced_power_limit>
        <min_power_limit>100.00 W</min_power_limit>
        <max_power_limit>230.00 W</max_power_limit>
</power_readings>

New style:

<gpu_power_readings>
        <power_state>P0</power_state>
        <power_draw>28.48 W</power_draw>
        <current_power_limit>70.00 W</current_power_limit>
        <requested_power_limit>70.00 W</requested_power_limit>
        <default_power_limit>70.00 W</default_power_limit>
        <min_power_limit>60.00 W</min_power_limit>
        <max_power_limit>70.00 W</max_power_limit>
</gpu_power_readings>
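
To see why the old lookup fails against the new output, here is a minimal sketch using Python's `xml.dom.minidom`. The XML is a trimmed copy of the new-style fragment above; the surrounding `<gpu>` element and the variable names are assumptions made for illustration and are not taken from the plugin.

```python
from xml.dom import minidom

# Trimmed new-style fragment (535.xx), wrapped in a <gpu> element
# purely for illustration.
NEW_XML = """<gpu>
  <gpu_power_readings>
    <power_state>P0</power_state>
    <power_draw>28.48 W</power_draw>
    <max_power_limit>70.00 W</max_power_limit>
  </gpu_power_readings>
</gpu>"""

gpu = minidom.parseString(NEW_XML).documentElement

# The old tag name does not occur in the new output, so the lookup
# returns an empty node list instead of failing right away.
old_nodes = gpu.getElementsByTagName("power_readings")
print(len(old_nodes))

# Indexing the empty list with [0] is what raises the IndexError
# shown in the traceback.
try:
    old_nodes[0]
    raised = False
except IndexError:
    raised = True
print(raised)
```

`getElementsByTagName` matches the full tag name, so `power_readings` does not match `gpu_power_readings`.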

So if you change the plugin as shown below, all the elements are read under their correct names.
(One small change: max_power_limit is used because power_limit no longer exists.)

#    power_draw = float(gpu
#                    .getElementsByTagName("power_readings")[0]
#                    .getElementsByTagName("power_draw")[0]
#                    .childNodes[0].data.split()[0])

    power_draw = float(gpu
                    .getElementsByTagName("gpu_power_readings")[0]
                    .getElementsByTagName("power_draw")[0]
                    .childNodes[0].data.split()[0])

#    power_limit = float(gpu
#                    .getElementsByTagName("power_readings")[0]
#                    .getElementsByTagName("power_limit")[0]
#                    .childNodes[0].data.split()[0])

    power_limit = float(gpu
                    .getElementsByTagName("gpu_power_readings")[0]
                    .getElementsByTagName("max_power_limit")[0]
                    .childNodes[0].data.split()[0])
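
If the same plugin has to run on hosts with both old and new drivers, one possible approach is to try both tag names instead of hard-coding one. The following is only a sketch under that assumption: the helper `read_watts` and the constant `SECTIONS` are hypothetical names, not part of the plugin, and the sample XML is trimmed from the two fragments above and wrapped in a `<gpu>` element for illustration.

```python
from xml.dom import minidom


def read_watts(gpu, section_tags, value_tags):
    """Return the first readable '<number> W' value as a float, else None.

    Hypothetical helper: tries each section tag, then each value tag.
    """
    for section_tag in section_tags:
        for section in gpu.getElementsByTagName(section_tag):
            for value_tag in value_tags:
                nodes = section.getElementsByTagName(value_tag)
                if not nodes or nodes[0].firstChild is None:
                    continue
                try:
                    return float(nodes[0].firstChild.data.split()[0])
                except (ValueError, IndexError):
                    # Non-numeric or empty content: try the next candidate.
                    continue
    return None


# New section name first, old one as fallback.
SECTIONS = ("gpu_power_readings", "power_readings")

OLD_XML = """<gpu><power_readings>
  <power_draw>18.44 W</power_draw>
  <power_limit>230.00 W</power_limit>
  <max_power_limit>230.00 W</max_power_limit>
</power_readings></gpu>"""

NEW_XML = """<gpu><gpu_power_readings>
  <power_draw>28.48 W</power_draw>
  <max_power_limit>70.00 W</max_power_limit>
</gpu_power_readings></gpu>"""

results = {}
for name, xml in (("old", OLD_XML), ("new", NEW_XML)):
    gpu = minidom.parseString(xml).documentElement
    power_draw = read_watts(gpu, SECTIONS, ("power_draw",))
    # Same choice as above: fall back to max_power_limit when
    # power_limit is missing.
    power_limit = read_watts(gpu, SECTIONS, ("power_limit", "max_power_limit"))
    results[name] = (power_draw, power_limit)
    print(name, power_draw, power_limit)
```

Returning `None` instead of raising keeps the other checks running when a reading is missing; how the plugin should report a missing value is left open here.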

I had to downgrade the driver in the meantime due to problems. As soon as the error is fixed, I’ll test it right away.

Thanks a lot for your help. :pray: :sunglasses:

You are right, these really are small changes. It is working perfectly. Thank you very much :star_struck: