Error on special agent agent_proxmox_ve when node is missing in cluster

mensinck · January 14, 2023, 2:15pm

CMK version: 2.1.0p19.cre
OS version: proxmox VE 7.3 / debian buster

Error message: [special_proxmox_ve] Agent exited with code 1: Caught unhandled KeyError('timezone') in /omd/sites/mc/lib/python3/cmk/special_agents/utils/agent_common.py:135(!!)

Output of "cmk --debug -vvn hostname": [agent] Success, [special_proxmox_ve] Agent exited with code 1: Caught unhandled KeyError('timezone') in /omd/sites/mc/lib/python3/cmk/special_agents/utils/agent_common.py:135(!!), execution time 45.6 sec | execution_time=45.610 user_time=0.220 system_time=0.040 children_user_time=1.080 children_system_time=0.170 cmk_time_agent=8.760 cmk_time_ds=35.330 (If it is a problem with checks or plugins)

The special agent agent_proxmox_ve produces this error as soon as one node of the cluster is missing. When all nodes are up and running, the plugin works as expected.

The error can be reproduced by calling the plugin standalone on the CLI:

INFO 2023-01-14 15:08:28 root: running file /omd/sites/mc/lib/python3/cmk/special_agents/utils/agent_common.py
INFO 2023-01-14 15:08:28 root: using Python interpreter v3.9.10.final.0 at /omd/sites/mc/bin/python3
INFO 2023-01-14 15:08:28 agent_proxmox_ve: Establish connection to Proxmox VE host '***********'
INFO 2023-01-14 15:08:28 agent_proxmox_ve: Fetch general cluster and node information..
INFO 2023-01-14 15:08:48 agent_proxmox_ve: Fetch and process backup logs..
INFO 2023-01-14 15:08:48 agent_proxmox_ve: BackupTask('vzdump', t='2023.01.14-00:00:02', vms=('210', '1211102'))
... some more "BackupTask" lines ...

Then the plugin terminates with:

Traceback (most recent call last):
  File "/opt/omd/versions/2.1.0p19.cre/share/check_mk/agents/special/./agent_proxmox_ve", line 10, in <module>
    main()
  File "/omd/sites/mc/lib/python3/cmk/special_agents/agent_proxmox_ve.py", line 883, in main
    special_agent_main(parse_arguments, agent_proxmox_ve_main)
  File "/omd/sites/mc/lib/python3/cmk/special_agents/utils/agent_common.py", line 161, in special_agent_main
    _special_agent_main_core(parse_arguments, main_fn, argv or sys.argv[1:])
  File "/omd/sites/mc/lib/python3/cmk/special_agents/utils/agent_common.py", line 135, in _special_agent_main_core
    main_fn(args)
  File "/omd/sites/mc/lib/python3/cmk/special_agents/agent_proxmox_ve.py", line 533, in agent_proxmox_ve_main
    node_timezones[node["node"]] = node["time"]["timezone"]
KeyError: 'timezone'
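
The traceback indicates that `node["time"]` exists for the affected node but has no `"timezone"` key. A minimal sketch of that failure follows. The data layout, the node names `pve1` and `pve2`, and the empty `"time"` dict for the offline node are assumptions for illustration, not the real API response:

```python
# Hypothetical stand-in for the data the agent collects. The node names and
# the empty "time" dict for the offline node are assumptions for illustration.
data = {
    "nodes": [
        {"node": "pve1", "time": {"timezone": "Europe/Berlin"}},
        {"node": "pve2", "time": {}},  # offline node: no timezone reported
    ]
}


def collect_timezones_unguarded(data):
    """Mirror the unguarded lookup shown in the traceback."""
    node_timezones = {}
    for node in data["nodes"]:
        node_timezones[node["node"]] = node["time"]["timezone"]
    return node_timezones


try:
    collect_timezones_unguarded(data)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```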

Since we have a cluster with some "standby" nodes on Proxmox, we will "hopefully" always have nodes down in the cluster. Unfortunately, the plugin is not usable in this scenario. Is there any option to allow nodes to be down in clusters?

Regards and thanks for any advice
Lukas

hafnix · January 23, 2023, 12:34pm

Hi, I have exactly the same problem, also with CMK 2.1.0p19 and Proxmox VE 7.3-4. The error was there also with 2.1.0p18.

I am running a 2-node-cluster with a qdevice on a RaspberryPi as third quorum vote. As soon as I switch off one of the PVE nodes, the “[special_proxmox_ve] Agent” dies with KeyError(‘timezone’). Same error-message as above. If the 2nd node is up again, there is no problem anymore.

CMK is running in a Debian-VM on one of the nodes. I checked the timezone-settings of all my nodes and devices, it’s set to Europe/Berlin everywhere.

If it’s not a bug, then maybe some misconfiguration on my side? I followed exactly the steps in https://checkmk.com/de/blog/proxmox-monitoring

Thanks for any help and kind regards,
Michael

leo · January 23, 2023, 12:52pm

Hi, we had the same problem.
Seems to be an error in agent_proxmox_ve.py reading out the timezone attribute for a host which is currently down.
The fix seems to be to put an additional if condition to check for the timezone attribute before getting the value.
I created a small patch to fix the issue:

--- /omd/sites/cmk/lib/python3/cmk/special_agents/agent_proxmox_ve.py	2023-01-23 11:30:50.516960143 +0100
+++ /omd/sites/cmk/lib/python3/cmk/special_agents/agent_proxmox_ve.py	2023-01-23 11:31:22.000314799 +0100
@@ -528,12 +528,13 @@
     snapshot_data = {}
 
     for node in data["nodes"]:
-        node_timezones[node["node"]] = node["time"]["timezone"]
-        # only lxc and qemu can have snapshots
-        for vm in node.get("lxc", []) + node.get("qemu", []):
-            snapshot_data[str(vm["vmid"])] = {
-                "snaptimes": [x["snaptime"] for x in vm["snapshot"] if "snaptime" in x],
-            }
+        if "timezone" in node["time"]:
+            node_timezones[node["node"]] = node["time"]["timezone"]
+            # only lxc and qemu can have snapshots
+            for vm in node.get("lxc", []) + node.get("qemu", []):
+                snapshot_data[str(vm["vmid"])] = {
+                    "snaptimes": [x["snaptime"] for x in vm["snapshot"] if "snaptime" in x],
+                }

After the change the check started working again.
Therefore I will also create a pull request on the official Checkmk GitHub page to fix this problem.
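
The patch nests the snapshot loop under the timezone check, so a node without a timezone is skipped entirely. A variant that guards only the timezone lookup could look like the sketch below. This is not the upstream fix: the function name `collect_node_info` and the sample data are hypothetical, and the `.get()` fallbacks for `"time"` and `"snapshot"` are assumptions that go beyond the original code:

```python
def collect_node_info(data):
    """Collect per-node timezones and per-VM snapshot times.

    Hypothetical helper: nodes without a "timezone" entry are left out of
    node_timezones, but their VMs (if any are listed) are still processed.
    """
    node_timezones = {}
    snapshot_data = {}
    for node in data["nodes"]:
        # Assumption: an offline node may lack "time" or "timezone" entirely.
        timezone = node.get("time", {}).get("timezone")
        if timezone is not None:
            node_timezones[node["node"]] = timezone
        # Only lxc and qemu can have snapshots.
        for vm in node.get("lxc", []) + node.get("qemu", []):
            snapshot_data[str(vm["vmid"])] = {
                "snaptimes": [
                    x["snaptime"] for x in vm.get("snapshot", []) if "snaptime" in x
                ],
            }
    return node_timezones, snapshot_data


# Illustrative data only; not taken from a real Proxmox VE response.
sample = {
    "nodes": [
        {
            "node": "pve1",
            "time": {"timezone": "Europe/Berlin"},
            "qemu": [{"vmid": 210, "snapshot": [{"snaptime": 1673650800}, {"name": "current"}]}],
        },
        {"node": "pve2", "time": {}},  # offline node
    ]
}

timezones, snapshots = collect_node_info(sample)
print(timezones)
print(snapshots)
```

Whether VMs of a node without timezone data should still be processed depends on what the API returns for an offline node, which this thread does not show.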

hafnix · January 23, 2023, 1:27pm

I have applied your patch and can confirm that it’s working and fixed the issue.

Thank you very much!

mensinck · January 24, 2023, 10:59am

I also applied the patch and can confirm it’s working.

This also holds for clusters with nodes switched off.

Thanks for your work.
