Update from 2.0.0P22 to 2.1 | Missing monitoring data for plugins: wmi_cpuload

I’m a facts / numbers guy myself, so after being annoyed with the constant yellow messages I’ve checked my logs because I wanted to find out how often it happens and whether there is a pattern:

OMD[mysite]:~$ grep "wmi_cpuload" var/check_mk/core/history | grep -o "SERVICE ALERT:.*Check_MK;WARN" |  sort | uniq -c | sort -n
      1 SERVICE ALERT: server022;Check_MK;WARN
      1 SERVICE ALERT: server009;Check_MK;WARN
      1 SERVICE ALERT: server036;Check_MK;WARN
      1 SERVICE ALERT: server065;Check_MK;WARN
      1 SERVICE ALERT: server01;Check_MK;WARN
      1 SERVICE ALERT: server04;Check_MK;WARN
      2 SERVICE ALERT: server067;Check_MK;WARN
      3 SERVICE ALERT: server072;Check_MK;WARN
      3 SERVICE ALERT: server03;Check_MK;WARN
    408 SERVICE ALERT: server049;Check_MK;WARN
    461 SERVICE ALERT: server073;Check_MK;WARN
    477 SERVICE ALERT: servers01;Check_MK;WARN
    482 SERVICE ALERT: server039;Check_MK;WARN
    492 SERVICE ALERT: server011;Check_MK;WARN
    493 SERVICE ALERT: server018;Check_MK;WARN
    497 SERVICE ALERT: server013;Check_MK;WARN
    497 SERVICE ALERT: server062;Check_MK;WARN
    498 SERVICE ALERT: server041;Check_MK;WARN
    500 SERVICE ALERT: server034;Check_MK;WARN
    500 SERVICE ALERT: server070;Check_MK;WARN
    519 SERVICE ALERT: server06;Check_MK;WARN

This particular site contains about 50 Windows hosts and one can clearly see that 12 of them are (badly) affected (the outlier with “only” 408 events was patched to 2.1 yesterday already for different reasons, therefore no longer any WARNINGs after that).
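For anyone who wants the same tally without the shell pipeline, here is a minimal Python sketch. It assumes every relevant history line contains the fragment `SERVICE ALERT: <host>;Check_MK;WARN`, as in the listing above. The function name `count_warn_alerts` and the sample lines in the usage note are made up for illustration and are not part of Checkmk.

```python
import re
from collections import Counter

# Same fragment that the grep -o pattern extracts from each history line.
ALERT_RE = re.compile(r"SERVICE ALERT: ([^;]+);Check_MK;WARN")


def count_warn_alerts(lines, needle="wmi_cpuload"):
    """Count Check_MK WARN alerts per host among lines that mention `needle`."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue  # mirrors the first grep: only lines about wmi_cpuload
        match = ALERT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def print_report(counts):
    """Print hosts sorted by ascending count, similar to `sort -n`."""
    for host, count in sorted(counts.items(), key=lambda item: item[1]):
        print(f"{count:7d} SERVICE ALERT: {host};Check_MK;WARN")
```

Usage would be something like `print_report(count_warn_alerts(open("var/check_mk/core/history")))`, run from the site directory as the site user.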

Unfortunately I don’t see any pattern as to which servers are affected and which are not, other than that it only happens with 1.6.0 and 2.0 agents, not earlier ones (did the old agent even implement such a check?). Who knows which spurious Windows Registry bit bothers the WMI service on those hosts; the WARNING in Checkmk likely only highlights an issue that has always been there.

Good to hear that the “wmi_cpuload” section no longer uses WMI in 2.1 agents after all.
Updating the agents to 2.1 fixes the issue, but the root cause is obviously a server-side change in 2.1: nobody changed anything on the agents to make the check break, and CMC 2.0 was content with the results of the same agents, while 2.1 sometimes isn’t.

I’ve downloaded the new agents from the bakery and updated the 12 servers manually. I now expect only sporadic occurrences of this in my logs (as can be seen from the number of agents where the error happened exactly once), and I can totally live with that. The sporadic entries should cease once I’ve managed to update all of the remaining Windows servers to 2.1.

I’ve been getting the same WMI messages since I upgraded OMD and the agents to 2.1.0p3. I then updated to 2.1.0p8 and I still get too many warnings from Windows servers every day.

@keren Try the latest release. My agent and OMD version is 2.1.0p9 and the issue is resolved for me.

Thanks Lasse, I will do that.

Hi guys, the problem still exists even after upgrading to 2.1.0p11 and also applying the rule [Disabled sections (Windows agent)].
Any suggestions?

If you disable the section inside the agent and don’t run a discovery on the affected hosts, the message will stay the same as before. On the affected hosts the wmi_cpuload service should now be shown as vanished at discovery time.

Same issue with 2.1.0p13. Agent and server are both on p13 and I get missing wmi_cpuload in piggyback data for some of the Hyper-V clients. There must be an issue within Checkmk itself.

I have to emphasize this: the issue lies within Windows, or WMI more specifically.
We are doing our very best to properly monitor metrics that we can only get through WMI, but it is a pain.
There might always be room for improvement on our end, but again: We are working around issues in WMI and we can only do so much.

The problem is that it was not there before 2.1. We have now changed the dashboard from “service states” to “service hard states” and set the hard state limit to 3, so the check can fail twice before we get this warning, because the section is always missing just once. On the next agent call wmi_cpuload (and it is ALWAYS only the cpuload) is back, regardless of whether it comes directly from the server or via piggyback. The downside is that warnings for more critical services are now delayed by 3 minutes.
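To illustrate why a hard state limit of 3 hides a single missed section but also delays real alerts, here is a simplified model of soft versus hard states. It is only a sketch and not Checkmk’s actual state machine: the function `hard_states` is a made-up name, and it assumes that a non-OK result becomes a hard state only after `max_attempts` consecutive non-OK checks, and that one OK result resets everything.

```python
def hard_states(results, max_attempts=3):
    """Return the hard state visible after each check result.

    Simplified assumption: a problem turns hard only once it has been
    seen `max_attempts` times in a row; any OK result resets the streak.
    """
    hard = "OK"
    streak = 0
    timeline = []
    for state in results:
        if state == "OK":
            streak = 0
            hard = "OK"
        else:
            streak += 1
            if streak >= max_attempts:
                hard = state
        timeline.append(hard)
    return timeline


if __name__ == "__main__":
    # One missed wmi_cpuload section: never reaches a hard state.
    print(hard_states(["OK", "WARN", "OK", "OK"]))
    # A real outage: only visible from the third failed check onwards.
    print(hard_states(["OK", "CRIT", "CRIT", "CRIT", "OK"]))
```

In this model the single WARN never shows up on a hard-state dashboard, while a genuine CRIT only appears with the third consecutive failed check, which is the delay described above.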

It really looks like a problem with how the data for the WMI checks is processed in CMK 2.1.
This is not an agent problem, as the data for the check is reported.
But under unknown circumstances CMK fails to process the received data for some WMI checks correctly.

@SergejKipnis maybe you can add something of substance here? I got nothing to be honest.

I can’t imagine how you could get missing cpu_load in 2.1p13.
This version uses performance counters, and performance counters (if found) are quite stable.
Maybe the Windows agent can’t find the required performance counters? This is possible, at least theoretically.

I need the log from the Windows agent. If possible, send the MSI plus a zipped ProgramData/checkmk so I can see what happened.

Just FYI, piggyback and the dashboard hard state limit have no bearing on Windows agent functionality.
You can easily validate how good the wmi_cpuload output is by running
check_mk_service section wmi_cpuload
Expected output with performance counters:

<<<wmi_cpuload:sep(124)>>>
[system_perf]
Name|ProcessorQueueLength|Timestamp_PerfTime|Frequency_PerfTime|WMIStatus
|0|1725542868066|10000000|OK
[computer_system]
Name|NumberOfLogicalProcessors|NumberOfProcessors|WMIStatus
KLAPP-0336|20|1|OK

You mean that even though the agent did deliver the data, the check may, for some unknown reason, not process it correctly?

Exactly, this happens very often for older agents if you upgrade to 2.1.

I can confirm the problem here too: a 1.4.0 agent works fine in a 1.5.0 installation, but in the new 2.1 site, where we have not updated the agents yet (the old site is also still running fine), we get the missing agent section for wmi_cpuload over and over.

We have the same problem in our environment with Checkmk 2.1 and various older agent versions (1.2/1.4/1.6).
The problems started right after the 2.1 upgrade, and the error stops when the agents are updated to 2.1.

Could I get an output from your agent?

Is it possible to obtain Windows agent output?

Hey Sergej,

in fact, it’s a bug in the agent; CMK just seems to handle it differently in 2.1.
Whenever it happens, the section is missing all the information; when you check again, the information is back.

This is how the Section looks, when the failure appears:

<<<wmi_cpuload:sep(44)>>>
[system_perf]
WMItimeout
[computer_system]
WMItimeout

So, there are Timeouts…
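To spot these timeouts in saved agent output without eyeballing every section, a small parser is enough. This is a minimal sketch that assumes the layout shown in this thread: a `<<<wmi_cpuload...>>>` header, `[table]` sub-headers, and the literal marker `WMItimeout` as the table body. The function name `timed_out_tables` is made up for illustration and is not part of Checkmk.

```python
def timed_out_tables(section_text):
    """Return the names of WMI tables whose body is the marker 'WMItimeout'."""
    timed_out = []
    current = None
    for raw_line in section_text.splitlines():
        line = raw_line.strip()
        if line.startswith("<<<"):
            current = None  # a new agent section starts, forget the last table
        elif line.startswith("[") and line.endswith("]"):
            current = line[1:-1]  # e.g. "system_perf"
        elif line == "WMItimeout" and current is not None:
            if current not in timed_out:
                timed_out.append(current)
    return timed_out


if __name__ == "__main__":
    broken = """<<<wmi_cpuload:sep(44)>>>
[system_perf]
WMItimeout
[computer_system]
WMItimeout
"""
    print(timed_out_tables(broken))
```

Fed with the failing section above it reports both tables; fed with a healthy section it returns an empty list.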

Today I had some systems with this problem where only the [system_perf] table was empty; computer_system was working as expected.
While searching for a solution I also had a look at the check plugin, wmi_cpuload.py, and found some strange things.
Why is there no real error handling inside the parsing function? It returns empty output even though the section is not empty.

For all who want a small workaround:
With this change it only gives the error if the computer_system table is empty.

Original

    try:
        load = wmi_tables["system_perf"].get(0, "ProcessorQueueLength")
        timestamp = get_wmi_time(wmi_tables["system_perf"], 0)
        computer_system = wmi_tables["computer_system"]
    except (KeyError, WMIQueryTimeoutError):
        return None
    assert load

Changed version

    try:
        load = wmi_tables["system_perf"].get(0, "ProcessorQueueLength")
    except (KeyError, WMIQueryTimeoutError):
        load = 0.0

    try:
        timestamp = get_wmi_time(wmi_tables["system_perf"], 0)
    except (KeyError, WMIQueryTimeoutError):
        timestamp = 0.0

    try:
        computer_system = wmi_tables["computer_system"]
    except (KeyError, WMIQueryTimeoutError):
        return None
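To compare the two variants outside of Checkmk, here is a self-contained sketch. Everything except the two try/except blocks is a hypothetical stand-in: `FakeTable`, the local `WMIQueryTimeoutError`, and this `get_wmi_time` only imitate the behaviour needed for the demo (a table that raises on a timeout, a timestamp computed as ticks divided by frequency) and are not the real Checkmk helpers.

```python
class WMIQueryTimeoutError(Exception):
    """Stand-in for the exception named in the snippets above."""


class FakeTable:
    """Hypothetical table: returns a column value or raises on timeout."""

    def __init__(self, row=None, timed_out=False):
        self._row = row or {}
        self._timed_out = timed_out

    def get(self, row_index, column):
        if self._timed_out:
            raise WMIQueryTimeoutError("WMI query timed out")
        return self._row[column]


def get_wmi_time(table, row_index):
    # Assumption for this demo only: timestamp = ticks / frequency.
    ticks = int(table.get(row_index, "Timestamp_PerfTime"))
    frequency = int(table.get(row_index, "Frequency_PerfTime"))
    return ticks / frequency


def parse_original(wmi_tables):
    try:
        load = wmi_tables["system_perf"].get(0, "ProcessorQueueLength")
        timestamp = get_wmi_time(wmi_tables["system_perf"], 0)
        computer_system = wmi_tables["computer_system"]
    except (KeyError, WMIQueryTimeoutError):
        return None  # any failure drops the whole section
    return load, timestamp, computer_system


def parse_workaround(wmi_tables):
    try:
        load = wmi_tables["system_perf"].get(0, "ProcessorQueueLength")
    except (KeyError, WMIQueryTimeoutError):
        load = 0.0  # fall back instead of dropping the section
    try:
        timestamp = get_wmi_time(wmi_tables["system_perf"], 0)
    except (KeyError, WMIQueryTimeoutError):
        timestamp = 0.0
    try:
        computer_system = wmi_tables["computer_system"]
    except (KeyError, WMIQueryTimeoutError):
        return None  # only a missing computer_system table is fatal
    return load, timestamp, computer_system
```

With a timed-out `system_perf` table and an intact `computer_system` table, `parse_original` returns `None` while `parse_workaround` returns fallback values. Two caveats follow from plain Python semantics: if the `assert load` line from the original is kept after the changed block, the fallback `0.0` is falsy and the assertion fails; and a fallback of 0.0 means the service would report a load of zero for that interval rather than no data.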