I’m a facts / numbers guy myself, so after being annoyed by the constant yellow messages I checked my logs to find out how often it happens and whether there is a pattern:
OMD[mysite]:~$ grep "wmi_cpuload" var/check_mk/core/history | grep -o "SERVICE ALERT:.*Check_MK;WARN" | sort | uniq -c | sort -n
1 SERVICE ALERT: server022;Check_MK;WARN
1 SERVICE ALERT: server009;Check_MK;WARN
1 SERVICE ALERT: server036;Check_MK;WARN
1 SERVICE ALERT: server065;Check_MK;WARN
1 SERVICE ALERT: server01;Check_MK;WARN
1 SERVICE ALERT: server04;Check_MK;WARN
2 SERVICE ALERT: server067;Check_MK;WARN
3 SERVICE ALERT: server072;Check_MK;WARN
3 SERVICE ALERT: server03;Check_MK;WARN
408 SERVICE ALERT: server049;Check_MK;WARN
461 SERVICE ALERT: server073;Check_MK;WARN
477 SERVICE ALERT: servers01;Check_MK;WARN
482 SERVICE ALERT: server039;Check_MK;WARN
492 SERVICE ALERT: server011;Check_MK;WARN
493 SERVICE ALERT: server018;Check_MK;WARN
497 SERVICE ALERT: server013;Check_MK;WARN
497 SERVICE ALERT: server062;Check_MK;WARN
498 SERVICE ALERT: server041;Check_MK;WARN
500 SERVICE ALERT: server034;Check_MK;WARN
500 SERVICE ALERT: server070;Check_MK;WARN
519 SERVICE ALERT: server06;Check_MK;WARN
This particular site contains about 50 Windows hosts and one can clearly see that 12 of them are (badly) affected (the outlier with “only” 408 events was patched to 2.1 yesterday already for different reasons, therefore no longer any WARNINGs after that).
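For anyone who prefers to do the same count in Python, here is a rough sketch of the grep pipeline above. It assumes the history lines have the shape `SERVICE ALERT: <host>;Check_MK;WARN;...`; the function names and the stricter host pattern (no `;` in host names) are my own choices, not anything shipped with Checkmk.

```python
import re
from collections import Counter

# Captures the host name from the same text the `grep -o` pattern extracts.
# Assumption: host names contain no ';'.
ALERT_RE = re.compile(r"SERVICE ALERT: ([^;]+);Check_MK;WARN")


def count_warn_alerts(lines, needle="wmi_cpuload"):
    """Count Check_MK WARN alerts per host among lines that mention `needle`."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue  # same effect as the first grep
        match = ALERT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def sorted_report(counts):
    """Return (count, host) pairs in ascending order, like `sort -n`."""
    return sorted((n, host) for host, n in counts.items())
```

Passing the open file object for var/check_mk/core/history to `count_warn_alerts` and printing `sorted_report(...)` should give a list comparable to the one above, which makes it easy to separate the sporadic hosts from the badly affected ones.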
Unfortunately I don’t see any pattern as to the specific servers affected (and unaffected), other than it’s only happening with 1.6.0 and 2.0 agents, not earlier ones (did the old agent even implement such a check?). Who knows which spurious Windows Registry bit bothers the WMI service on those hosts; the WARNING in Checkmk likely only highlights an issue that has always been there.
Good news to hear that “wmi_cpuload” no longer uses WMI in 2.1 agents after all.
Updating the agents to 2.1 fixes the issue, but there is obviously a server-side change in 2.1 as the root cause of the issue: nobody changed anything on the agents to make the check break, and CMC 2.0 was content with the results of the same agents, while 2.1 sometimes isn’t.
I’ve downloaded the new agents from the bakery and updated the 12 servers manually, and now expect only sporadic occurrences of this in my logs (as can be seen from the number of agents where the error happened exactly once). I can totally live with that. The sporadic entries should cease to occur once I’ve managed to update all of the remaining Windows servers to 2.1.
If you disable the section inside the agent and don’t do a discovery on the affected hosts, the message will stay the same as before. On the affected hosts the wmi_cpuload service should now be shown as vanished at discovery time.
I have to emphasize this: The issue lies within Windows or WMI more specifically.
We are doing our very best to properly monitor metrics that we can only get through WMI, but it is a pain.
There might always be room for improvement on our end, but again: We are working around issues in WMI and we can only do so much.
The problem is that it was not there before 2.1. We have now changed our dashboard from “service states” to “service hard states” and set the hard state limit to 3, so a check can fail twice before we get this warning, because the section is always missing just once. On the next agent call, wmi_cpuload (and it is ALWAYS only the cpuload) is back, regardless of whether the data comes directly from the server or as piggyback. The problem now is that warnings for more critical services are delayed by 3 minutes.
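To illustrate why a hard state limit of 3 hides a single missing section but also delays real alerts, here is a simplified model. It assumes the usual soft/hard rule: a non-OK result only becomes a hard state after the configured number of consecutive non-OK checks, and any OK result resets the counter. The function name is made up and this is not how the core is actually implemented.

```python
def hard_states(check_results, max_attempts=3):
    """Return the hard state after each check for a sequence of raw results.

    Simplified model: `max_attempts` consecutive non-OK results are needed
    before the hard state changes; any OK result resets the counter.
    """
    hard = "OK"
    attempt = 0
    history = []
    for state in check_results:
        if state == "OK":
            hard = "OK"
            attempt = 0
        else:
            attempt += 1
            if attempt >= max_attempts:
                hard = state
        history.append(hard)
    return history
```

With a one-off miss such as `["OK", "WARN", "OK"]` the hard state never leaves OK, while a genuine outage like `["OK", "CRIT", "CRIT", "CRIT"]` only shows up as CRIT on the third failing check, which is the delay described above.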
It really looks like a problem with how the data for the WMI checks is processed in CMK 2.1.
This is not an agent problem, as the data for the check is reported.
But CMK has, under unknown circumstances, a problem processing the received data for some WMI checks correctly.
I can’t imagine how you could get missing cpu_load in 2.1p13.
This version uses performance counters, and performance counters (if found) are quite stable.
Maybe the Windows agent can’t find the required performance counters? This is possible, at least theoretically.
I need the log from the Windows agent. If possible, MSI + zipped Programdata/checkmk to see what happened.
Just FYI: piggyback and the dashboard hard state limit have no bearing on Windows agent functionality.
You can easily validate how good the output from wmi_cpuload is by running check_mk_service section wmi_cpuload
Expected output with performance counters
I can confirm the problem here too. The 1.4.0 agent works fine in a 1.5.0 installation, but in the new 2.1, where we have not updated the agents yet (the old site is also still running fine), we get the missing agent section for wmi_cpuload over and over.
We have the same problem in our environment with checkmk 2.1 and different older Agent versions (1.2/1.4/1.6).
The error stops when the agents are updated to 2.1; the problems started right after the 2.1 upgrade.
In fact, it’s a bug in the agent; CMK just seems to handle it differently in 2.1.
Whenever it happens, the section is missing all of its information; when you check again, the information is back.
This is how the section looks when the failure appears:
Today I had some systems with this problem, and only the [system_perf] table was empty. The computer_system table was working as expected.
In search of a solution I also had a look at the check, wmi_cpuload.py, and found some strange things.
Why is there no real error handling inside the parsing function? It returns empty output but the section is not empty.
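To check a captured agent output for this situation without touching the check itself, something like the following sketch can help. It assumes the section starts with a `<<<wmi_cpuload...>>>` header, uses `|` as the column separator, contains bracketed sub-tables such as `[system_perf]` and `[computer_system]`, and that the first row of each sub-table is a column header. The function names are hypothetical and this is not the parser from wmi_cpuload.py.

```python
import re

SECTION_RE = re.compile(r"^<<<([^:>]+)(?::[^>]*)?>>>$")  # e.g. <<<name:sep(124)>>>
TABLE_RE = re.compile(r"^\[(\w+)\]$")                    # e.g. [system_perf]


def split_tables(agent_output, section="wmi_cpuload"):
    """Return {table_name: [rows]} for the bracketed sub-tables of one section."""
    tables = {}
    in_section = False
    current = None
    for raw in agent_output.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            # A new section header starts; only collect rows of the wanted one.
            in_section = header.group(1) == section
            current = None
            continue
        if not in_section or not line:
            continue
        table = TABLE_RE.match(line)
        if table:
            current = table.group(1)
            tables.setdefault(current, [])
        elif current is not None:
            tables[current].append(line.split("|"))
    return tables


def empty_tables(tables, expected=("system_perf", "computer_system")):
    """Names of expected tables that are missing or have no data row."""
    # Assumption: row 0 is the column header, so fewer than 2 rows means "no data".
    return [name for name in expected if len(tables.get(name, [])) < 2]
```

Running `empty_tables(split_tables(output))` on a dump taken at the moment of failure should show whether only one sub-table or the whole section came back empty.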
For everyone who wants a small workaround:
It only gives the error if the computer_system table is empty.
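The rule itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the actual patch to wmi_cpuload.py: the names are hypothetical, and it assumes each table arrives as a list of rows whose first row is the column header.

```python
from typing import Dict, List, Optional

Table = List[List[str]]  # assumption: first row holds the column names


def first_row_as_dict(table: Table) -> Dict[str, str]:
    """Map column names to the values of the first data row, or {} if none."""
    if len(table) < 2:
        return {}
    return dict(zip(table[0], table[1]))


def parse_tolerant(tables: Dict[str, Table]) -> Optional[Dict[str, Dict[str, str]]]:
    """Return parsed data unless computer_system is empty.

    An empty system_perf table is tolerated and yields an empty dict;
    only an empty computer_system table makes the whole section count
    as missing (signalled here by returning None).
    """
    computer_system = first_row_as_dict(tables.get("computer_system", []))
    if not computer_system:
        return None
    return {
        "computer_system": computer_system,
        "system_perf": first_row_as_dict(tables.get("system_perf", [])),
    }
```

The caller then has to decide what to do when `system_perf` comes back empty, for example skipping the metric for that one check cycle instead of reporting missing monitoring data.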