CMK version: Checkmk Raw Edition 2.2.0p25
OS version: Ubuntu 22.04.4 LTS
Hello,
I have run CMK 2.0.0 for years with okay performance. I have now upgraded first to 2.1.0, which already impacted performance quite a bit, and a day later to 2.2.0, which made the monitoring server completely unusable. The performance is so poor that almost all checks time out and the Service Speed-o-Meter is almost always at or close to zero. The sysload is often beyond 1000%, sometimes reaching 2000%.
The monitoring VM has 5 CPU cores and 4 GB RAM and monitors 139 hosts and 1931 checks. There has never been any significant performance issue in recent years. I have upgraded to 8 cores and 8 GB RAM, but this had no impact whatsoever.
The scheduled service check rate is 29.2 checks per second. The majority of checks are permanently stale.
As of now we effectively have no monitoring, as nothing works anymore.
Is there any solution to this, or should we just revert to 2.1.0 and leave it alone for a few years until a more stable version of CMK is released?
This is p25, not p2, so I doubt this is a general problem with the 2.2 version, which has by and large been pretty stable for a while now. It is also, as far as I remember, the first time I have heard about massive performance problems with this version. There must be something specific to your environment that is causing this, so it's unlikely that a p26 will solve this for you.
I am sure the troubleshooting experts here are going to comment on this soon (well, it’s a public holiday in Germany where most of them are, but soon-ish, I guess). But rolling back to 2.1 and then going on a fact-finding mission to find out what’s causing this seems like the best course of action.
(One thing you could try is to go to an earlier patch release of 2.2, like p23, to see if we maybe accidentally introduced a problem with the newest release…)
After letting this rest for a bit with the upgraded VM resources, I am seeing RAM usage of around 5 GB and a sysload of around 250%.
So I guess that with 2000 checks, 12 cores and 6–8 GB RAM would be an appropriate configuration.
It seems like the system requirements have increased to the point that the old specs the system had when running version 2.0.0 are simply not sufficient to run 2.2.0 at all.
Increasing the resources made the issues largely go away after a bit over 30 minutes. There are still a few stale services which come and go, but that is okay for me.
So I consider this resolved, with the solution being to double the VM resources when upgrading from 2.0.0 to 2.2.0 and to keep an eye on CPU and memory usage to dial it in properly.
That's not the whole story. There are some important points you should check, even if you say you can live with the current state.
First, the check interval for the "Check_MK Discovery" and "Check_MK HW/SW Inventory" services, if you use both functions. I have seen systems where the rule that these services should only be executed every 6 or 12 hours was broken.
The second important point is the runtime of the "Check_MK" service itself on your hosts. Do you see execution times of more than a few seconds there? If so, you need to inspect these hosts.
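Both points can be checked quickly via Livestatus from the command line. Here is a rough sketch in Python (just an illustration, not an official tool): it assumes the default socket location of an OMD site, the default service descriptions "Check_MK Discovery" and "Check_MK HW/SW Inventory", and an arbitrary 10 second threshold for "slow" hosts. Run it as the site user.

```python
#!/usr/bin/env python3
"""Sketch: query Livestatus for the check intervals of the discovery and
HW/SW inventory services and for the execution time of the "Check_MK"
service. Socket path, service descriptions and the threshold are
assumptions based on a default setup."""

import json
import os
import socket

# Default Livestatus socket of an OMD site (assumption: standard layout).
LIVE_SOCKET = os.path.expanduser("~/tmp/run/live")


def livestatus_query(query: str) -> list:
    """Send a Livestatus query and return the JSON-decoded response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(LIVE_SOCKET)
        sock.sendall((query + "OutputFormat: json\n\n").encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of query, then read the reply
        raw = b""
        while chunk := sock.recv(4096):
            raw += chunk
    return json.loads(raw)


# 1. How often do discovery and HW/SW inventory actually run?
#    check_interval is in multiples of the core's interval length
#    (60 s by default), so the value is effectively minutes.
intervals = livestatus_query(
    "GET services\n"
    "Columns: host_name description check_interval\n"
    "Filter: description = Check_MK Discovery\n"
    "Filter: description = Check_MK HW/SW Inventory\n"
    "Or: 2\n"
)
for host, descr, interval in intervals:
    print(f"{host:30} {descr:30} every {interval:.0f} min")

# 2. Which hosts have a slow "Check_MK" service (agent fetch + evaluation)?
slow = livestatus_query(
    "GET services\n"
    "Columns: host_name execution_time\n"
    "Filter: description = Check_MK\n"
    "Filter: execution_time > 10\n"
)
for host, runtime in sorted(slow, key=lambda row: -row[1]):
    print(f"{host:30} Check_MK took {runtime:.1f} s")
```

If the first query shows intervals of one or two minutes instead of several hours, the periodic discovery/inventory rule is not being applied and these heavy checks run far too often.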
In connection with the problem of the check interval of the discovery and hardware inventory services, you need to know that after the update all of these heavy checks were scheduled to run by the Nagios core.
One component that mainly needs more CPU power in the newer versions is the Apache.
The automatic service discovery is running every two hours and the HW/SW Inventory is running once a day.
The Check_MK service itself takes just under 5 seconds on the host with the most plugins and other things going on. The other hosts are at around 3.5 seconds.
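If anyone wants to cross-check such numbers outside the GUI, one rough option (just a sketch; the host names are placeholders) is to time the raw agent fetch with `cmk -d` from the site:

```python
#!/usr/bin/env python3
"""Sketch: time the raw agent fetch per host via `cmk -d`.
Run as the site user; the host names below are placeholders."""

import subprocess
import time

HOSTS = ["host1", "host2"]  # placeholders - replace with your monitored hosts

for host in HOSTS:
    start = time.monotonic()
    # `cmk -d HOST` fetches and prints the raw agent data of HOST.
    result = subprocess.run(["cmk", "-d", host], capture_output=True, text=True)
    elapsed = time.monotonic() - start
    print(f"{host:30} {elapsed:5.1f} s, {len(result.stdout)} bytes of agent output")
```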