CMK version: 2.2.0p5 Enterprise Edition
OS version: Debian 11
Hey people,
We have invested a lot of time in improving the performance of Checkmk. Our installation is relatively small, with only 4000 services.
We need a service check rate of 46 checks/s, but Checkmk often cannot achieve this. This often leads to false alerts: a host is reported as down even though it is not.
We have already adjusted and tested a few values in the global settings, but nothing solves the problem.
Our settings:
Maximum concurrent active checks: 100
Maximum concurrent real-time checks: 12
Maximum concurrent Livestatus connections: 20
Maximum concurrent Checkmk fetchers: 13
Maximum concurrent Checkmk checkers: 12
Unfortunately, the standard settings don’t help here.
If anyone has an idea or improvement, please share it.
Checkers should be about the number of physical cores, so 12 might be good. 4000 services implies 80 to 200 hosts. If some of those are slow, they block fetchers, so double the number of fetchers as a start.
And please update to the latest patch level, there have been performance improvements.
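The rule of thumb above (checkers roughly equal to the number of physical cores, fetchers doubled as a first step) can be written down as a small calculation. This is only a sketch of that advice, not an official sizing formula; the function name `suggest_helper_counts` is made up for illustration.

```python
import os


def suggest_helper_counts(physical_cores: int, current_fetchers: int) -> dict:
    """Apply the rule of thumb from this thread.

    checkers: about one per physical core
    fetchers: double the current value as a starting point
    """
    if physical_cores < 1 or current_fetchers < 1:
        raise ValueError("core and fetcher counts must be positive")
    return {"checkers": physical_cores, "fetchers": current_fetchers * 2}


if __name__ == "__main__":
    # Note: os.cpu_count() reports logical CPUs, which can be higher than
    # the number of physical cores (hyper-threading, vCPUs on a VM).
    cores = os.cpu_count() or 1
    print(suggest_helper_counts(cores, current_fetchers=13))
```

With the values from this thread (12 cores, 13 fetchers) this gives 12 checkers and 26 fetchers. Treat the result as a starting point and verify it against the fetcher and checker usage graphs.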
Can you please share the Checkmk server specs (CPU, RAM, SSD or HDD)?
4000 services is nothing for a regular Checkmk installation and should not lead to performance issues. That’s why it is important to know which resources are available to the Checkmk server.
The server is a 12-core VM with 8 GB RAM, of which 5 GB is in use. The host system runs on SSD drives. The CPU is an old one: Intel Xeon E5-2650 at 2 GHz.
Could that be the problem?
If the CPU utilization is constantly close to 100%, of course. But I doubt this.
Have a look at the graphs for checker/fetcher utilization (the graphs of the service "OMD sitename performance") to find out whether the bottleneck is there. You might post screenshots of these graphs here.
Look directly at the graphs to see the historic fetcher and checker usage.
But I doubt you will have to scale the checkers. In my experience, 2-6 should be sufficient for your size. You could increase the fetchers, however, as they shouldn't have a high impact on load.
Can you check what processes consume so much on that machine?
The checker and fetcher usage is always <1%.
I have updated Checkmk to 2.2.0p12 and set the fetcher and active check values much higher. I still have to check whether there are any false positive messages.
But I am still wondering why the server cannot always reach 100% of the required service check rate.
So, performance-wise everything looks good, but hosts still occasionally go down? Checkmk uses "smart ping" to check whether hosts can be reached. This should not be a big deal, just more ICMP traffic than normal. However, we have seen several environments where virtualization was responsible for the problems, because an emulated adapter such as an Intel e1000 could not handle the massive number of interrupts caused by smart ping. So please check which emulated Ethernet adapters are in use.
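To see which network driver the Checkmk VM is actually using, one option on Linux is to read the driver symlinks under `/sys/class/net`. The following is a minimal sketch under that assumption; the function name `nic_drivers` and the set of flagged drivers are illustrative, and `e1000` is listed only because it is the example mentioned above.

```python
import os
from pathlib import Path

# Driver names to flag; "e1000" is the emulated adapter mentioned in this
# thread. Extend this set for your own environment (assumption).
FLAGGED_DRIVERS = {"e1000"}


def nic_drivers(sys_net: str = "/sys/class/net") -> dict:
    """Map each network interface to its kernel driver name.

    Interfaces without a backing device (e.g. loopback) map to None.
    """
    drivers = {}
    base = Path(sys_net)
    if not base.is_dir():
        return drivers
    for iface in sorted(base.iterdir()):
        if not iface.is_dir():
            continue  # skip plain files such as bonding_masters
        link = iface / "device" / "driver"
        if link.exists():
            # The symlink target's last path component is the driver name.
            drivers[iface.name] = os.path.basename(os.path.realpath(link))
        else:
            drivers[iface.name] = None
    return drivers


if __name__ == "__main__":
    for name, driver in nic_drivers().items():
        note = "  <-- emulated adapter, worth checking" if driver in FLAGGED_DRIVERS else ""
        print(f"{name}: {driver}{note}")
```

Run it on the Checkmk server itself. If an interface reports a flagged driver, the adapter type can usually be changed in the hypervisor's VM settings.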
And another hint from a colleague: please check whether you can ping the hosts in question from the Checkmk server. On Windows, ICMP echo is off by default. If smart ping fails but agent output can be retrieved, Checkmk (correctly) assumes the host to be up. But if the agent output is delayed for some reason, these hosts might be reported as down for one check interval.
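The ping test can be run for a whole list of hosts at once. The sketch below calls the standard Linux `ping` command (iputils, with `-c` for the packet count and `-W` for the reply timeout in seconds). It sends plain ICMP echo requests and is not Checkmk's smart ping implementation; the host names at the bottom are placeholders.

```python
import subprocess


def build_ping_command(host: str, count: int = 3, timeout_s: int = 2) -> list:
    """Build an iputils ping command line for one host."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host: str, run=subprocess.run) -> bool:
    """Return True if ping exits with code 0 (at least one reply received).

    The `run` parameter exists only so the function can be tested
    without sending real packets.
    """
    result = run(
        build_ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder host names; replace with the hosts that were reported down.
    for host in ["host-a.example.com", "host-b.example.com"]:
        state = "replies to ping" if is_reachable(host) else "NO ping reply"
        print(f"{host}: {state}")
```

Run it as the site user on the Checkmk server, so the test uses the same network path as the monitoring. A host that never replies here but delivers agent data is a candidate for the situation described above.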
Checkmk, especially in the enterprise editions with the Checkmk Micro Core, performs exceptionally well. There are customers out there monitoring between 500,000 and 1,000,000 services with thousands of hosts on a single site. That being said, it is extremely unlikely that your performance issues come from the capabilities of Checkmk itself. Experience suggests that it has to be a configuration issue or an underlying problem.
You can check the logs in $OMD_ROOT/var/log/; maybe they can give you a hint. If possible, open a ticket with our support. This is probably something that is easy to fix once one of our engineers can take a look at the environment.
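For a first pass over the logs, a small script can list lines that look suspicious. This is a rough sketch: it assumes the log files end in `.log`, and the keyword list is a guess rather than a list of actual Checkmk log messages, so adjust both to what you find in the directory.

```python
import os
from pathlib import Path

# Assumed keywords, matched case-insensitively; not taken from Checkmk itself.
KEYWORDS = ("error", "timeout", "timed out", "critical")


def scan_logs(log_dir, keywords=KEYWORDS, max_hits=20):
    """Return up to max_hits (path, line number, line) tuples that match a keyword."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        try:
            with path.open(errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    lowered = line.lower()
                    if any(word in lowered for word in keywords):
                        hits.append((str(path), lineno, line.rstrip()))
                        if len(hits) >= max_hits:
                            return hits
        except OSError:
            continue  # unreadable file, skip it
    return hits


if __name__ == "__main__":
    omd_root = os.environ.get("OMD_ROOT")
    if omd_root is None:
        print("OMD_ROOT is not set; run this as the site user.")
    else:
        for path, lineno, line in scan_logs(Path(omd_root) / "var" / "log"):
            print(f"{path}:{lineno}: {line}")
```

Run it as the site user so that `OMD_ROOT` is set. The output is only a starting point for reading the relevant log file in full.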