CheckMK Performance

Christian1 · October 18, 2023, 12:31pm

CMK version: 2.2.0p5 Enterprise Edition
OS version: Debian 11

Hey people,

We have invested a lot of time in improving the performance of our Checkmk site. The site is relatively small, with 4000 services.
We need a service check rate of 46 checks/s, but Checkmk often cannot achieve this. This often leads to false alerts: a host is reported as down even though it is not.
We have already adjusted and tested a few values in the global settings, but nothing solves the problem.

Our settings:

Maximum concurrent active checks 100

Maximum concurrent real-time checks 12

Maximum concurrent Livestatus connections 20

Maximum concurrent Checkmk fetchers 13

Maximum concurrent Checkmk checkers 12

Unfortunately, the standard settings don’t help here.

If anyone has an idea or improvement, please share it.

mschlenker · October 18, 2023, 12:43pm

Checkers should be about the number of physical cores, so 12 might be good. 4000 services implies 80 to 200 hosts. If some of those are slow, they block fetchers, so double the number of fetchers as a start.

And please update to the latest patch level; there have been performance improvements.

Norm · October 18, 2023, 12:45pm

Hi @Christian1

Can you please share the Checkmk server specs (CPU, RAM, SSD or HDD)?

4000 services is nothing for a regular Checkmk installation and should not lead to performance issues. That’s why it is important to know which resources are available to the Checkmk server.

Thanks in advance!

Norm

Christian1 · October 18, 2023, 1:11pm

Thanks, I tested with the number of checkers set to 25 and more.

If the server completes all checks in time, should the scheduled check rate then always be >100%, or not?
I ask because at the moment it oscillates between 70% and 114%.
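
For readers following along, the numbers in this thread can be put in relation with some simple arithmetic. The following is a minimal sketch, assuming that the displayed percentage is simply the achieved rate divided by the scheduled rate and that all services share one average check interval; neither assumption is confirmed in the thread, and the function names are made up for illustration.

```python
def achieved_rate(scheduled_rate: float, percent: float) -> float:
    """Checks per second actually executed, given the scheduled rate and the shown percentage."""
    return scheduled_rate * percent / 100.0


def implied_interval(num_services: int, checks_per_s: float) -> float:
    """Average check interval in seconds implied by a service count and a check rate."""
    return num_services / checks_per_s


if __name__ == "__main__":
    # Figures taken from the posts above: 4000 services, 46 checks/s, 70% to 114%.
    print(round(implied_interval(4000, 46), 1))  # average interval in seconds
    print(round(achieved_rate(46, 70), 1))       # checks/s at the low end
    print(round(achieved_rate(46, 114), 1))      # checks/s at the high end
```

Under these assumptions, 46 checks/s for 4000 services corresponds to an average interval of roughly 87 seconds, and the observed range of 70% to 114% corresponds to roughly 32 to 52 checks/s.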

mschlenker · October 18, 2023, 1:13pm

Have a look at fetchers, these are more often the bottleneck and depend extremely on your environment (network, slow check plugins…).

Christian1 · October 18, 2023, 1:20pm

The server is a 12-core VM with 8 GB RAM, of which 5 GB is used. The host system runs on SSD drives. The CPU is an old one: an Intel Xeon E5-2650 at 2 GHz.
Could that be the problem?

mschlenker · October 18, 2023, 1:29pm

If the CPU utilization is constantly close to 100%, of course. But I doubt this.

Have a look at the graphs for checker/fetcher utilization (the graphs of the service "OMD sitename performance") to find out if the bottleneck is there. You might post screenshots of these graphs here.

Christian1 · October 18, 2023, 1:39pm

The server load is constant at 2.5.

martin.hirschvogel · October 19, 2023, 4:25am

Look directly at the graphs to see the historic fetcher and checker usage.
But I doubt you will have to scale up the checkers. In my experience, 2-6 should be sufficient for your size… the fetchers, however, you could increase, though they shouldn't have a high impact on load.
Can you check which processes consume so much on that machine?

Christian1 · October 19, 2023, 12:14pm

The checker and fetcher usage is always <1%.
I have updated Checkmk to 2.2.0p12 and set the fetcher and active check values much higher. I still have to check whether there are false positive messages.
But I am still wondering why the server cannot always reach 100% of the service check rate.

martin.hirschvogel · October 19, 2023, 12:56pm

In that case, the number of checkers and fetchers is too high.

mschlenker · October 19, 2023, 10:00pm

So, performance-wise everything looks good, but hosts still occasionally go down? Checkmk uses “smart ping” to check whether hosts can be reached. This should not be a big deal, just more ICMP traffic than normal. However, we have seen several environments where virtualization had problems handling the large number of interrupts caused by smart ping, for example with an emulated Intel e1000 adapter. So please check which emulated Ethernet adapters are used.
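
One way to see which adapter the Checkmk VM actually got is to look up the kernel driver bound to each network interface inside the guest. The following is a minimal sketch, assuming a Linux guest where `/sys/class/net/<interface>/device/driver` is a symlink to the driver directory; the function names are made up for illustration.

```python
import os
from typing import Dict, Optional


def nic_driver(iface: str, sys_net: str = "/sys/class/net") -> Optional[str]:
    """Return the kernel driver bound to an interface, or None if there is no device link (e.g. lo)."""
    link = os.path.join(sys_net, iface, "device", "driver")
    if not os.path.islink(link):
        return None
    # The symlink target ends in the driver name, e.g. .../drivers/e1000
    return os.path.basename(os.readlink(link))


def all_nic_drivers(sys_net: str = "/sys/class/net") -> Dict[str, Optional[str]]:
    """Map every entry under sys_net to its driver name (or None)."""
    if not os.path.isdir(sys_net):
        return {}
    return {iface: nic_driver(iface, sys_net) for iface in sorted(os.listdir(sys_net))}


if __name__ == "__main__":
    for iface, driver in all_nic_drivers().items():
        print(iface, driver or "(no hardware driver)")
```

A driver name of `e1000` would point to the emulated Intel adapter mentioned above, while paravirtualized adapters typically show up under other names such as `virtio_net` on KVM or `hv_netvsc` on Hyper-V.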

mschlenker · October 20, 2023, 6:13am

And another hint from a colleague: Please check if you can ping the hosts in question from the Checkmk server. On Windows, ICMP echo is off by default. If smart ping fails, but agent output can be retrieved, Checkmk (correctly) assumes a host to be up. But if for some reason agent output is delayed, these hosts might be shown as down for one check interval.
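
To run that ping test for several hosts at once, something like the following could be used on the Checkmk server. It is a minimal sketch, assuming the Linux `ping` from iputils is installed (`-c` sets the number of echo requests, `-W` the reply timeout in seconds); the function names are made up for illustration.

```python
import subprocess
from typing import List


def ping_command(host: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    """Build the ping command line: count echo requests, timeout_s seconds to wait for a reply."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_pingable(host: str) -> bool:
    """True if ping exits with status 0, i.e. at least one echo reply was received."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example call with a placeholder address; replace it with a host that was reported down:
# print(is_pingable("192.0.2.10"))
```

This uses the regular ICMP echo of the operating system, not Checkmk's smart ping, so it only answers the question of whether the hosts reply to ICMP at all.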

Christian1 · October 20, 2023, 8:52am

Thanks
Yes, the hosts are all pingable.

At the moment I’m following up on the assumption that the virtual interface could be the problem. I moved the VM from a KVM host to a Hyper-V host.

What is it actually like on other Checkmk systems?
Is the service check rate always >100% or does it also fluctuate?

robin.gierse · October 25, 2023, 6:39am

Checkmk - especially in the enterprise editions with the Checkmk Micro Core - performs exceptionally well. There are customers out there monitoring between 500,000 and 1,000,000 services with thousands of hosts on a single site. That being said, it is extremely unlikely that your performance issues come from the capabilities of Checkmk itself. Experience suggests it has to be a configuration issue or an underlying problem.

You can check the logs in $OMD_ROOT/var/log/; maybe they can give you a hint. If possible, open a ticket with our support. This is probably something that is easy to fix once one of our engineers can take a look at the environment.
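
For a first pass over those logs, a small script can pull out lines that look suspicious. The following is a minimal sketch, assuming it is run as the site user (so that `OMD_ROOT` is set) and that the keywords below are useful search terms; the keyword list and function name are made up for illustration, and only `*.log` files directly in that directory are scanned.

```python
import os
from pathlib import Path

KEYWORDS = ("error", "timeout", "critical")  # assumed search terms, adjust as needed


def find_suspicious_lines(log_dir: Path, keywords=KEYWORDS, max_hits: int = 20):
    """Return up to max_hits (file name, line) pairs whose line contains one of the keywords."""
    hits = []
    if not log_dir.is_dir():
        return hits
    for log_file in sorted(log_dir.glob("*.log")):
        with log_file.open(errors="replace") as handle:
            for line in handle:
                # Case-insensitive substring match against every keyword
                if any(word in line.lower() for word in keywords):
                    hits.append((log_file.name, line.rstrip()))
                    if len(hits) >= max_hits:
                        return hits
    return hits


if __name__ == "__main__":
    # Falls back to the current directory if OMD_ROOT is not set.
    log_dir = Path(os.environ.get("OMD_ROOT", ".")) / "var" / "log"
    for name, line in find_suspicious_lines(log_dir):
        print(f"{name}: {line}")
```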