CheckMK Performance

CMK version: 2.2.0p5 Enterprise Edition
OS version: Debian 11

Hey people,

We have invested a lot of time in improving the performance of Checkmk. The installation is relatively small, with 4000 services.
We need a service check rate of 46 checks/s, but Checkmk often cannot achieve this. This often leads to false alerts: a host is reported as down even though it is not.
We have already adjusted and tested a few values in the global settings, but nothing solves the problem.

Our settings:

Maximum concurrent active checks 100

Maximum concurrent real-time checks 12

Maximum concurrent Livestatus connections 20

Maximum concurrent Checkmk fetchers 13

Maximum concurrent Checkmk checkers 12

Unfortunately, the standard settings don’t help here.

If anyone has an idea or improvement, please share it.

Checkers should be about the number of physical cores, so 12 might be good. 4000 services implies 80 to 200 hosts. If some of those are slow, they block fetchers, so double the number of fetchers as a start.
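The rule of thumb above can be put into a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, the range of 20 to 50 services per host is simply what the "80 to 200 hosts" estimate implies, and `os.cpu_count()` reports logical CPUs, which can be more than the number of physical cores.

```python
import os


def estimate_hosts(services, per_host_low=20, per_host_high=50):
    """Return a (min, max) host estimate for a given number of services.

    The 20-50 services-per-host range is an assumption derived from the
    "4000 services implies 80 to 200 hosts" estimate in this thread.
    """
    return services // per_host_high, services // per_host_low


def suggest_workers(current_fetchers, cores=None):
    """Starting values only: checkers ~ core count, fetchers doubled."""
    if cores is None:
        # Logical CPUs; this may overstate the physical core count.
        cores = os.cpu_count() or 1
    return {"checkers": cores, "fetchers": current_fetchers * 2}


if __name__ == "__main__":
    print(estimate_hosts(4000))           # (80, 200)
    print(suggest_workers(13, cores=12))  # {'checkers': 12, 'fetchers': 26}
```

These are starting points to test against the fetcher and checker graphs, not final values.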

And please update to the latest patch level; there have been performance improvements.

Hi @Christian1

can you please share the Checkmk server specs? (CPU, RAM, SSD or HDD)

4000 services is nothing for a regular Checkmk installation and should not lead to performance issues. That’s why it is important to know which resources are available to the Checkmk server.

Thanks in advance!

Norm

Thanks, I tested with 25 checkers and more.

If the server completes all checks in time, shouldn't the scheduled check rate always be 100% or more?
At the moment it oscillates between 70% and 114%.
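To see what those percentages mean in absolute numbers, here is a small calculation. It assumes the displayed percentage is simply the achieved rate divided by the required rate, which may not be exactly how Checkmk computes it; the function names are illustrative.

```python
def achieved_rate(required_per_s, percent):
    """Checks per second actually executed at a given percentage of the target."""
    return required_per_s * percent / 100.0


def average_interval(services, rate_per_s):
    """Seconds needed to run every service once at the given rate."""
    return services / rate_per_s


if __name__ == "__main__":
    required = 46.0   # checks/s, from the original post
    services = 4000   # from the original post
    for pct in (70, 100, 114):
        rate = achieved_rate(required, pct)
        seconds = average_interval(services, rate)
        print(f"{pct:>3}% -> {rate:.1f} checks/s, one full pass in about {seconds:.0f} s")
```

At the required 46 checks/s, 4000 services correspond to an average check interval of roughly 87 seconds.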

Have a look at the fetchers; they are more often the bottleneck and depend heavily on your environment (network, slow check plugins…).

The server is a 12-core VM with 8 GB RAM, of which 5 GB is used. The host system runs on SSD drives. The CPU is an old one: an Intel Xeon E5 2650 at 2 GHz.
Could that be the problem?

If the CPU utilization is constantly close to 100%, of course. But I doubt this.

Have a look at the graphs for checker/fetcher utilization (the graphs of the service "OMD sitename performance", where sitename is the name of your site) to find out whether the bottleneck is there. You might post screenshots of these graphs here.

The server load is constant at 2.5.

Look directly at the graphs to see the historic fetcher and checker usage.
But I doubt you will have to scale the checkers. In my experience, 2-6 should be sufficient for your size. The fetchers, however, you could increase, though they shouldn't have a high impact on load.
Can you check which processes consume so much on that machine?

The checker and fetcher usage is always <1%.
I have updated Checkmk to 2.2.0p12 and raised the fetcher and active check values considerably. I still have to check whether there are false positive messages.
But I still wonder why the server cannot always reach 100% of the required service check rate.

In that case, the number of checkers and fetchers is too high.

So, performance-wise everything looks good, but hosts still occasionally go down? Checkmk uses “smart ping” to check whether hosts can be reached. This should not be a big deal, just more ICMP traffic than normal. However, we have seen several environments where virtualization was responsible for the problems, because an emulated adapter such as the Intel e1000 could not handle the large number of interrupts caused by smart ping. So please check which emulated Ethernet adapters are in use.
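One way to see which driver the Checkmk VM's network interfaces use is to read the driver link in sysfs on the Linux guest. The sketch below assumes a Linux system with `/sys/class/net` available; the function names are illustrative. A result of `e1000` would indicate the emulated Intel adapter mentioned above, while `virtio_net` (KVM) or `hv_netvsc` (Hyper-V) would indicate a paravirtualized adapter.

```python
import os


def nic_driver(interface, sysfs_root="/sys/class/net"):
    """Return the kernel driver bound to a network interface, or None."""
    link = os.path.join(sysfs_root, interface, "device", "driver")
    if not os.path.exists(link):
        # Loopback and purely virtual devices have no driver link.
        return None
    return os.path.basename(os.path.realpath(link))


def list_nic_drivers(sysfs_root="/sys/class/net"):
    """Map every interface name found under sysfs_root to its driver name."""
    if not os.path.isdir(sysfs_root):
        return {}
    return {
        name: nic_driver(name, sysfs_root)
        for name in sorted(os.listdir(sysfs_root))
    }


if __name__ == "__main__":
    for name, driver in list_nic_drivers().items():
        print(f"{name}: {driver or 'no driver link'}")
```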

And another hint from a colleague: please check whether you can ping the hosts in question from the Checkmk server. On Windows, ICMP echo is off by default. If smart ping fails but agent output can be retrieved, Checkmk (correctly) assumes the host is up. But if the agent output is delayed for some reason, these hosts might be reported as down for one check interval.
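A quick way to run this test for several hosts at once is a small wrapper around the system `ping`. This is a sketch that assumes the Linux iputils `ping` (flags `-c` for the packet count and `-W` for the reply timeout in seconds) is installed on the Checkmk server; the function names are illustrative, and the host names are passed on the command line.

```python
import subprocess
import sys


def ping_command(host, count=1, timeout_s=2):
    """Build an iputils-style ping command (Linux flags -c and -W)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_pingable(host):
    """Return True if the host answered at least one ICMP echo request."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Pass the hosts that were reported as down as arguments.
    for host in sys.argv[1:]:
        print(host, "reachable" if is_pingable(host) else "no ICMP reply")
```

Running it as the site user on the Checkmk server itself matters, because a ping from a workstation takes a different network path.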

Thanks.
Yes, the hosts are all pingable.

At the moment I’m following up on the assumption that the virtual interface could be the problem. I moved the VM from a KVM host to a Hyper-V host.

What is it actually like on other Checkmk systems?
Is the service check rate always >100% or does it also fluctuate?

Checkmk, especially in the enterprise editions with the Checkmk Micro Core, performs exceptionally well. There are customers out there monitoring between 500,000 and 1,000,000 services with thousands of hosts on a single site. That being said, it is extremely unlikely that your performance issues come from the capabilities of Checkmk itself. Experience suggests that it is a configuration issue or an underlying problem.

You can check the logs in $OMD_ROOT/var/log/; maybe they can give you a hint. If possible, open a ticket with our support. This is probably something that is easy to fix once one of our engineers can take a look at the environment.
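To get a first overview of which log files are worth reading, a short script can count suspicious lines per file. This is a rough sketch: the keyword list is an assumption rather than an official list of Checkmk log markers, the function name is illustrative, and it expects to run as the site user so that `OMD_ROOT` is set.

```python
import os
from collections import Counter
from pathlib import Path

# Assumed keywords; adjust them to whatever the logs actually contain.
KEYWORDS = ("error", "crit", "timeout")


def scan_logs(log_dir, keywords=KEYWORDS):
    """Count lines per *.log file that contain any keyword (case-insensitive)."""
    hits = Counter()
    for path in sorted(Path(log_dir).rglob("*.log")):
        try:
            with open(path, errors="replace") as handle:
                for line in handle:
                    lowered = line.lower()
                    if any(word in lowered for word in keywords):
                        hits[str(path)] += 1
        except OSError:
            continue  # Unreadable entry, skip it.
    return hits


if __name__ == "__main__":
    omd_root = os.environ.get("OMD_ROOT")
    if omd_root is None:
        print("OMD_ROOT is not set; run this as the site user.")
    else:
        log_dir = Path(omd_root) / "var" / "log"
        for path, count in scan_logs(log_dir).most_common(10):
            print(f"{count:6d}  {path}")
```

The counts only point to files to open first; the actual messages still need to be read in context.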
