Performance Tuning CEE 2.1

CMK version: CEE 1.6 → 2.0 → 2.1p17
OS version: CentOS 7

I should preface this by saying that performance tuning (via “Maximum concurrent Check_MK checks” and similar settings) was not a problem for me under 1.6. We get full check rates on all our distributed sites, and our main site (which my test site is cloned from) is currently running over 300 services per second. I have also checked the usual system metrics: CPU usage is low, RAM usage is high but not pushing into swap, and disk IO looks reasonable.
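For the record, this is the kind of spot check I mean - a rough sketch reading /proc directly on a Linux site server (the device-name prefixes are an assumption; tools like sar or atop give a much fuller picture):

    #!/usr/bin/env python3
    # Quick spot check of load, memory/swap and raw disk counters on a Linux box.
    import os

    # Load average vs. number of logical CPUs.
    load1, load5, load15 = open("/proc/loadavg").read().split()[:3]
    print(f"load average: {load1} {load5} {load15} (logical CPUs: {os.cpu_count()})")

    # Memory: is the site being pushed into swap?
    meminfo = {}
    for line in open("/proc/meminfo"):
        key, _, value = line.partition(":")
        meminfo[key.strip()] = value.strip()
    for key in ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree"):
        print(f"{key}: {meminfo.get(key)}")

    # Disk IO: raw completion counters per device
    # (fields[3] = reads completed, fields[7] = writes completed;
    # sample twice and diff to turn these into rates).
    for line in open("/proc/diskstats"):
        fields = line.split()
        if fields[2].startswith(("sd", "nvme", "vd")):
            print(f"{fields[2]}: reads={fields[3]} writes={fields[7]}")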

After upgrading to 2.1, “Checker helper usage” gets pinned to the max and the service check performance is trash.

Service checks:	3.44/s
Host checks and smart pings:	53.37/s
Livestatus-connections:	0.41/s
Average check latency:	0.00s
Average Checkmk latency:	0.00s
Average fetcher latency:	30.2ms
Check helper usage:	2.2%
Fetcher helper usage:	70.2%
Checker helper usage:	100.0%

(Note: one of those latencies read “10h” when I first started poking at it; I think it looks low here only because the site hasn’t had time to develop problems during my testing.)

Increasing “Maximum concurrent Checkmk checkers” alleviates this:

Service checks:	172.87/s (Note: this is about the correct rate, I lowered the check intervals for the test site)
Host checks and smart pings:	49.57/s
Livestatus-connections:	0.19/s
Average check latency:	0.00s
Average Checkmk latency:	0.00s
Average fetcher latency:	0.00s
Check helper usage:	39.1%
Fetcher helper usage:	26.2%
Checker helper usage:	17.9%

but now activating changes shows the warning:

check_mk: The number of configured checkers is higher than the number of available CPUs. To avoid unnecessary context switches, the number of checkers should be limited to the number of CPUs. Recommended number of checkers: 4

Fetcher helper usage now also seems quite high at times, often 90+%, at which point the service check rate tanks again for a short period.

What’s going on here? How am I supposed to tune the settings to deal with each of these percentages?

While the general rule is not to have more checkers than cores, you can set it higher than that, as long as you do not end up with context switches reducing overall performance. The message is thus a precaution to keep users from unknowingly making problematic decisions.

In the end, try to keep usage below 70% for both the checker and fetcher helpers, and adapt the number of helpers accordingly. If your CPU can handle it, you are fine. Please don’t go crazy increasing the checkers, though.

EDIT: Based on a call with Gerd, I am withdrawing my statement regarding “time-consuming” checks :slight_smile: That’s what Gerd mentions in his reply below.

Hi Martin,

I’m trying to think what kind of checks (not classical Nagios checks, but Checkmk checks, which are processed by the checker helper processes) would profit from having more checker helpers than CPUs. Can you give an example, or did I misunderstand something?

Gerd

There are several reasons for high checker and/or fetcher usage.
I doubt that your environment is big enough to justify the performance you are describing.

But pinpointing the exact issue might prove to be painful.

Does your setup contain some clustered hosts?
We are currently fixing some serious performance issues related to clustered service rules.

We do have about 10 of those “cluster hosts”, yes, each drawing a few services from a pair of hosts.

150 services/s, 30k services, 700 hosts, in the scope of my test site. Doesn’t seem unreasonable, no.

Heh. I’ve had it running overnight with the “Fetcher helper usage” at 30-70% (it swings quite a bit, probably because of my low check rate) by turning “Maximum concurrent Checkmk checkers” up to… 24. Might be what you’d call crazy? 2500 context switches per second - I don’t know at what point that figure becomes a problem, but it’s far from the highest I’ve seen on machines with any real workload.

The checker pool provides you with potential concurrent Checkmk check executions (one Checkmk host can be processed per checker at a time). Scaling up the pool increases the number of concurrent checks you can execute at the cost of memory, since every checker consumes some. And of course, whenever the pool is fully utilized, it consumes CPU power.

As long as your system has enough CPU and memory capacity, you are totally free to scale the pool up and to configure it in a way that gives you some headroom in pool usage. The general advice to tune the pool utilization to ~70-80% is a good rule of thumb, I think, provided your system has the resources available to do so.
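To make that concrete with a rough back-of-the-envelope calculation (Little’s law: busy checkers ≈ completed hosts per second × checker time per host; the input numbers below are made up purely for illustration):

    # Rough checker-pool sizing sketch. Both inputs are illustrative assumptions,
    # not values measured in this thread.
    host_results_per_second = 12.0     # hosts whose fetched data finishes per second
    checker_seconds_per_host = 0.25    # CPU time a checker spends computing one host

    busy_checkers = host_results_per_second * checker_seconds_per_host  # ~3 busy on average
    target_utilization = 0.7           # the ~70-80% rule of thumb from above

    pool_size = busy_checkers / target_utilization
    print(f"average busy checkers: {busy_checkers:.1f}")        # 3.0
    print(f"pool size for ~70% utilization: {pool_size:.1f}")   # ~4.3, i.e. 4-5 checkers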

Regarding the fairly new message (added in 2.1.0p10):

check_mk: The number of configured checkers is higher than the number of available CPUs. To avoid unnecessary context switches, the number of checkers should be limited to the number of CPUs. Recommended number of checkers: 4

Currently I have the impression that this measurement and recommendation do not fit reality in many cases. Comparing these numbers might give a first guesstimate, but it seems to be way off (at least in your case). We’ll discuss this internally.

The assumption behind the message is that the checkers can compute check results continuously and therefore consume computation time continuously. The checkers are designed to be CPU bound. However, in reality there is still IO involved (CMC <> checker communication, disk IO for counter persistence, and a few more), so there may be times where a checker is not fully utilizing a CPU core. In that situation, it is totally fine to have some other checker processes available which jump in to do their work and keep the CPU better utilized.

As I wrote earlier, I would recommend having a look at the actual a) demand of Checkmk (pool utilization) and b) capability of the system (resource utilization), and then tuning the pool size based on that. Comparing the numbers should be fairly straightforward and should lead to a better decision.
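If you want to watch those numbers outside the GUI, one option is a small Livestatus query against the status table from inside the site. A minimal sketch - the socket path is the OMD default, and the column names are what the 2.1 CMC uses as far as I know; verify with a “GET columns” query if yours differ:

    #!/usr/bin/env python3
    # Dump fetcher/checker helper usage from the CMC via Livestatus (run as the site user).
    import os
    import socket

    sock_path = os.path.expanduser("~/tmp/run/live")   # default OMD Livestatus socket

    # Column names assumed from the 2.1 status table; check "GET columns" if unsure.
    query = (
        "GET status\n"
        "Columns: helper_usage_fetcher helper_usage_checker\n"
        "\n"
    )

    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(query.encode())
        s.shutdown(socket.SHUT_WR)      # signal end of query so the CMC answers
        data = b""
        while chunk := s.recv(4096):
            data += chunk

    fetcher, checker = data.decode().strip().split(";")
    # The GUI renders these as percentages; here we just print the raw values.
    print("fetcher helper usage:", fetcher)
    print("checker helper usage:", checker)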

This is more or less the philosophy I had with tuning “Maximum concurrent Check_MK checks” in 1.6, with good results. Lots of RAM used, but performance is good. I’d do the same with this setting, only it very specifically warns me not to.

And… seems I might not have to? I need to restart my test process I think. I’ve been fiddling and I’m back down to the recommended 4 helper processes and everything is smooth somehow.

I did also upgrade to 2.1.0p20 - not sure if there were performance changes between p17 and p20.

This might all turn out to be a big nothing.

Just a small comment from my experience with what are by now many 2.1 systems.

  • Maximum concurrent Checkmk fetchers - this is the real lever for tuning the system; how many you need depends on the mix of devices (SNMP/agent) and on the runtime for every device
    • One example - 2000 hosts, 1-minute check interval, 30 fetchers, 4 checkers, check_mk runtime between 500 ms and 1.5 seconds for all hosts - the site performs 350 host checks per second and 1400 services per second at 60% fetcher usage and 30% checker usage (see the sizing sketch after this list)
    • if you have many SNMP hosts you need far more fetchers than those 30 to get all hosts checked within one minute
  • Maximum concurrent Checkmk checkers - normally not a problem, even on my bigger systems
  • Maximum concurrent active checks - only important if you have many active checks
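Here is that sizing sketch - the same Little’s-law style arithmetic applied to the example numbers above; the 3-second SNMP runtime at the end is an assumed value purely for illustration:

    # Back-of-the-envelope check of the example above:
    # busy fetchers ≈ fetch rate × average check_mk runtime per host.
    hosts = 2000
    check_interval_s = 60.0
    configured_fetchers = 30
    reported_fetcher_usage = 0.60

    fetch_rate = hosts / check_interval_s                         # ~33.3 fetches per second
    busy_fetchers = configured_fetchers * reported_fetcher_usage  # ~18 busy on average
    implied_runtime = busy_fetchers / fetch_rate                  # ~0.54 s per host
    print(f"implied average runtime: {implied_runtime:.2f}s")     # within the quoted 0.5-1.5 s

    # Flip it around for SNMP-heavy sites: longer runtime per host at the same rate
    # means proportionally more fetchers. 3 s average runtime is an assumption.
    snmp_runtime = 3.0
    needed = fetch_rate * snmp_runtime / 0.7                      # target ~70% utilization
    print(f"fetchers needed at 3s average runtime: {needed:.0f}")  # ~143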

Everything else is as @LaMi wrote.

Possibly a dumb question here, but the message complains based on CPU count, while it recommends based on core count. So with 2 cores per CPU, which recommendation are we to go with?

In my case, should I go with 4 × 2 = 8 cores, minus 1 = 7 checkers? Or one per CPU, as per the message when I try to activate the changes?

The assumption is that when the checker runs, all data is available locally and block IO is not a bottleneck. In theory, checkers consume CPU and never have to wait for IO. So the number of checkers should be roughly equal to the number of physical cores available. (There can be situations where a slightly higher number, still below the number of threads offered by the CPU, slightly improves performance.)

I guess we have to adjust the message shown to be more clear on this.
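If you are unsure what the machine actually presents, here is a quick sketch for comparing logical CPUs with physical cores on Linux (parsing /proc/cpuinfo; on a VM this only reflects the virtual topology the hypervisor exposes):

    #!/usr/bin/env python3
    # Compare logical CPUs with physical cores on Linux.
    import os

    logical = os.cpu_count()

    # Physical cores = unique (physical id, core id) pairs in /proc/cpuinfo.
    cores = set()
    phys_id = core_id = None
    for line in open("/proc/cpuinfo"):
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            phys_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
        elif key == "" and phys_id is not None and core_id is not None:
            cores.add((phys_id, core_id))
            phys_id = core_id = None

    physical = len(cores) or logical   # fall back if the topology fields are missing
    print(f"logical CPUs:   {logical}")
    print(f"physical cores: {physical}")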

…the hosts are VMware VMs, so the actual physical cores are completely obfuscated. Say the virtual machine is set to 4 CPUs with 2 cores each.
IIRC, even on a dual-CPU machine with 2 cores per CPU those are actual cores as well; only beyond that does hyperthreading come into play.
In both cases, then, whether a VM set to 4 CPUs/2 cores or a physical 4 CPU/2 core box, the number would be 8, correct?

Yes, in this case you start with eight.

I might have misread, but 8 would be more than the maximum recommended number. The recommendation is “number of cores - 1” as the maximum number of checkers. But you want to start small and increase gradually as needed.

The message was adjusted, either already for 2.2 or in 2.3.
