Checks running in bursts causing OOM

Currently running 2.2p43, with 90 hosts to check.

Over the last few weeks we have been encountering OOM situations, mostly with rrdcached as the victim. Memory was increased from 4 GB to 5 GB, which has solved the problem for now, but I still see about 200 MB of swap in use.

The problem seems to be triggered by too many host-check Python processes running in parallel (/omd/sites/site/bin/python3 /omd/sites/site/var/check_mk/core/helper_config/latest/host_checks/HOSTNAME). Counting the processes each second, I see a pattern like this:

45–52 processes for 5 seconds
34, 18, 14 over the next 3 seconds
4–6 for 25 seconds
0 processes for 27 seconds
then the cycle restarts.
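For reference, a per-second count like the one above can be collected with a small shell loop; the pattern below is the helper path from my site, so adjust the site name for yours:

```shell
#!/bin/sh
# Print a timestamped count of running host-check helper processes
# once per second. The path pattern matches the helper invocation
# quoted above; "site" there is a placeholder for the real site name.
while true; do
    n=$(pgrep -fc 'var/check_mk/core/helper_config/latest/host_checks' || true)
    printf '%s %s\n' "$(date +%T)" "$n"
    sleep 1
done
```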

This adds up to roughly 450 seconds of total check runtime across all hosts, which is expected since quite a few agents have to collect a lot of data, but the distribution over time seems unhealthy.

So I wonder whether I'm missing a configuration option like “don’t run more than 20 checks in parallel”, which would be fully sufficient to scan all hosts every minute.

You might want to take a look at “Analyze configuration”, which gives you a quick overview based on your configuration and setup.

Setup → Maintenance → Analyze configuration

No red flags under “Analyze configuration”; only the Apache number of processes is in a warning state in the performance section (64 configured, 4 actually used).

Yes, don’t worry much about the Apache process count. Just to mention: there is a difference between the Enterprise Edition and the Raw Edition, the latter having high CPU load during check execution.

Anyhow, for production use I would recommend 4 CPUs / 8 GB as a baseline system. Your mileage may vary. You might already be monitoring your Checkmk server itself, which would give you a good overview of its memory usage.

So I interpret this as “there’s no option anywhere to limit concurrent check processes”, meaning I have to deal with up to (n hosts) × (40 MB per process) of memory.
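As a back-of-the-envelope check (using the numbers from this thread: 90 hosts and a rough 40 MB per helper, with 20 as a hypothetical cap):

```shell
#!/bin/sh
# Worst-case helper memory if all hosts are checked at once,
# versus a hypothetical cap of 20 concurrent helpers.
hosts=90
mb_per_helper=40          # rough RSS per Python helper, from this thread
cap=20                    # hypothetical concurrency limit

echo "uncapped worst case: $((hosts * mb_per_helper)) MB"
echo "capped at $cap:      $((cap * mb_per_helper)) MB"
```

On this box that is 3600 MB uncapped versus 800 MB capped, which matches the observed OOM pressure on a 4–5 GB system.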

There is an option, but only on the command line → ~/etc/nagios/nagios.d/timing.cfg
Adjust the option “max_concurrent_checks” to at most double the number of your CPU cores, or lower.
By default this value is set to 0, which means “start all the checks you want” :smiley:
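For reference, the resulting fragment might look like this (20 is just an example value; tune it for your hardware, and restart the core afterwards):

```cfg
# ~/etc/nagios/nagios.d/timing.cfg
# 0 = unlimited (the default); a small value smooths out the bursts.
max_concurrent_checks=20
```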


Thanks for the hint, haven’t been down to nagios for ages :wink:

I configured max_concurrent_checks=20 (running 4 CPUs, but those check processes mostly wait for the agent anyway), and now I see a much smoother distribution of 0 to 10 concurrent Python check processes.

Feels a lot saner to me now: CPU utilization is still at 20%, while the 15-minute load is down from 6 to 1.

Thanks for your support!
Best regards,
Andreas


@andreas-p & @andreas-doehler
Thanks for this question and answer.

My environment (old hardware) suddenly runs a lot smoother after configuring max_concurrent_checks=8.
The 15-minute CPU load dropped from 10 to 2, and average RAM usage from 60% to 20%.
Nice! (-:

Edit:
Wow!
It even solved an issue I had had for years, which I thought was a side effect of monitoring endpoints. In the morning, when the endpoints were booting up, lots of check_mk services would fail and needed to be manually rescheduled. I always assumed those endpoints weren’t ready to be monitored yet, while it was actually the other way around: the server was having OOM issues.