Checks running in bursts causing OOM

Currently running 2.2p43, with 90 hosts to check.

Over the last few weeks we have been encountering OOM situations, mostly with rrdcached as the victim. Memory was increased from 4 GB to 5 GB, which has solved the problem for now, but I still see about 200 MB of swap in use.

The problem seems to be triggered by too many host-check Python processes running in parallel (/omd/sites/site/bin/python3 /omd/sites/site/var/check_mk/core/helper_config/latest/host_checks/HOSTNAME). Counting the processes each second, I see a pattern like this:

45–52 processes for 5 seconds
34, 18, 14 over the next 3 seconds
4–6 for 25 seconds
0 processes for 27 seconds
then the cycle restarts.
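For reference, a per-second count like the one above can be collected with a small shell loop; the pattern below is the helper path from my site, so adjust the site name for yours:

```shell
#!/bin/sh
# Print a timestamped count of running host-check helper processes
# once per second. The path pattern matches the helper invocation
# quoted above; "site" there is a placeholder for the real site name.
while true; do
    n=$(pgrep -fc 'var/check_mk/core/helper_config/latest/host_checks' || true)
    printf '%s %s\n' "$(date +%T)" "$n"
    sleep 1
done
```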

This adds up to roughly 450 seconds of total check runtime across all hosts, which is expected since quite a few agents have to collect a lot of data, but the distribution over time seems unhealthy.

So I wonder whether I'm missing a configuration option like “don’t run more than 20 checks in parallel”, which would be fully sufficient to scan all hosts every minute.

You might want to take a look at “Analyze configuration”, which gives you a quick overview based on your configuration and setup.

Setup → Maintenance → Analyze configuration

No red flags under “Analyze configuration”; only the Apache number of processes is in a warning state in the performance section (64 configured, 4 actually used).

Yes, don’t worry much about the Apache process count. Just to mention: there is a difference between the Enterprise Edition and the Raw Edition, the latter having high CPU load during check execution.

Anyhow, for production use I would recommend 4 CPUs / 8 GB as a baseline system. Your mileage may vary. You might already be monitoring your Checkmk server itself, which would give you a good overview of its memory usage.

So I interpret this as “there’s no option anywhere to limit concurrent check processes”, meaning I have to deal with up to (n hosts) × (40 MB per process) of memory.
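As a back-of-the-envelope check (using the numbers from this thread: 90 hosts and a rough 40 MB per helper, with 20 as a hypothetical cap):

```shell
#!/bin/sh
# Worst-case helper memory if all hosts are checked at once,
# versus a hypothetical cap of 20 concurrent helpers.
hosts=90
mb_per_helper=40          # rough RSS per Python helper, from this thread
cap=20                    # hypothetical concurrency limit

echo "uncapped worst case: $((hosts * mb_per_helper)) MB"
echo "capped at $cap:      $((cap * mb_per_helper)) MB"
```

On this box that is 3600 MB uncapped versus 800 MB capped, which matches the observed OOM pressure on a 4–5 GB system.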

There is an option, but only on the command line → ~/etc/nagios/nagios.d/timing.cfg
Adjust the option “max_concurrent_checks” to at most double the number of your CPU cores, or lower.
By default this value is set to 0, which means “start all the checks you want” :smiley:
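For reference, the resulting fragment might look like this (20 is just an example value; tune it for your hardware, and restart the core afterwards):

```cfg
# ~/etc/nagios/nagios.d/timing.cfg
# 0 = unlimited (the default); a small value smooths out the bursts.
max_concurrent_checks=20
```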


Thanks for the hint, haven’t been down to nagios for ages :wink:

I configured max_concurrent_checks=20 (running 4 CPUs, but those check processes mostly wait for the agent anyway), and now I see a much smoother distribution of 0 to 10 concurrent Python check processes.

Feels a lot saner to me now: CPU utilization is still at 20%, while the 15-minute load is down from 6 to 1.

Thanks for your support!
Best regards,
Andreas


@andreas-p & @andreas-doehler
Thanks for this question and answer.

My environment (old hardware) suddenly runs a lot smoother after configuring max_concurrent_checks=8.
The 15-minute CPU load dropped from 10 to 2, and average RAM usage from 60% to 20%.
Nice! (-:

Edit:
Wow!
It even solved an issue I had had for years, which I thought was a side effect of monitoring endpoints. In the morning, when the endpoints were booting up, lots of check_mk services would fail and needed to be manually rescheduled. I always assumed those endpoints weren’t ready to be monitored yet, while it was actually the other way around: the server was having OOM issues.