By default, the raw edition has no limit on concurrently executed checks, so the scheduler may launch a large number of checks at once. This can overwhelm the available CPU cores and lead to load spikes, a sluggish web interface, and other unwanted effects.
The solution that is often proposed here in the forums is to limit the maximum concurrent checks. But:
- if you limit too much, you will end up with stale services
- if you limit too little, you’ll still have bursts and spikes in load
This is a short tutorial on how to set this limit and how to tune its value.
Disclaimer: I have done this on my instances and it worked well. I am not an expert; YMMV. Additions, corrections and suggestions welcome.
Simple method
- log in via SSH and become the site user (e.g. `omd su mysite`)
- edit `~/etc/nagios/nagios.d/timing.cfg` and set `max_concurrent_checks` to some number. A general recommendation here is twice the number of CPU cores.
- run `omd restart nagios` to apply the change.
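For reference, the relevant line in `timing.cfg` might look like this (the value 40 is just an example, assuming a 20-core machine; in Nagios, 0 means no limit):

```
# ~/etc/nagios/nagios.d/timing.cfg
# allow at most 40 checks to run at the same time (2 x 20 cores)
max_concurrent_checks=40
```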
This should hopefully fix any overload issues you’ve had. If this was all you wanted to achieve, you can stop reading here.
Further fine-tuning
Maybe you don’t have a lot of checks and want to distribute the load even more evenly over your check interval, or maybe you just want to test out the limits of your machine (or how much headroom you have left). Here’s how you can do this:
First, keep an eye on the staleness of services. To do that, again as SSH site user, run this command:
watch "uptime; echo -n 'stale services: '; lq 'GET services\nStats: staleness >= 1.5\nFilter: host_state = 0\nFilter: scheduled_downtime_depth = 0\nFilter: host_scheduled_downtime_depth = 0'; echo 'services by staleness:'; lq 'GET services\nColumns: staleness host_name description\nFilter: host_state = 0\nFilter: scheduled_downtime_depth = 0\nFilter: host_scheduled_downtime_depth = 0\nFilter: staleness >= 1' | sort -hr"
This runs Livestatus queries and will produce output like this:
18:16:38 up 1 day, 10:13, 5 users, load average: 6.05, 5.98, 6.61
stale services: 0
services by staleness:
1.11667;myhost01;Check_MK
1.05833;myhost02;debian_packages
1.05833;myhost03;zpool status
You can see the load average, the number of stale services (should correspond to the number in the Checkmk sidebar’s “overview” snap-in) and a list of services having a staleness value of at least 1.0. By default, a staleness value of >= 1.5 is considered stale by Checkmk (for details on this, see end of this post).
Optional: open another SSH window and run htop to see the utilization of your CPU cores.
If you don’t have any (unexpected) stale services, decrease the max_concurrent_checks even further. This will smooth out the CPU usage. The more you decrease the limit, the more you push some checks towards the end of the check interval time window. At some point, this will raise the staleness numbers in the watch command output.
Keep decreasing it, and at some point, you will start to see stale services that won’t go away after a few minutes. This is the point where the number of max concurrent checks is too low for your check interval.
Now raise it again until the number of stale services drops to 0 (optimally, raise it until the watch output no longer constantly shows services above ~1.1 staleness - everything in the list is technically overdue, but 1.0-1.1 is still fine imho), and you will have distributed your checks pretty evenly over the check interval!
Analyzing the results
If your machine has 20 cores and hence the recommended value would be 40, but you were able to lower the limit to 15, you now know you still have a lot of headroom. On the other hand, if you already get a few stales at 35, your machine probably can’t handle a lot more. Another indicator for this could be the CPU usage (both in htop and the graph in your CMK).
If you leave the `max_concurrent_checks` limit set at its lowest reasonably usable value like this, keep in mind that over time, as you add new hosts and services to the monitoring, you might need to raise it again to avoid stales.
If you’d rather not change this number every now and then, stick to the “2x CPU cores” rule of thumb.
I wouldn’t necessarily call this method a “best practice”, just as an interesting experiment - and it helps keep more CPU headroom for other tasks, such as using the web interface. Activation times also went down a bit for me after doing this.
Tips & explanations
What is staleness?
The staleness value tells you how old check data is in relation to the check interval. For example, if your Normal check interval for service checks is 60 seconds, meaning checks are supposed to be run every 60 seconds, a staleness of 1.5 would mean that the check data is 90 seconds old (1.5x the check interval).
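As a quick sanity check, the ratio is simply the age of the check data divided by the check interval (both in seconds; the numbers below are just the example values from above):

```shell
# staleness = age of newest check result / check interval
# e.g. data that is 90 s old with a 60 s check interval:
awk 'BEGIN { printf "%.2f\n", 90 / 60 }'   # prints 1.50
```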
I have set it to 2x my CPU cores but still have stales
If you are sure that the reason for your stales is the performance of your CMK machine, you can try raising the limit even further. This might not solve your problem though, because there is only so much your hardware can handle. You might simply have too many checks!
In this case, if throwing more powerful hardware at CMK is not an option, go to Setup ➜ Service monitoring rules ➜ Normal check interval for service checks and raise the interval. Try 2 minutes instead of 1.
I have only a few stales every now and then, but I don’t want to increase my check interval
Look at the list of stales in the watch command earlier in this post. If the staleness values are barely over 1.5 and you don’t want to increase the max_concurrent_checks limit (or raising it doesn’t help anymore), and increasing your interval is not an option:
In Setup ➜ Global settings ➜ Staleness value to mark hosts / services stale, you can define from which staleness value a host or service should be considered stale by Checkmk. The default is 1.5. You can raise this to decrease the number of stale services in your instance, but keep in mind that this is purely cosmetic. It won’t solve the underlying problem, it will just hide the symptoms.
If you notice that the stale services you observe are always of the same type, and you know they are checks that take a long time to run, consider running them asynchronously (less frequently), e.g. by moving them into interval subfolders (for on-client checks in the local or plugins folders), adding an interval in mrpe.cfg (for MRPE checks), or creating dedicated service check interval rules for them.
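For the MRPE case, a cached entry can look roughly like this (the check name and plugin path are made up for illustration; the interval is in seconds):

```
# /etc/check_mk/mrpe.cfg on the monitored host
# run this plugin only every 5 minutes and serve cached output in between
Foo_Check (interval=300) /usr/lib/nagios/plugins/check_foo
```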