CheckMK runs pretty slow when second site is turned on

CMK version: 2.2.0p11
OS version: Debian 12 Bookworm

Hi. Still trying to figure out why our CheckMK instance gets slower and slower as time passes. Original Post here.

We have one main (~60 hosts) site which causes no issues when running on its own, and one problematic site (~40 hosts).

After a reboot, both sites run very well. They're snappy, the load is balanced across the CPU cores, and there's always a free CPU when spikes happen for background processes (mainly checks). That doesn't last, though: after about an hour or two of letting them do their thing, idle CPU load is significantly higher (sitting at about 20-40% while not running checks), and the spikes end up pegging all cores at 100%.

To rule out a faulty site, we took the entire "slow" site down and instead created a fresh site from scratch, adding just a couple of hosts to see how it would run. Despite holding what is practically 1/10th of the slow site's hosts, this new site also starts to overload the CPU cores after running for about an hour. Still no clue as to what could be happening. Any ideas?

In your slow site you should take a look at the execution time of the check_mk service. Are there services that take longer than a few seconds?
Did you make any modifications to the Nagios core settings, like a longer maximum execution time?

In the new site, everything looks perfectly fine. Execution times are about 0.5-1s, and no modified settings whatsoever → all defaults.

In the previous slow site, the exec times ranged from 0.5s to 12s, mostly averaging around 3-5s.

If the check time is fine I would check what consumes my CPU time.
As you use the Raw Edition, the Nagios process should be one of the bigger consumers of CPU. What else do you see there?
You wrote that you have 40 hosts on the problematic site. With a 1-second execution time for the check_mk service, a system with 2 CPU cores should be fully relaxed. Please check that the check interval of the Check_MK Discovery and hardware/software inventory services really is a few hours. I had some systems where these two services were executed every minute. This can "kill" a small system.
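If you want to verify those intervals without clicking through every host, one option (assuming OMD's `lq` Livestatus helper is available to the site user) is a query along these lines, where `check_interval` is reported in minutes:

```
GET services
Columns: host_name description check_interval
Filter: description = Check_MK Discovery
```

Pipe it through `lq` as the site user; any host where the interval comes back as 1 instead of 120+ is a candidate for the runaway scheduling described above.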

No HW/SW Inventories to be checked. Discovery runs every 2h, now set it to 4h.

What makes the CPU spike is a process called `omd config show CORE`, as well as rrdcached.

A rrdcached problem points to a very slow storage.
I/O problems are a valid reason for a “slow” CMK system.

Also seems to not be a problem. Storage is more than fast enough to keep up with the requirements. Besides, from what we can see, the CPU load is way too high. I believe that if the issue was with the I/O, we’d see a lot of time wasted idling which isn’t the case here… :frowning:

Then you should also see some processes consuming the CPU. If not, high load without visible CPU consumers also points to I/O problems. If CPU usage is genuinely high, then it is something different.
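A quick way to tell the two apart from the shell (nothing CheckMK-specific here; these counters come straight from the Linux kernel) is to look at the iowait and idle tick counters in /proc/stat:

```shell
#!/bin/bash
# The first line of /proc/stat holds aggregate CPU tick counters since boot:
#   cpu  user nice system idle iowait irq softirq ...
read -r _ user nice system idle iowait _ < /proc/stat
echo "user=${user} idle=${idle} iowait=${iowait}"
# A large, growing iowait relative to user+system suggests the cores are
# stalled waiting on I/O; high user/system with low iowait means real CPU work.
```

Taking two samples a few seconds apart and diffing the counters gives you the recent mix, which is essentially what tools like vmstat do.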

We moved CMK to a separate server with an SSD in it. Sadly, the problem persists :frowning: . CPU usage is high. Even with one site, the Service Speed-O-Meter drops to 30-40% while the service checks are running.

I’ll highlight again that we had the exact same configuration on version 1.5.0, which ran extremely smoothly on a single CPU. For 2.2.0 we have bumped it up to 4 CPUs and are still struggling hard.

We believe that the parsing of the information from the Hosts might be what’s taking so long. Could that be it?

This actually sounds a little odd - how long and how high are those spikes?

`omd config show CORE` - AFAIK that only reads the settings in ~/etc/omd/site.conf. I wouldn’t expect this process to show up at all when looking for CPU spikes.

How long does it take if you run it manually? And what does it return? Would you mind sharing the site.conf file?

They last about 2-4 seconds, at 100% usage across all 4 CPUs.

Problem-causing site:

CONFIG_ADMIN_MAIL=''
CONFIG_AGENT_RECEIVER='on'
CONFIG_AGENT_RECEIVER_PORT='8002'
CONFIG_APACHE_MODE='own'
CONFIG_APACHE_TCP_ADDR='127.0.0.1'
CONFIG_APACHE_TCP_PORT='5001'
CONFIG_AUTOSTART='off'
CONFIG_CORE='nagios'
CONFIG_CRONTAB='on'
CONFIG_DEFAULT_GUI='check_mk'
CONFIG_DOKUWIKI_AUTH='off'
CONFIG_LIVESTATUS_TCP='off'
CONFIG_LIVESTATUS_TCP_ONLY_FROM='0.0.0.0 ::/0'
CONFIG_LIVESTATUS_TCP_PORT='6559'
CONFIG_LIVESTATUS_TCP_TLS='on'
CONFIG_MKEVENTD='on'
CONFIG_MKEVENTD_SNMPTRAP='off'
CONFIG_MKEVENTD_SYSLOG='off'
CONFIG_MKEVENTD_SYSLOG_TCP='off'
CONFIG_MULTISITE_AUTHORISATION='on'
CONFIG_MULTISITE_COOKIE_AUTH='on'
CONFIG_NAGIOS_THEME='classicui'
CONFIG_NAGVIS_URLS='check_mk'
CONFIG_NSCA='off'
CONFIG_NSCA_TCP_PORT='5667'
CONFIG_PNP4NAGIOS='on'
CONFIG_TMPFS='on'

For the original site:

CONFIG_ADMIN_MAIL=''
CONFIG_AGENT_RECEIVER='on'
CONFIG_AGENT_RECEIVER_PORT='8001'
CONFIG_APACHE_MODE='own'
CONFIG_APACHE_TCP_ADDR='127.0.0.1'
CONFIG_APACHE_TCP_PORT='5002'
CONFIG_AUTOSTART='on'
CONFIG_CORE='nagios'
CONFIG_CRONTAB='on'
CONFIG_DEFAULT_GUI='check_mk'
CONFIG_DOKUWIKI_AUTH='off'
CONFIG_LIVESTATUS_TCP='off'
CONFIG_LIVESTATUS_TCP_ONLY_FROM='0.0.0.0 ::/0'
CONFIG_LIVESTATUS_TCP_PORT='6557'
CONFIG_LIVESTATUS_TCP_TLS='on'
CONFIG_MKEVENTD='on'
CONFIG_MKEVENTD_SNMPTRAP='off'
CONFIG_MKEVENTD_SYSLOG='off'
CONFIG_MKEVENTD_SYSLOG_TCP='off'
CONFIG_MULTISITE_AUTHORISATION='on'
CONFIG_MULTISITE_COOKIE_AUTH='on'
CONFIG_NAGIOS_THEME='classicui'
CONFIG_NAGVIS_URLS='check_mk'
CONFIG_NSCA='off'
CONFIG_NSCA_TCP_PORT='5667'
CONFIG_PNP4NAGIOS='on'
CONFIG_TMPFS='on'
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 443870162 nanoseconds.
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 446501496 nanoseconds.
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 434240744 nanoseconds.
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 445244880 nanoseconds.
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 452490478 nanoseconds.
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 437456479 nanoseconds.
OMD[username]:~/testing$ bash script.sh
nagios
Execution time was 417073660 nanoseconds.
OMD[username]:~/testing$

The script calls `omd config show CORE` and times it. Note: it is not consistently this low; sometimes it takes 5-6 s to return.
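The actual script.sh was not shared, but a minimal reconstruction that would produce output in this format might look like the following (the nanosecond arithmetic and argument handling are assumptions):

```shell
#!/bin/bash
# Times an arbitrary command in nanoseconds, e.g.:
#   bash script.sh omd config show CORE
start=$(date +%s%N)
"$@"                        # run whatever command was passed on the command line
end=$(date +%s%N)
elapsed=$((end - start))
echo "Execution time was ${elapsed} nanoseconds."
```

Note that ~0.44 s per invocation is mostly Python interpreter start-up for the `omd` command itself, so occasional 5-6 s runs would point at the whole system stalling rather than at `omd config` doing real work.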

On such a small system the “Speed-O-Meter” cannot really be used. It simply divides the number of services by 60 to get the number of checks that should run per second; it cannot know how many services each host has, or whether the hosts’ schedules are evenly distributed over the one-minute check interval.
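To put numbers on that division (the service count here is a made-up example): with 40 hosts averaging around 25 services each, the meter expects a perfectly even ~16.7 checks per second, which a burst-scheduled core will never show:

```shell
# Hypothetical figures: 40 hosts x ~25 services = 1000 services total.
services=1000
# The Speed-O-Meter's assumption: services spread evenly over 60 seconds.
awk -v s="$services" 'BEGIN { printf "expected checks/s: %.1f\n", s / 60 }'
```

Any clustering of checks within the minute makes the measured rate swing well above and below that average, so a low percentage on a small site is noise, not evidence of a problem.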

You cannot compare 1.5 with 2.2. There are worlds of complexity in between.

I/O problems don’t need to be storage problems - I/O also includes network wait times and so on.

For a further understanding of your problem some screenshots for running processes and some load and utilization graphs would be good.

Understood. Will keep those things in mind.

On the other hand, I have noticed in /var/log/syslog that the timedated daemon keeps getting activated and deactivated over and over again, every minute. Is that normal?

This is CPU util overnight:

It looks normal in the graph, but there are large spikes where the entire system freezes and stutters for a few seconds. I would guess those aren’t captured in the graph, as it samples only a single moment either before or after the spike.

Here is the CPU when “idling”:

I can’t get a screenshot of the spikes currently but it’s mostly omd config show CORE processes alongside processes running the checks.

I think you are seeing a typical Nagios core problem: bad scheduling of service checks.
There are some options inside the Nagios config files that can be tuned.
In your case, with only a few dozen hosts (not hundreds), I would inspect the option “max_concurrent_checks”. The default is 0, which lets Nagios run as many checks in parallel as it wants.
With CMK, a value around the number of CPU cores would be a good starting point, I think.
But as I have nearly no bigger CMK installations running on Nagios anymore, I cannot give you firm recommendations.
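A sketch of what that could look like as a Nagios config fragment (the file name and the value 4 are assumptions; OMD sites typically include every *.cfg under ~/etc/nagios/nagios.d/, so check where your site keeps its overrides):

```
# ~/etc/nagios/nagios.d/tuning.cfg  (path assumed -- adjust to your layout)
# 0 = unlimited (the default); cap parallel active checks at roughly
# the number of CPU cores:
max_concurrent_checks=4
```

A `cmk -R` as the site user afterwards reloads the core with the new setting.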


That sounds like a promising solution, actually. Just to make sure: we’re talking about the setting in the file /opt/omd/sites/username/etc/nagios/nagios.cfg? Do I need to take CMK down or restart it after changing it?

Yes, these settings are inside one of the files there.
A one-time `cmk -R` after the change is enough to activate it.


I’m back at work after a while. I was going through emails, saw this thread, and realized I never replied. Thank you so very much for your time!!! This last suggestion fixed our issues and now our server runs like a well-oiled machine.
