One site slows the entire CMK server to a halt

CMK version: 2.2.0p11
OS version: Debian 12 Bookworm

Hi! We updated our CMK instance step by step from 1.5.0 all the way up to 2.2.0p11. Initially, we did this only on one site. It went flawlessly - the site was very snappy, ran extremely well, used minimal resources, and was very smooth on a single CPU core. I slowly updated the agents of all the hosts there as well and had no problems whatsoever.

However, upon updating the second site, the moment I started it, CPU usage maxed out. It was struggling so much that I couldn’t even open the frontend on either site. Even after giving it 8 CPUs, it’s still struggling. I ended up changing the check interval to every five minutes, which made it somewhat more responsive, although it’s still on 8 CPUs.

The agents are currently all outdated and all have the old Docker plugin, although I don’t think that’s the cause of the problem, since the first site monitors a very similar setup. About half of the hosts show a check execution time (Perf-O-Meter) of about 10-13 seconds, the others 3-4 seconds. All of the hosts run the exact same hardware and software.

Any ideas what might be causing this?

Do you use the HW/SW-Inventory?
Have a look at the normal check interval for this service check. It should be every 12 hours, not every minute. Create a rule specific to this service check.
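If you want to verify the interval from the command line, a Livestatus query along these lines should show it (assuming the default service name; site2 is a placeholder for your site):

OMD[site2]:~$ lq "GET services\nColumns: host_name check_interval\nFilter: description = Check_MK HW/SW Inventory"

check_interval is counted in the site’s interval length (one minute by default), so a sane value here would be 720, not 1.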

Nope, we don’t use the HW/SW Inventory (under HW/SW Inventory Rules, “Do Hardware/Software Inventory” shows 0 rules). Thank you for the quick response!

Still have no idea what’s wrong…

When looking at the running processes with htop (or maybe atop, where you can group processes with “p”) - can you see which processes of site2 are creating the load? Is it the checks, or maybe something going rogue within the Apache?

Or another way: if you run “omd stop apache” as the site2 user, does that change the load on the whole system?
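Or, as root, list the heaviest processes of that site’s user directly - site2 again being a placeholder for the actual site name:

root@host:~# ps -u site2 -o pid,pcpu,etime,args --sort=-pcpu | head -n 15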


The Apache doesn’t use more than 5% CPU at any given time. It’s definitely the checks. Keeping an eye on htop: whenever the original site runs its checks, CPU usage peaks for a second or two and then immediately drops back to idle. The second site, however, keeps the CPU occupied the entire time with checks (when set to check every minute).

One might think “why not leave it at checking every 5 minutes” - but even that is problematic once we bring CheckMK’s CPUs back down to 1 or 2. The resources are needed elsewhere. PLUS, it’s still pretty unstable :(

Actually in hindsight…

OMD[username]:~$ omd stop apache
Stopping apache...killing 1739...........................................................................................................................ERROR

It’s SLIIIGHTLY better, but the original site is still nowhere near as snappy as before.
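(Side note, in case it helps with the ERROR above: whatever Apache process is left over after a failed stop can be listed as root with something like this - username being the site user, 1739 the PID from the output:

root@host:~# pgrep -af -u username apache
)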

Are you running only Checkmk checks, or also active/classical Nagios checks?
What processes exactly does htop show you that run for more than 1-2 seconds with high load?

Adding to Robert’s question regarding the HW/SW inventory - how often does the “Check_MK Discovery” service run?


I’m running two active checks against two routers on the original site (not the one that causes the slowness); everything else is normal Checkmk checks. rrdcached seems to be the most common process that stays at the top, but there are also the checks, which take a good 5-10 seconds each on the second site. The Check_MK Discovery service runs every 2h.
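To put numbers on those slow checks, a single host’s check run can be timed as the site user - somehost being a placeholder for one of the slow hosts:

OMD[username]:~$ time cmk -nv somehost

That gives a rough wall-clock figure for one full check cycle of that host, without submitting results to the core.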

UPDATE 1: omd config show CORE also seems to occupy a LOT of resources, as many instances of it pop up simultaneously at what seems to be every other check.

UPDATE 2: About 20 different nagios processes referencing /omd/sites/sitename/tmp/nagios/nagios.cfg are running on the second site, occupying almost 100% of all CPU cores.
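Since those processes point at nagios.cfg, the site is apparently running the Nagios core (as the raw edition does). That can be double-checked as the site user:

OMD[sitename]:~$ omd config show CORE
nagios

(the second line being the expected output in that case).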

Restarting both sites stabilizes the entire operation at first: everything runs a lot smoother and doesn’t take up as many resources. But that only lasts for a few hours; eventually it ends up stuttering and lagging again. Occasionally, for some reason, the Check_MK Discovery check times out on ALL hosts.

This is so weird.

New update: it seems like CheckMK simply didn’t check anything over the weekend. Yesterday morning everything had gone stale. We thought it was from the restart we did, but actually it was from not having gotten any data UNTIL the restart. Any help would be appreciated!!
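One thing worth pulling while it’s fresh: the core’s log for the stale period. Assuming the Nagios core and the usual OMD layout (the path may differ on your installation):

OMD[username]:~$ grep -i error var/log/nagios.log | tail -n 50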