High CPU usage of check_mk during massive failures

Hi,

We are using:
CMK version : 2.0.0p5 (CEE)
OS version : Ubuntu 16.04.7
Number of hosts : 740
Number of services : 15000
CMK is running in AWS, m4.xlarge instance (4 vCPU, 16 GiB)

Under normal conditions CMK runs absolutely fine, but during large outages (DB failures, cloud provider outages, internal chaos testing) it takes a very long time to process notifications, and the notification processes use a large amount of CPU.

Example CPU usage:

ps aux --sort=-pcpu | head -n 10
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
frankfu+  3559 49.0  0.1  61640 31584 ?        Rs   11:14   0:00 python3 /omd/sites/frankfurt/bin/cmk --notify --log-to-stdout spoolfile /omd/sites/frankfurt/var/check_mk/notify/spool/d05f4644-0568-45e7-95e2-14f814dcf8c9
frankfu+  3558 48.0  0.1  61296 31288 ?        Rs   11:14   0:00 python3 /omd/sites/frankfurt/bin/cmk --notify --log-to-stdout spoolfile /omd/sites/frankfurt/var/check_mk/notify/spool/ed0f669b-ec1c-4c2a-864f-c6ca8c259760
frankfu+  3557 46.0  0.1  61256 31152 ?        Rs   11:14   0:00 python3 /omd/sites/frankfurt/bin/cmk --notify --log-to-stdout spoolfile /omd/sites/frankfurt/var/check_mk/notify/spool/0007ed8c-3b3b-4389-a2b3-51aaf28c9fcf
frankfu+  1392 18.7  1.1 241408 195664 ?       R    Jun12 1100:32 python3 /omd/sites/frankfurt/bin/cmk --checker
frankfu+  2795 10.4  0.9 332312 148476 ?       S    11:13   0:06 /usr/sbin/apache2 -f /omd/sites/frankfurt/etc/apache/apache.conf
frankfu+  1397  9.5  0.2  76532 38364 ?        S    Jun12 560:15 python3 /omd/sites/frankfurt/bin/fetcher
frankfu+  1398  8.5  0.2  76340 38232 ?        S    Jun12 503:02 python3 /omd/sites/frankfurt/bin/fetcher
frankfu+  1399  7.3  0.2  76536 38492 ?        S    Jun12 434:22 python3 /omd/sites/frankfurt/bin/fetcher
frankfu+  1401  5.9  0.2  76900 38984 ?        R    Jun12 347:47 python3 /omd/sites/frankfurt/bin/fetcher

Notifications are also processed slowly, so they queue up in the spool directory (1000 - 2000 on average):

ll /omd/sites/frankfurt/var/check_mk/notify/spool/ | wc -l
955
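A side note on the count above: an `ls -l` style listing also prints a "total" line (and `.` / `..` if `-a` is in the alias), so the number can be slightly high. A minimal Python sketch that counts only regular files and also reports the age of the oldest one, which shows how far behind processing is. The function name `spool_backlog` is made up for this example; the path is the one from the listing above.

```python
import time
from pathlib import Path


def spool_backlog(spool_dir):
    """Return (file_count, oldest_age_seconds) for regular files in spool_dir."""
    mtimes = []
    for entry in Path(spool_dir).iterdir():
        try:
            if entry.is_file():
                mtimes.append(entry.stat().st_mtime)
        except FileNotFoundError:
            # The spool file was processed and removed while we were scanning.
            continue
    if not mtimes:
        return 0, 0.0
    return len(mtimes), max(0.0, time.time() - min(mtimes))


if __name__ == "__main__":
    # Path taken from the listing above; adjust for your own site.
    spool = Path("/omd/sites/frankfurt/var/check_mk/notify/spool")
    if spool.is_dir():
        count, age = spool_backlog(spool)
        print(f"{count} spool files, oldest is {age:.0f} s old")
```

Running this every few seconds during an outage would show whether the queue is draining or still growing.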

During such periods we also get alerts from CMK about site performance (fetcher and helper usage). I’m not sure, but it looks like the fetchers and helpers are being starved by the notification processes. [Screenshots: CPU load and helper usage]


I have noticed that CPU consumption seems to have grown since we updated CMK from 1.6 to 2.0, but I’m not sure whether this is the root cause or even related.

Is there anything else I can do or investigate, or is adding more resources the right way to go?

Hi
can you specify a little more what you mean by “huge outages”? How many of your hosts and services are affected then?

BR

Sure. For example, up to 100 hosts become unreachable, or up to 300 services on up to 100 hosts start having problems.

Hello,
what kind of notifications do you use? Do you receive notification mails directly from the Checkmk host?

regards
Christian

“hosts become unreachable”: do you mean the UNREACH or the DOWN state?
“services start having problems”: do they go CRIT or WARN?

Or do they all (hosts & services) go stale?

BR

CFriedrich: PagerDuty + mail (mails sent directly from the host). I know PagerDuty has a throttling limit, but it doesn’t look like we reach it: as far as I can see, we only get 2xx responses from PagerDuty.

wittmannthom: hosts mainly go to the DOWN state. Services are roughly 90% CRITICAL and 10% WARNING. Both OK → WARN and OK → CRIT changes are handled by the same notification rule, in case that matters.

Offhand, I’d think there might be too many notification rules doing the same thing.
Can you confirm that all notification rules which do not handle mails are set up properly? In the past I have seen setups that triggered notifications more than once per hard state and thus caused a “jam”.
Also, what do your ~/var/log/notify.log and ~/var/log/mknotifyd.log say when this occurs?

BR

I have had Checkmk stalling to a near-stop during large connectivity outages because:

  1. SNMP queries keep a worker process busy while waiting for a response, even if there isn’t one coming (because of the outage)
  2. We do SNMP every five minutes, with a one-minute timeout and five retries (that’s 1+5 attempts, times 1 minute, = 6 minutes of waiting on a device in a 5-minute window)
  3. The default worker process count is less than the number of SNMP devices we have

… resulting in Checkmk being totally clocked out waiting on SNMP messages and having no time for real checks.
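
To put numbers on the list above, here is a rough Python sketch of the arithmetic. It is a simplified model, not how Checkmk schedules its helpers internally: it assumes every unreachable device blocks one helper for the full worst-case wait once per check interval. The function names are made up for this example, and the 100 unreachable devices are taken from the outage size mentioned earlier in the thread.

```python
def worst_case_wait_seconds(attempts, timeout_s):
    """Time one helper spends on a device that never answers."""
    return attempts * timeout_s


def helpers_tied_up(unreachable_devices, attempts, timeout_s, interval_s):
    """Helpers kept busy by dead devices alone (simplified model, rounded up)."""
    busy_seconds = unreachable_devices * worst_case_wait_seconds(attempts, timeout_s)
    return -(-busy_seconds // interval_s)  # integer ceiling division


if __name__ == "__main__":
    attempts = 1 + 5   # first try plus five retries
    timeout_s = 60     # one-minute timeout
    interval_s = 300   # five-minute check interval

    wait = worst_case_wait_seconds(attempts, timeout_s)
    print(f"{wait} s per unreachable device, "
          f"{wait / interval_s:.2f} of the {interval_s} s interval")
    # prints: 360 s per unreachable device, 1.20 of the 300 s interval

    print(helpers_tied_up(100, attempts, timeout_s, interval_s))
    # prints: 120
```

Anything above 1.00 means a single dead device can never be finished within its own interval, so the backlog only grows for as long as the outage lasts.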

Possibly you have something similar going on?

We resolved this by increasing the helper process count (under global settings; be warned that this slightly increases site restart time, though not reload, which is still fast, so don’t go craaazy with it), reducing the SNMP attempts to 2, and reducing the SNMP timeout to 30 secs for a particular large group of devices for which I can tolerate slightly less resilient monitoring (I still have not had problems with them anyway).

We also contemplated having an extra distributed site just for monitoring SNMP devices, so at least the damage will be restricted to a small section of our monitoring. We didn’t end up doing this, but it’s another option.

I would say then the SNMP config is wrong.

Please don’t do this.

Normally the default SNMP settings don’t need to be changed. A 30-second timeout is way too long.

I’ve learnt that now :smile: I inherited this setup from a predecessor who picked some awfully over-cautious settings without fully understanding their end effects and the interplay between them. At the same time, I believe some SNMP devices badly under-deliver when it comes to performance, so I don’t think that timeout was raised so high on a whim.

I’m not recommending those settings; they did cause us problems. I’m simply describing the scenario in which I had symptoms similar to OP’s, in the hope that it may provide clues and a direction for investigation.

In our case the problem was caused by poorly configured PagerDuty notifications.
Each notification to PagerDuty takes nearly 0.6 seconds, and we had the “Notify all contacts of the notified host or service.” option selected. Hosts have 3-5 users in their contact group, so every event was sent to PagerDuty 3-5 times, which adds up to a significant load.
Also, because of the incident_key, the repeated requests did not show up as duplicate incidents on the PagerDuty side, so the problem was not obvious.
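
A back-of-the-envelope Python sketch of what that fan-out costs. It assumes the PagerDuty calls are made one after another and that each takes the roughly 0.6 seconds measured above; the 300 events are the outage size mentioned earlier in the thread, and the function name is made up for this example. Real throughput depends on how many notification processes run in parallel.

```python
def pagerduty_seconds(events, contacts_per_event, ms_per_call=600):
    """Total time spent in PagerDuty calls if they run sequentially."""
    return events * contacts_per_event * ms_per_call / 1000


if __name__ == "__main__":
    events = 300  # e.g. 300 services changing state during an outage
    for contacts in (5, 3, 1):
        total = pagerduty_seconds(events, contacts)
        print(f"{contacts} contact(s) per event: {total:.0f} s")
    # prints:
    # 5 contact(s) per event: 900 s
    # 3 contact(s) per event: 540 s
    # 1 contact(s) per event: 180 s
```

The last line corresponds to notifying a single dedicated user instead of every contact of the host or service.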

Creating a separate service user and reconfiguring notification rules to just notify this user should more or less solve the problem. We haven’t had huge failures so far, but at least tests are quite promising.

Ideally, this should be noted somewhere related to PagerDuty because I suffered the same problem.

An additional minor problem with PagerDuty is that it does not deal with Downtime related notifications by default, and the pagerduty_event_type dictionary needs to be expanded to deal with it.
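
For illustration, a hedged Python sketch of what such an expansion could look like. This is not the code shipped with Checkmk: the base mapping below is an assumption about its general shape, the function signature is simplified, and the three downtime entries are hypothetical. Which PagerDuty action (`trigger`, `acknowledge` or `resolve`) a downtime should map to is a decision for your own setup, so check the plugin in your version before changing anything.

```python
# Assumed shape of the stock mapping (verify against your installed plugin).
BASE_EVENT_TYPES = {
    "PROBLEM": "trigger",
    "CUSTOM": "trigger",
    "ACKNOWLEDGEMENT": "acknowledge",
    "RECOVERY": "resolve",
    "FLAPPINGSTART": "trigger",
    "FLAPPINGSTOP": "resolve",
}

# Hypothetical additions for downtime notifications; the chosen actions
# are placeholders, not a recommendation.
DOWNTIME_EVENT_TYPES = {
    "DOWNTIMESTART": "acknowledge",
    "DOWNTIMEEND": "resolve",
    "DOWNTIMECANCELLED": "resolve",
}


def pagerduty_event_type(notification_type):
    """Map a notification type to a PagerDuty action, or None if unmapped."""
    mapping = {**BASE_EVENT_TYPES, **DOWNTIME_EVENT_TYPES}
    return mapping.get(notification_type)


if __name__ == "__main__":
    for kind in ("PROBLEM", "DOWNTIMESTART", "SOMETHING_ELSE"):
        print(kind, "->", pagerduty_event_type(kind))
```

Returning `None` for an unmapped type lets the caller skip the notification instead of failing on a missing key.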