Check_MK slow sending notifications

Background:
In my company we have a Check_MK instance running on a physical server.

OS: CentOS 7, latest updates
Check_MK version: check-mk-enterprise-1.6.0p12

We are using this server heavily for monitoring the whole infrastructure.
We have ~400 groupings of services to service groups and ~50 groupings of hosts to host groups.
We have ~150 notification rules set up, most of which just send e-mails.
We have ~40000 services.

The server Check_MK runs on is physical. It has a large amount of RAM, most of which is unused, and multiple CPUs (I am not sure of the exact number) that are also lightly used.

Problem:
We are experiencing delayed notifications: Check_MK sends them far too late. For example, I received a notification about a service 40 minutes after it went OK -> CRITICAL. As an admin this is really annoying. I cannot confirm that a notification rule works properly, I start wondering whether I configured something incorrectly, and I am unable to debug this bizarre behaviour. Detecting downtime is also really slow: users notice it 30-40 minutes before the admin team receives an e-mail that a service is down. That is far too long.

There is no rule in the ruleset “Delay service notifications”.

We have a test instance of Check_MK and it works very well. It detects service transitions very fast and sends e-mails appropriately. This production instance is bigger and it seems to be struggling with something.

When I check var/log/notify.log I see hundreds or thousands of lines:

2020-07-29 16:28:27 Global rule ‘X1’…
2020-07-29 16:28:27 -> does not match: Notification has not been created by the Event Console.
2020-07-29 16:28:27 Global rule ‘X2’…
2020-07-29 16:28:27 -> does not match: The rule requires WATO folder ‘Y1’, but the host is in ‘Z1’
2020-07-29 16:28:27 Global rule ‘X3’…
2020-07-29 16:28:27 -> does not match: The rule requires WATO folder ‘Y2’, but the host is in ‘Z2’
2020-07-29 16:28:27 Global rule ‘X4’…
2020-07-29 16:28:27 -> does not match: The service is in no service group, but Z3 is required
2020-07-29 16:28:27 Global rule ‘X5’…
2020-07-29 16:28:27 -> does not match: The service is in no service group, but Z4 is required
2020-07-29 16:28:27 Global rule ‘X6’…
2020-07-29 16:28:27 -> does not match: The service is in no service group, but Z5 is required
2020-07-29 16:28:27 Global rule ‘X7’…
2020-07-29 16:28:27 -> does not match: The rule requires WATO folder ‘Y3’, but the host is in ‘Z6’
2020-07-29 15:34:48 Global rule ‘X8’…
2020-07-29 15:34:48 -> does not match: The service is only in the groups Y4, but Z7 is required

Could this be the problem?
Is Check_MK not able to work with that many notification rules? 150 is not that many in my opinion.
What can I do to fix these issues?

Why do you need 150 notification rules?

These will slow down notification delivery, as you have already noted.

Usually there are about 10 rules at most.

In addition to what @r.sander said, I would also take a look at your check interval and check attempts.
In your screenshot you can see that all the rules from X1 to X7 are processed within the same second.
Create a sample notification and check in your log how long the processing takes. If it takes a long time, you should reduce the number of rules.
But a delay of 30-40 minutes points to another problem, not only the rules.
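To check how long rule processing takes, the timestamps in notify.log can be summarized with a small script. This is only a sketch: it assumes every relevant line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpt above, and the function name `summarize` is made up for illustration. The sample lines are copied from that excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape, taken from the excerpt: "YYYY-MM-DD HH:MM:SS <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def summarize(lines):
    """Return (rule evaluations per timestamp, seconds between first and last timestamp)."""
    per_second = Counter()
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        stamps.append(stamp)
        if match.group(2).startswith("Global rule"):
            per_second[stamp] += 1
    span = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return per_second, span


# Four lines copied from the log excerpt in the question
sample = [
    "2020-07-29 16:28:27 Global rule ‘X7’…",
    "2020-07-29 16:28:27 -> does not match: The rule requires WATO folder ‘Y3’, but the host is in ‘Z6’",
    "2020-07-29 15:34:48 Global rule ‘X8’…",
    "2020-07-29 15:34:48 -> does not match: The service is only in the groups Y4, but Z7 is required",
]
rules_per_second, span_seconds = summarize(sample)
```

If many rules share one timestamp, rule matching itself is fast and the delay most likely comes from somewhere else. To run it on the real file, pass the open file object instead of `sample`.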

What happens if you use the analysis function inside the rule based notifications?

Hi @r.sander and @andreas-doehler. Thanks for responding.

Different people need to be notified for different types of devices and for different servers, and all of them have different e-mail addresses. So we have roughly one notification rule per production server. Not all production servers have a notification rule, but most do, and different people or groups of people want to know if there is a problem with their server. For some servers, multiple people also need to be notified by a different method such as SMS or chat message, so sometimes there are 2-3 notification rules for certain servers.

I am really surprised that 10 notification rules should be the maximum. I thought that a large number of notification rules was normal.
Hypothetically, what if a cloud provider with around 1000 customers needed to inform each customer when one of their services was down?

The check interval is 5 minutes, so checkmk detects relatively quickly that a service state has changed, but no notification gets sent when it detects it. It waits a long time to send the notification.

I am not sure what you mean by the analysis function. Does it write messages to the log when a notification script gets called or reaches its end, or is it something else?

This is not the maximum, but it should be enough for a normal system.

One rule: assign contacts to the hosts or services and use only the rule that notifies the contacts of the object. That’s all.

It is also important whether you define a “maximum number of check attempts” and a “retry check interval”.
The time from the first detection of a state change to the notification being sent can be relatively long if these values are set badly.
Example from my environment (this is my best-practice setting), with “Ping” as the host check:

Maximum number of check attempts for host - 3
Normal check interval for host checks - 1 min
Retry check interval for host checks - 1 min

A notification is sent at the latest 3 minutes after the problem occurs, and no notification is sent if the problem lasts only 2 minutes.
You need to check the same settings for the service checks.

If you go to “WATO” - “Notifications” you will find the button.
With this button you can replay notifications already processed by your system.
This is ideal to test the runtime of your rules.

This is normal in almost every infrastructure. That is why contacts and contact groups exist: the default notification rule sends mails to the contacts of the notified object (host or service).

There really is no need to have one notification rule per host.

Hmm, interesting. I didn’t see this as a possibility. I think we ignore contact groups altogether.

In my opinion this should be mentioned somewhere in the documentation about notifications, as a best practice. It is quite easy to start configuring check_mk in a non-recommended way because there are a lot of ways to do similar things in check_mk.

I will try to do some debugging with this “Analyse” feature, or just log the entry and exit times of the notification script. Eventually I may migrate everything to the approach with one notification rule and “Notify all contacts of the notified host or service”, but that will be a pretty big migration.
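Logging the entry and exit time of a notification script could look roughly like the sketch below. The names `timed` and `send_notification` and the log path are hypothetical, and the sketch assumes the notification script is written in Python.

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed(label, log_path):
    """Append one line on entry and one on exit, including the elapsed seconds."""
    started = time.monotonic()
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{datetime.now().isoformat(timespec='seconds')} START {label}\n")
    try:
        yield
    finally:
        elapsed = time.monotonic() - started
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(
                f"{datetime.now().isoformat(timespec='seconds')} END {label} after {elapsed:.3f}s\n"
            )


def send_notification():
    """Placeholder for the real delivery logic (mail, SMS, chat)."""
    time.sleep(0.01)
```

Wrapping the script body as `with timed("my-host", "/tmp/notify_timing.log"): send_notification()` writes a START and an END line per run. If the START line already appears long after the state change, the delay happens before the script is called, not inside it.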

You should first check which settings your system uses for the parameters I mentioned in my last post.

If you have a check interval of 5 minutes, a maximum number of check attempts of 4 and a retry interval of 5 minutes, it takes 15-20 minutes until the notification is sent to the users/admins.
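The arithmetic behind these numbers can be sketched as follows. This is a simplified model, not how Checkmk computes anything internally: it assumes the problem is first seen at the next regular check (0 to one check interval after it starts) and that a notification goes out once the last retry has failed. The function name is made up for illustration.

```python
def notification_delay_window(check_interval, max_attempts, retry_interval):
    """Return (earliest, latest) minutes from problem onset to the notification.

    Simplified model: the first failed check happens 0 to check_interval
    minutes after the problem starts, followed by max_attempts - 1 retries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    retries = (max_attempts - 1) * retry_interval
    return retries, check_interval + retries


# Host check example from earlier in the thread: interval 1, attempts 3, retry 1
print(notification_delay_window(1, 3, 1))  # (2, 3)
# Service example: interval 5, attempts 4, retry 5
print(notification_delay_window(5, 4, 5))  # (15, 20)
```

Even the slower of the two settings explains at most 20 minutes, so a 40-minute delay would need an additional cause.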