Background:
In my company we have a Check_MK instance running on a physical server.
OS: CentOS 7, latest updates
Check_MK version: check-mk-enterprise-1.6.0p12
We are using this server heavily for monitoring the whole infrastructure.
We have ~400 groupings of services to service groups and ~50 groupings of hosts to host groups.
We have ~150 notification rules set up, most of which just send e-mails.
We have ~40000 services.
The server is a physical machine with a large amount of RAM, most of which is unused, and multiple CPUs (I am not sure of the exact number), which are also lightly loaded.
Problem:
We seem to be experiencing delayed notifications: Check_MK sends them far too late. For example, I received a notification about a service 40 minutes after it went OK → CRITICAL. As an admin I find this really frustrating. I cannot confirm that a notification rule works properly, I start to wonder whether I configured something incorrectly, and I am unable to debug this bizarre behaviour. Detecting downtimes is also very slow: users notice an outage 30-40 minutes before the admin team receives an e-mail that a service is down. That is far too long.
There is no rule in the ruleset “Delay service notifications”.
We have a test instance of Check_MK and it works very well. It detects service transitions very fast and sends e-mails appropriately. This production instance is bigger and it seems to be struggling with something.
When I check var/log/notify.log I see hundreds or thousands of lines like these:
2020-07-29 16:28:27 Global rule ‘X1’…
2020-07-29 16:28:27 → does not match: Notification has not been created by the Event Console.
2020-07-29 16:28:27 Global rule ‘X2’…
2020-07-29 16:28:27 → does not match: The rule requires WATO folder ‘Y1’, but the host is in ‘Z1’
2020-07-29 16:28:27 Global rule ‘X3’…
2020-07-29 16:28:27 → does not match: The rule requires WATO folder ‘Y2’, but the host is in ‘Z2’
2020-07-29 16:28:27 Global rule ‘X4’…
2020-07-29 16:28:27 → does not match: The service is in no service group, but Z3 is required
2020-07-29 16:28:27 Global rule ‘X5’…
2020-07-29 16:28:27 → does not match: The service is in no service group, but Z4 is required
2020-07-29 16:28:27 Global rule ‘X6’…
2020-07-29 16:28:27 → does not match: The service is in no service group, but Z5 is required
2020-07-29 16:28:27 Global rule ‘X7’…
2020-07-29 16:28:27 → does not match: The rule requires WATO folder ‘Y3’, but the host is in ‘Z6’
2020-07-29 15:34:48 Global rule ‘X8’…
2020-07-29 15:34:48 → does not match: The service is only in the groups Y4, but Z7 is required
Could this be the problem?
Is Check_MK not able to work with that many notification rules? 150 is not that many in my opinion.
What can I do to fix these issues?
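To put some numbers on the notify.log excerpt above, here is a minimal sketch of how the log could be summarized. It assumes every relevant line starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by the message, as in the excerpt. The function names `summarize` and `summarize_file` are my own and are not part of Check_MK. It counts how many log lines were written per second and how often each "does not match" reason occurs, which might show whether rule evaluation itself is where the time goes.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
# "<YYYY-MM-DD HH:MM:SS> <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def summarize(lines):
    """Return (lines per timestamp, count per 'does not match' reason)."""
    per_second = Counter()
    reasons = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        stamp, message = match.groups()
        per_second[stamp] += 1
        if "does not match:" in message:
            # Keep only the text after the marker as the reason.
            reason = message.split("does not match:", 1)[1].strip()
            reasons[reason] += 1
    return per_second, reasons


def summarize_file(path):
    """Hypothetical helper: run summarize() over a log file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle)
```

Calling `summarize_file("var/log/notify.log")` from the site directory and printing `per_second.most_common(10)` and `reasons.most_common(10)` would list the busiest seconds and the most frequent mismatch reasons.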