CheckMK Version 2.3.0p27
After working on and off with CheckMK for a few years (and attending a training course for it), we are using it to monitor a small number of PostgreSQL database servers. The servers have different criticalities: we have test, integration and production servers. For notifications, we want the production servers to alert 24x7 and all other servers to alert only during business hours.
For that, in addition to the 24x7 time period, we have created one for business hours, covering business days (Mo-Fr) and business hours (6:30-15:30).
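To make the intended window unambiguous, here is a minimal Python sketch of what we mean by business hours. It only illustrates our intent and is not Checkmk configuration. The function name `in_office_hours` is made up for this post, and treating 15:30 as exclusive is an assumption.

```python
from datetime import datetime, time

# Assumed definition of our business-hours window: Mo-Fr, 06:30-15:30.
BUSINESS_DAYS = {0, 1, 2, 3, 4}  # datetime.weekday(): Monday=0 ... Friday=4
BUSINESS_START = time(6, 30)
BUSINESS_END = time(15, 30)  # assumption: the end time itself is excluded


def in_office_hours(ts: datetime) -> bool:
    """Return True if ts falls on a business day within the window."""
    return ts.weekday() in BUSINESS_DAYS and BUSINESS_START <= ts.time() < BUSINESS_END
```

So a host going down on a Monday at 10:00 falls inside the window, while the same event at 02:00 or on a Saturday does not.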
All servers in the “DB servers” folder are configured with either “Production”, “Test” or “Integration” criticality, using the pre-configured host tag.
With all this in place, we created two host notification rules: one that checks for the “Production” host tag and alerts 24x7, and one that checks that the host is not a “Production” system and alerts only during business hours. Both match by host event types. A third rule was created for services; it relies on the time period configured on the services and matches by the relevant service event types.
We thought this setup was straightforward and would do what we expect. However, we somehow get host alerts at night for non-production systems. We assume this is because the notification period configured for all servers is 24x7. We thought that the time period condition in the notification rules would override this:
Rule 1 (production hosts):
Match host tags → Criticality → Is → Production System
Match only during time period → 24x7
Rule 2 (non-production hosts):
Match host tags → Criticality → IsNot → Production System
Match only during time period → office_hours
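To spell out our expectation, here is a small Python model of how we assumed the two host rules would be evaluated. This is only our mental model, not Checkmk's actual implementation. All names (`Rule`, `expected_to_notify`) are made up, and the criticality strings are placeholders rather than Checkmk's internal tag IDs.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Callable


def always(ts: datetime) -> bool:
    """Stand-in for the 24x7 time period."""
    return True


def office_hours(ts: datetime) -> bool:
    """Stand-in for our office_hours period: Mo-Fr, 06:30-15:30."""
    return ts.weekday() < 5 and time(6, 30) <= ts.time() < time(15, 30)


@dataclass
class Rule:
    name: str
    tag_matches: Callable[[str], bool]   # condition on the criticality tag
    period: Callable[[datetime], bool]   # "match only during time period"


RULES = [
    Rule("production, 24x7", lambda c: c == "Production", always),
    Rule("non-production, office hours", lambda c: c != "Production", office_hours),
]


def expected_to_notify(criticality: str, ts: datetime) -> bool:
    """Our expectation: notify only if at least one rule matches tag AND period."""
    return any(r.tag_matches(criticality) and r.period(ts) for r in RULES)
```

Under this model, a “Test” host going down on a Monday at 02:00 matches neither rule, so we expected no notification at all.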
But that doesn’t seem to work as we expect, which means our expectations are probably wrong. So we have to change the setup to prevent these false positives. But which way to go?
- Find out how to change the check period on hosts and set it to office_hours for all non-production hosts
- Somehow fix the notification rules so that they do not alert at night for non-production systems
We’d be grateful for tips on how to do either. It would also be interesting to learn why our setup fails.