BUG? - Issues with "notification count" for escalation

frakka · April 20, 2022, 1:04pm

Hi, andreas.

I created a test service.
I created a rule to notify about that service only when it goes into the critical state, recovers to OK, or at the start or end of a scheduled downtime.
I created a notification rule for that service, with periodic notifications during service problems set to notify every 5 minutes.
I created an escalation rule that applies starting from the 3rd notification.
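
To make clear what I expect from this setup, here is a small Python sketch of the rules as I understand them. This is only my own model: the names `NOTIFIED_EVENTS`, `base_rule_matches` and `escalation_rule_matches` are hypothetical and are not Checkmk APIs or settings.

```python
# Hypothetical model of the test setup described above.
# None of these names come from Checkmk; they only encode my expectations.

# Events the base rule should notify about (everything else is ignored).
NOTIFIED_EVENTS = {"CRIT", "RECOVERY_TO_OK", "DOWNTIME_START", "DOWNTIME_END"}

# Periodic notifications during service problems: every 5 minutes.
PERIODIC_INTERVAL_MINUTES = 5

# The escalation rule applies starting from the 3rd notification.
ESCALATION_FROM_NUMBER = 3


def base_rule_matches(event: str) -> bool:
    """True if the event is one the base rule is meant to notify about."""
    return event in NOTIFIED_EVENTS


def escalation_rule_matches(notification_number: int) -> bool:
    """True if the notification number has reached the escalation threshold."""
    return notification_number >= ESCALATION_FROM_NUMBER


if __name__ == "__main__":
    for event in ("WARN", "CRIT", "RECOVERY_TO_OK"):
        print(event, "->", base_rule_matches(event))
    for number in range(1, 7):
        print("notification", number, "escalated:", escalation_rule_matches(number))
```

With this model, a plain WARN should never notify, and escalation should start exactly at notification number 3.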

If the service does this walk: OK → CRIT → OK
The service changed state at 2022-04-19 17:35. The state changed immediately, but the first notification didn’t appear until 2022-04-19 17:45:19, with “SERVICENOTIFICATIONNUMBER 1”. I don’t know what causes this delay: no notification bulking should exist for this service, and the check interval is 1 minute.
“Periodic notifications during service problems” sent a new notification every 5 minutes after the first one, as expected, at 17:50:19, 17:55:19, 18:00:19 and 18:05:19. The last one carried “SERVICENOTIFICATIONNUMBER 5”, as expected.
The service recovered to “OK” at 18:06; a notification with “SERVICENOTIFICATIONNUMBER 6” appeared immediately and the recovery email was sent, as expected.
I received the escalation notification for SNN 3, 4, 5 and 6.
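
For reference, the timeline of this first walk can be reproduced with a few lines of Python. This is just arithmetic on the timestamps above, not anything taken from Checkmk itself; the function name `expected_notifications` is my own, and I am assuming the recovery happened at exactly 18:06:00, since I only noted the minute.

```python
from datetime import datetime, timedelta

FIRST = datetime(2022, 4, 19, 17, 45, 19)   # first notification (SNN 1)
INTERVAL = timedelta(minutes=5)             # periodic notification interval
RECOVERY = datetime(2022, 4, 19, 18, 6, 0)  # assumed: seconds were not recorded
ESCALATE_FROM = 3                           # escalation starts at the 3rd notification


def expected_notifications(first, interval, recovery, escalate_from):
    """Return (number, time, kind, escalated) tuples for one problem episode."""
    rows = []
    number, when = 1, first
    # Periodic problem notifications until the service recovers.
    while when < recovery:
        rows.append((number, when, "PROBLEM", number >= escalate_from))
        number += 1
        when += interval
    # The recovery notification takes the next number in the sequence.
    rows.append((number, recovery, "RECOVERY", number >= escalate_from))
    return rows


rows = expected_notifications(FIRST, INTERVAL, RECOVERY, ESCALATE_FROM)
for number, when, kind, escalated in rows:
    print(number, when.time(), kind, "escalated" if escalated else "")
```

This gives notification numbers 1 to 6, with 3, 4, 5 and 6 escalated, which is exactly what I observed in this walk.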
If the service does this walk: OK → WARN → OK
The service changed state at 2022-04-19 18:15 and went to WARN.
I waited until “The last time the service was OK” reached 12 minutes, and at 18:28 I recovered the service to “OK”.
I didn’t receive any notification at all, not even when the service recovered to OK. Unexpected but good.
If the service does this walk: OK → WARN → CRIT → OK
The service changed state at 2022-04-19 18:34 and went to WARN. I waited some minutes (no notifications, as expected) and at 18:39 I moved the service to the CRIT state. Again, no notification was raised until 2022-04-19 18:44:13, when checkmk reported “The age of the current service state → 4 m” and “The last time the service was OK → 10 m”. This notification came with SERVICENOTIFICATIONNUMBER 1.
The second notification was sent at 2022-04-19 18:49:13 with “SERVICENOTIFICATIONNUMBER 2”, and at 18:54:13 I got the 3rd notification together with the first escalation. At 2022-04-19 18:56:14 I moved the service to “OK” and got the 4th notification, as expected.
If the service does this walk: OK → WARN → CRIT → WARN → CRIT → WARN → OK
At 18:57 I moved the service to WARN and, after a short time, to CRIT. Again I waited for the first notification until “The last time the service was OK → 10 m”, and it came at 2022-04-19 19:07:14. I waited until the first escalation (at 2022-04-19 19:18:13 with SERVICENOTIFICATIONNUMBER 3) and then moved the service to the WARN state at 19:19. At 19:24 I moved the service to CRIT again, and at 2022-04-19 19:25:30 I got the 4th notification, about a transition “? → CRITICAL”, with SERVICENOTIFICATIONNUMBER 4.
Then I moved the service to the “WARN” state and didn’t get any new notification, not even the “periodic notification during service problems”. At 19:32 I moved the service to “OK” and it was detected as “flapping”, so I didn’t get any new notifications at all … Even after the flapping state had ended, I didn’t get any notification about the service recovering to OK, and this is not as expected.
Not so good.
I repeated the last walk (OK → WARN → CRIT → WARN → CRIT → WARN → OK) with flap detection disabled.
On the last transition from WARN to “OK”, without flap detection, I got both notifications (periodic and escalation) for “WARN → OK”. So this is not consistent with the second test and, also strangely, the transition is not reported as “? → OK”, unlike the change from WARN to CRIT, which was reported as “? → CRIT” in yesterday’s test.
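
Looking back at the three walks where a first notification was actually sent, the delay between leaving OK and “SERVICENOTIFICATIONNUMBER 1” can be checked with the short sketch below. It only uses the timestamps I reported above; the state-change times were noted to the minute, so I am assuming zero seconds for them, and the helper `at` and the `WALKS` table are just my own naming.

```python
from datetime import datetime

DAY = (2022, 4, 19)


def at(hh, mm, ss=0):
    """Build a timestamp on the test day."""
    return datetime(*DAY, hh, mm, ss)


# walk -> (time the service left OK, time of the first notification)
# Seconds of the state changes were not recorded and are assumed to be 0.
WALKS = {
    "OK -> CRIT -> OK": (at(17, 35), at(17, 45, 19)),
    "OK -> WARN -> CRIT -> OK": (at(18, 34), at(18, 44, 13)),
    "OK -> WARN -> CRIT -> WARN -> CRIT -> WARN -> OK": (at(18, 57), at(19, 7, 14)),
}

delays = {name: first - left_ok for name, (left_ok, first) in WALKS.items()}
for name, delay in delays.items():
    print(f"{name}: first notification {delay} after leaving OK")
```

In all three cases the first notification arrives a little more than 10 minutes after the service left OK, regardless of when it actually entered CRIT. I still don’t know where this delay comes from.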

So in my case, using “Notified events for services” (and “Notified events for hosts” too, I suppose, but I need to test it) can work around the issue, but it is not a nice solution, because it is an all-or-nothing setting: no one can be notified any longer about, for example, a WARNING state of the services and hosts matched by those rules, for instance during a different time period.