BUG? - Issues with "notification count" for escalation

frakka · April 13, 2022, 4:24pm

Hi all,
I’m on 2.0.0p22 (CRE) and I’m having trouble understanding how to use the “Restrict to notification number” parameter to get the desired escalation notification.

I noticed that the value of “SERVICENOTIFICATIONNUMBER” increases on each service state change, even if no notification rule has actually sent a notification to anyone.

For example:
I configured a notification rule to send an SMS only for some services when they become CRIT or go back to OK, and I excluded the WARN state because I’m not interested in it during nighttime.
I added a “Periodic notifications during service problems” rule that repeats the notification after 60 minutes, and another rule to notify different contacts via email, with “Restrict to notification number” set to “3 … 99999”.

I expected the first notification to be sent when the service goes to the CRIT state, the second one when it comes back to “OK” or is still in the CRIT state an hour after the first notification, and a third notification to be sent by email if no one has fixed or acknowledged the service problem within an hour of the second notification.

Instead, it seems that the value of “SERVICENOTIFICATIONNUMBER” increases on each service state transition between the WARN and CRIT states, regardless of whether that transition actually generated any kind of notification.

So if a service goes from “OK” to “WARN”, then to “CRIT” and back to “WARN”, the escalation email is sent even if only 5 minutes have passed since the service was in the “OK” state.
This way the escalation emails are sent even for services that are not configured to send notifications at all (OK, this can be worked around by applying the same filter to the escalation notification as well, but that is a duplication).

I think that “SERVICENOTIFICATIONNUMBER” (like “HOSTNOTIFICATIONNUMBER”) should increase only if the transition matches a notification rule.
Am I wrong?

andreas-doehler · April 13, 2022, 8:11pm

I think this is not easily achievable. The problem here is the following.

The core logic

State change from OK → WARN - the core triggers the first notification (Nr.1) - the core sends this to the notification subsystem and doesn’t know whether a real notification is actually going out.
Next state change from WARN → CRIT - for the core this is the second notification for this problem.
And so on until the service reaches the OK state again. Then everything starts again from 0.
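The counting described above can be sketched as a toy model. This is only an illustration of the description in this post, not Checkmk's actual core code: the function names `raw_notification_numbers` and `escalated` are made up, and the recovery notification itself is left out of the sketch.

```python
from __future__ import annotations


def raw_notification_numbers(walk: list[str]) -> list[tuple[str, int]]:
    """Toy model: every change into a problem state bumps the counter.

    The walk is assumed to start from OK. Whether any notification rule
    matches is deliberately ignored, as in the description above.
    """
    counter = 0
    previous = "OK"
    numbered = []
    for state in walk:
        if state == previous:
            continue  # no state change, nothing to count
        previous = state
        if state == "OK":
            counter = 0  # recovery: counting starts again from 0
        else:
            counter += 1
            numbered.append((state, counter))
    return numbered


def escalated(numbered: list[tuple[str, int]], first: int = 3, last: int = 99999):
    """Entries that a 'Restrict to notification number' range would match."""
    return [(state, n) for state, n in numbered if first <= n <= last]


numbered = raw_notification_numbers(["WARN", "CRIT", "WARN"])
print(numbered)             # [('WARN', 1), ('CRIT', 2), ('WARN', 3)]
print(escalated(numbered))  # [('WARN', 3)]
```

In this model the walk OK → WARN → CRIT → WARN already reaches number 3, which is why a rule restricted to “3 … 99999” would match after only a few minutes.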

But here is a solution for this problem
With the rule “Notified events for services” you can define on a per-service basis what should be notified directly from the core. That means you can say that only CRIT and OK should trigger a notification.
With this setting your count should also not increase so quickly.
I don’t know how it counts with a sequence of state changes like OK → WARN → CRIT → WARN → CRIT.
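One plausible reading of this setting, written as a toy model: state changes into states that are not in the notified events simply do not bump the counter. This is an assumption for illustration only, not confirmed core behavior, and the names `numbers_with_event_filter` and `notified_states` are hypothetical.

```python
from __future__ import annotations


def numbers_with_event_filter(
    walk: list[str], notified_states: set[str]
) -> list[tuple[str, int]]:
    """Toy model: only changes into a notified problem state are counted.

    ASSUMPTION: states excluded via 'Notified events for services' do not
    increase the counter. The walk is assumed to start from OK.
    """
    counter = 0
    previous = "OK"
    numbered = []
    for state in walk:
        if state == previous:
            continue  # no state change
        previous = state
        if state == "OK":
            counter = 0  # recovery resets the counter
        elif state in notified_states:
            counter += 1
            numbered.append((state, counter))
    return numbered


walk = ["WARN", "CRIT", "WARN", "CRIT"]
print(numbers_with_event_filter(walk, {"CRIT"}))
# [('CRIT', 1), ('CRIT', 2)]
print(numbers_with_event_filter(walk, {"WARN", "CRIT"}))
# [('WARN', 1), ('CRIT', 2), ('WARN', 3), ('CRIT', 4)]
```

Under this assumption the walk OK → WARN → CRIT → WARN → CRIT would reach number 2 instead of 4, so an escalation starting at number 3 would not match yet.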

frakka · April 14, 2022, 10:20am

I’m not a developer, so I don’t know how difficult it is, but I think that those counters should be increased by the notification subsystem and not by the core (or that a different counter should exist if “SERVICENOTIFICATIONNUMBER” and “HOSTNOTIFICATIONNUMBER” are also used for other purposes).
As it is, it is very difficult (impossible, I think) to configure a reliable escalation procedure based on notification numbers, because those counters don’t count the “notifications” but the “state transitions” of the services.

OK, but if I disable (for example) the notification for transitions to the warning state, I also lose these events on the Checkmk dashboard, right? And this rule is not configurable on a time period basis.

I’ll try.

andreas-doehler · April 14, 2022, 11:16am

The state change is normally still visible in the dashboard; only the notification will not be generated.

frakka · April 20, 2022, 1:04pm

Hi, andreas.

I created a test service.
I created a rule to notify about that service only when it goes into the critical state, recovers to OK, or at the start or end of a scheduled downtime.
I created a notification rule for that service and a “Periodic notifications during service problems” rule to notify every 5 minutes.
I created an escalation rule that applies starting from the 3rd notification.

If the service does this walk: OK → CRIT → OK
The service changed state at 2022-04-19 17:35. The service state changed immediately, but the first notification didn’t appear until 2022-04-19 17:45:19, with “SERVICENOTIFICATIONNUMBER 1”. I don’t know what this delay is due to; no notification bulking should exist for this service (and the check interval is 1 minute).
“Periodic notifications during service problems” sent a new notification every 5 minutes after the first one, as expected, at 17:50:19, 17:55:19, 18:00:19 and 18:05:19. The last one came as “SERVICENOTIFICATIONNUMBER 5”, as expected.
The service recovered to “OK” at 18:06 and a notification immediately appeared as “SERVICENOTIFICATIONNUMBER 6”, and the recovery email was sent as expected.
I received the escalation notification for SNN (SERVICENOTIFICATIONNUMBER) 3, 4, 5 and 6.
If the service does this walk: OK → WARN → OK
The service changed state at 2022-04-19 18:15 and went to WARN.
I waited until “The last time the service was OK” showed 12 minutes, and at 18:28 I recovered the service to “OK”.
I didn’t receive any notification at all, not even when the service recovered to OK. Unexpected but good.
If the service does this walk: OK → WARN → CRIT → OK
The service changed state at 2022-04-19 18:34 and went to WARN. I waited some minutes (no notifications, as expected) and at 18:39 I moved the service to the CRIT state. Again, no notification was raised until 2022-04-19 18:44:13, when Checkmk reported “The age of the current service state → 4 m” and “The last time the service was OK → 10 m”. This notification came with SERVICENOTIFICATIONNUMBER 1.
The second notification was sent at 2022-04-19 18:49:13 with “SERVICENOTIFICATIONNUMBER 2”, and at 18:54:13 I got the 3rd notification and the first escalation. At 2022-04-19 18:56:14 I moved the service to “OK” and got the 4th notification, as expected.
If the service does this walk: OK → WARN → CRIT → WARN → CRIT → WARN → OK
At 18:57 I moved the service to WARN and after a short time to CRIT. I waited for the first notification again, until “The last time the service was OK → 10 m”, and it came at 2022-04-19 19:07:14. I waited until the first escalation (at 2022-04-19 19:18:13 with SERVICENOTIFICATIONNUMBER 3) and then I moved the service to the WARN state at 19:19. At 19:24 I moved the service to CRIT again, and at 2022-04-19 19:25:30 I got the 4th notification, about a transition “? → CRITICAL”, with SERVICENOTIFICATIONNUMBER 4.
Then I moved the service to the “WARN” state and I didn’t get any new notification, not even the “periodic notification during service problems”. At 19:32 I moved the service to “OK” and it was detected as “flapping”, so I didn’t get any new notifications at all … Even when the flapping state ended, I didn’t get any notification about the service recovering to OK, and this is not as expected.
Not so good.
I repeated the last walk (OK → WARN → CRIT → WARN → CRIT → WARN → OK) with flap detection disabled.
At the last transition from WARN to “OK”, without flap detection, I got both notifications (periodic and escalation) for “WARN → OK”. So it is not consistent with the second test and, also strangely, it is not reported as “? → OK”, unlike the change from “WARN → CRIT” (“? → CRIT” in yesterday’s test).
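The first walk (OK → CRIT → OK) can be replayed as plain arithmetic. This is only a sketch: the function name `periodic_timeline` is made up, the recovery is assumed to happen at 18:06:00 (the test only records 18:06), and the numbering simply follows what was observed in the test, not Checkmk's internal code.

```python
from datetime import datetime, timedelta


def periodic_timeline(first: datetime, interval: timedelta, recovery: datetime):
    """Return (time, notification number, state) for one problem period.

    Problem notifications repeat every `interval` until `recovery`;
    the recovery notification then carries the next number, as observed.
    """
    events = []
    number = 1
    when = first
    while when < recovery:
        events.append((when, number, "CRIT"))
        number += 1
        when += interval
    events.append((recovery, number, "OK"))
    return events


events = periodic_timeline(
    first=datetime(2022, 4, 19, 17, 45, 19),
    interval=timedelta(minutes=5),
    recovery=datetime(2022, 4, 19, 18, 6, 0),  # assumed seconds
)
for when, number, state in events:
    marker = "escalation" if number >= 3 else ""
    print(when.strftime("%H:%M:%S"), number, state, marker)

# Numbers matched by an escalation rule starting at the 3rd notification
print([number for _, number, _ in events if number >= 3])  # [3, 4, 5, 6]
```

The result matches the first test: notifications 1 to 5 during the CRIT period, number 6 for the recovery, and escalations for 3, 4, 5 and 6.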

So in my case, using “Notified events for services” (and “Notified events for hosts” too, I suppose, but I need to test it) can work around the issue, but it is not a nice solution because this setup is “all-in”: no one can be notified any longer about, for example, a WARNING state of the services and hosts matched by those rules, not even during a different time period.