Send any->OK notification only if a delayed notification was sent

birckoff · February 23, 2022, 3:23pm

Hello Checkmk community,

I have been using Checkmk Raw Edition 2.0 p7 for a few months to monitor filesystem usage on some servers, and everything works as expected.
But sometimes big files are used temporarily and trigger a notification. I saw in the documentation how to set a delay rule (5 minutes) to avoid that, and I no longer receive the OK->CRIT alerts unless they persist.
But I still receive the CRIT->OK (or WARN->OK) notifications, sometimes 100 times over a weekend. Important notifications are much less visible as a result.
How can I receive any->OK notifications ONLY when a previous alert was sent?

Thank you for your help.

martin.schwarz · February 23, 2022, 3:43pm

Hi and welcome to the Checkmk community!

Instead of (or perhaps in addition to) just delaying the notification for a fixed amount of time, have a look at the Maximum number of check attempts for service rule set.

The service will go into a “soft” problem state for the first x check attempts. If it recovers before reaching max attempts, no notification will be sent (only on “hard” problem state).
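As a rough illustration of the soft/hard idea (a toy model, not Checkmk's actual implementation), the sketch below counts consecutive non-OK check results and reports the checks at which a problem would become "hard". The function name and the assumption that the state turns hard exactly on the Nth consecutive failure are mine; recovery notifications are not modelled at all.

```python
def hard_problem_checks(results, max_attempts):
    """Return the indices of checks at which a problem turns 'hard'.

    Assumption: a problem becomes hard on the max_attempts-th
    consecutive non-OK result, and any OK result resets the counter.
    """
    hard_at = []
    failures = 0  # consecutive non-OK results seen so far
    for index, state in enumerate(results):
        if state == "OK":
            failures = 0  # a soft problem that recovers is forgotten
            continue
        failures += 1
        if failures == max_attempts:
            hard_at.append(index)  # only here would a problem notification go out
    return hard_at


# A short spike (two CRIT checks) stays soft; the longer one goes hard.
checks = ["OK", "CRIT", "CRIT", "OK", "CRIT", "CRIT", "CRIT", "CRIT"]
print(hard_problem_checks(checks, max_attempts=3))
```

With a one-minute check interval, `max_attempts=3` in this model means a spike shorter than three checks never produces a problem notification.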

More details can be found in the docs (section “Managing sporadic errors”):

birckoff · February 23, 2022, 4:13pm

Hi Martin,

Thank you for the welcome.

I tried this rule too, following the documentation, before posting my problem.
It worked, in that I didn’t receive notifications for soft problems, but I still receive all the CRIT->OK notifications.
I can’t just turn them off, because if I receive a notification for a hard problem, I would then not receive the notification when the problem fixes itself…

elias.voelker · February 23, 2022, 8:18pm

Just as an idea: what if you split this into two rules?

One for the OK → CRIT notification (using the Maximum number of check attempts for service rule set.).

And another one for the ANY → OK, using the Delay rule set.

You could then set the maximum number of attempts for Rule 1 to, say, 5, so after 6 minutes you assume it’s a real problem and trigger a notification.

If you set the delay for Rule 2 to more than 5 times your check interval, it should only notify if the ANY state persists for more than the threshold for “real” problems.

Does that make sense?

birckoff · February 25, 2022, 8:59am

Hi Elias,

Thanks for your answer. I considered splitting the rule into two delay notification rules, one for OK->any and the other for any->OK, but unfortunately filesystem size notifications are quite random. Example:

1st type of event:
0:00: event OK->CRIT
0:05: 5 min delay elapsed, so notification sent.
0:06: event CRIT->OK
0:11: 5 min delay elapsed, so notification sent.

2nd type of event:
11:00: event OK->CRIT
11:03: event CRIT->OK (less than 5 min, so no error notification sent)
11:08: OK notification sent.

If I don’t add the second rule, I receive the OK notification as soon as the service is back, but I was never informed that there was a problem in the first place. And of course the second type of event occurs much more often than the first.

elias.voelker · February 25, 2022, 9:20am

Hi Julien,

In the 2nd type of event, my understanding is that if you delay it with the Delay service notifications rule set, it should work like this:

You define a notification that should be triggered when your service goes from ANY (or CRIT) to OK
You define a delay of x minutes. Thinking about it, the delay should be set to a time between your “real problem” threshold and the time it typically takes for your “not a real problem” services to come back up. So if this is normally the case within two minutes, then the delay should be 3 minutes.

The docs say this about delays:

a notification will be delayed until this time has expired. Should the OK / UP state occur again before then, no notification will be triggered.

That way I understand your two cases would look like this:

1st type of event (“REAL PROBLEM”):
0:00: event OK->CRIT
0:05: 5 min delay elapsed, so notification sent.
→ you fix the real problem
0:00: event CRIT->OK
0:03: 3 min delay elapsed, so notification sent.

2nd type of event (“not a real problem”):
11:00: event OK->CRIT
11:02: event CRIT->OK (less than 5 min, so no error notification sent)
11:02: event CRIT->OK (“recovery” notification prepared)
11:05: after the 3 min delay the service is back to OK, so no OK notification sent.

I admit it’s not perfect, because you’re basically playing with the time between your “typical self-recovery time” and your “real problem threshold”. So if your self-recovery takes a bit longer than you estimated, then this won’t work. But maybe it’s worth a try?

All of this of course assumes that you can also delay ANY → OK notifications using the Delay service notifications rule set, and that they are dropped if the service is OK by the end of the delay (which defeats the purpose of the ANY → OK notification).

birckoff · February 25, 2022, 10:02am

Hi Elias, in the second type of event, the recovery notification is prepared AND sent (because the service stays in the same state for longer than the delay), even if the error notification is not sent. Out of 10 mails I receive from check_mk, I may have 1 real alert, 1 recovery for that real alert, and 8 recoveries for problems that were not real.

I am thinking about a script that sets a “notified” tag when there is a real problem and the notification is actually sent, and another script that sends a recovery notification only if the “notified” tag is present, then deletes the tag.
Do you think this is possible?
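For illustration, the “notified” tag idea could be sketched roughly as below. This is only a sketch under several assumptions: that it runs as a custom notification script which receives `NOTIFY_NOTIFICATIONTYPE`, `NOTIFY_HOSTNAME` and `NOTIFY_SERVICEDESC` in its environment, that a marker file per host/service is an acceptable "tag", and that the marker directory (`NOTIFIED_STATE_DIR`, default `/tmp/notified_flags`) is a hypothetical location chosen here. The actual delivery (mail etc.) is left as a placeholder.

```python
import os
from pathlib import Path

# Hypothetical marker directory; not a Checkmk default.
STATE_DIR = Path(os.environ.get("NOTIFIED_STATE_DIR", "/tmp/notified_flags"))


def flag_path(state_dir, host, service):
    """One marker file per host/service pair, with a filesystem-safe name."""
    name = f"{host}__{service}".replace("/", "_").replace(" ", "_")
    return Path(state_dir) / name


def should_send(notification_type, host, service, state_dir=STATE_DIR):
    """Return True if the notification should be forwarded.

    PROBLEM: always forwarded, and the "notified" marker is set.
    RECOVERY: forwarded only if a marker exists; the marker is then removed.
    Anything else: passed through unchanged.
    """
    marker = flag_path(state_dir, host, service)
    if notification_type == "PROBLEM":
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
        return True
    if notification_type == "RECOVERY":
        if marker.exists():
            marker.unlink()
            return True
        return False  # nobody was told about the problem, so stay silent
    return True


def main():
    """Entry point when used as a notification script (call main() at the end of the file)."""
    env = os.environ
    send = should_send(
        env.get("NOTIFY_NOTIFICATIONTYPE", ""),
        env.get("NOTIFY_HOSTNAME", ""),
        env.get("NOTIFY_SERVICEDESC", ""),
    )
    if send:
        # Placeholder: hand over to the real delivery mechanism here.
        print("would send notification")
    return 0
```

The sketch only helps if the delayed problem notifications and all recovery notifications are both routed through this one script, so that the marker reflects what was really sent.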

elias.voelker · February 25, 2022, 10:16am

There I am out of my depth, unfortunately…

birckoff · February 25, 2022, 10:18am

No problem.
I will look into scripting.

Thank you, Elias and Martin, for your precious time.

robin.gierse · February 28, 2022, 7:13am

Have you looked into recurring, flexible downtimes?
You could schedule them to cover the window in which the file system fills up and have them run for only a few minutes.
Other idea: Why not size the file systems accordingly? Sure, you do not need the space all the time, but one day the volume will fill up completely and ruin your day.
Lastly, you could also try to change the file system thresholds, perhaps depending on the time of day, so there are no CRIT states in the first place. Trying to avoid notifications is rarely the right approach.
