Only send notifications after a service has been continuously CRIT for at least 10 minutes

Dear CheckMK forum community,

I have a few temperature sensors in servers.
Monitoring works without problems, but I receive notifications for every service state change.

Example:
Whenever a server temperature reaches WARN or CRIT, even for a second, I get a notification.

I’d like it to send a notification only after the service has been continuously in the CRIT state for 10 minutes.
It should not send a notification if the state changes from CRIT to WARN and back to CRIT. Instead, the “10 minute timer” should stop when the service reaches WARN, and a new 10 minute timer should start when it reaches CRIT again.

After all, being over 80°C (which is standard CRIT value for temperature) for a second isn’t harmful at all and I don’t need to know that.

Does somebody know how to achieve that?

Thanks in advance,
pixelpoint

Hi,

In the notification settings, set the rule to alert on the n-th attempt.

I would suggest using the rule “Delay service notifications”. It does exactly what you want. It delays the notification for a specified amount of time. If the service returns to OK during this time, there will be no notification.
Regards
Udo

Sorry for the slow reply.

I tried “Delay service notifications”, but that only holds the notification back until the delay expires, and only cancels it if the service is OK again by then.
Imagine this situation:

  • [00:00] Service CRIT
  • [00:05] Service WARN
  • [00:10] Service still WARN → send notification

What I wanted to do was this:

  • [00:00] Service CRIT
  • [00:07] Service WARN (<-- stop timer here, as temperature is not CRIT anymore)
  • [00:09] Service CRIT (<-- start timer here again, as temperature reached CRIT)
  • [00:19] Service still CRIT (<-- send notification because temperature has been CRIT for 10 minutes)
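
To make the intended behaviour concrete, here is a minimal Python sketch of the “continuous CRIT” timer described above. It is only a model of the logic, not Checkmk code; the function name `notification_times` and the `(minute, state)` event format are made up for illustration.

```python
from typing import Iterable, List, Tuple

CRIT_HOLD_MINUTES = 10  # required continuous time in CRIT before notifying


def notification_times(
    events: Iterable[Tuple[int, str]], hold: int = CRIT_HOLD_MINUTES
) -> List[int]:
    """Return the minutes at which a notification would be sent.

    `events` is a time-ordered sequence of (minute, state) check results.
    """
    sent = []
    crit_since = None  # minute at which the current CRIT streak began
    notified = False   # True once this streak has produced a notification
    for minute, state in events:
        if state != "CRIT":
            # Any non-CRIT result (WARN or OK) stops the timer.
            crit_since = None
            notified = False
            continue
        if crit_since is None:
            crit_since = minute  # start a new timer
        if not notified and minute - crit_since >= hold:
            sent.append(minute)
            notified = True
    return sent


# The timeline from the list above: the WARN at 00:07 resets the timer.
timeline = [(0, "CRIT"), (7, "WARN"), (9, "CRIT"), (19, "CRIT")]
print(notification_times(timeline))  # [19]
```

With the first timeline (CRIT at 00:00, WARN at 00:05 and 00:10) this model sends nothing, because the service is never CRIT for 10 minutes in a row.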

Thank you, I will try the following notification rule:

Match only the following services: Temperature
Match service event type: any -> OK // any -> CRIT
Restrict to n-th to m-th notification: 10 to 11

I will report back tomorrow.

Thank you for your help.

Sadly, the proposed solution with the notification rule “Restrict to n-th to m-th notification” does not work either.
After giving it some thought, I think I know why it wouldn’t work:

  1. The Temperature service only creates 1 notification per service state change, so notifications 10 and 11 never occur.
  2. Even if there were a periodic service notification every minute, it would still send out notifications for WARN and CRIT without taking the 10 minutes of CONTINUOUS CRIT STATE into account.

With the rule “periodic service notification” it’s also not possible to say “only for CRIT state”.

Does anybody else have an idea on this?

Screenshots of the notifications + time when they arrived:
[Screenshot: checkmk temperature mails]

Wouldn’t it be easier to just adjust the WARN/CRIT limits to values that work for you?

Personally, I like to have WARN early enough to be aware of something unusual before it has a noticeable impact.


The thing with this server is:

  • It gets warm every now and then
  • We have observed when this happens (management jobs, cleanup jobs, backup jobs, many simultaneous user queries, etc.)
  • Except for the planned jobs, we cannot set a service period, because user queries happen randomly
  • 80°C is okay for CRIT
  • But we don’t want to receive 30 mails a day whenever the temperature exceeds the threshold for a very limited time

That’s why we would like to receive mails, but only if the temperature exceeds the limit for too long.

I could set CRIT to 90°C, but then we would get almost no notifications at all (the server reached 90°C for a few seconds only once in the last few months).

You could use the “Maximum number of check attempts for service” ruleset:
The maximum number of failed checks until a service problem state will be considered as hard. Only hard state trigger notifications.
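
As a rough illustration of what that help text describes, here is a simplified Python model of soft and hard problem states. It is a sketch under assumptions, not the monitoring core’s actual implementation: the function `hard_problem_checks` is hypothetical, and it treats every non-OK result (WARN or CRIT alike) as a “failed check”, so a CRIT → WARN → CRIT sequence keeps counting instead of resetting.

```python
from typing import Iterable, List


def hard_problem_checks(states: Iterable[str], max_attempts: int) -> List[int]:
    """Return the indices of checks at which a problem first becomes hard."""
    hard_at = []
    attempts = 0  # consecutive non-OK results seen so far
    for index, state in enumerate(states):
        if state == "OK":
            attempts = 0  # recovery resets the counter
            continue
        attempts += 1
        if attempts == max_attempts:
            # Only at this point would a notification be triggered.
            hard_at.append(index)
    return hard_at


# Assuming one check per minute: a 3-minute spike stays soft,
# while a sustained problem becomes hard on the 10th failed check.
spike = ["OK", "CRIT", "CRIT", "CRIT", "OK"]
sustained = ["OK"] + ["CRIT"] * 12
print(hard_problem_checks(spike, max_attempts=10))      # []
print(hard_problem_checks(sustained, max_attempts=10))  # [10]
```

Under that assumption, the attempt counter only approximates “10 minutes continuously in CRIT”: short spikes are filtered out, but time spent in WARN counts towards the limit as well.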


I’m now trying a combination of rules for this server.

Notified Events for Service

  • any -> CRIT
  • any -> OK

This should only send CRIT and OK notifications.

Enable / Disable Flapping Detection for Service

  • Disable flapping detection

The state was regularly going into flapping state because of sudden temperature changes.

Service Period for Services

  • With this rule I set up a service / backup / cleanup time period in which no notifications will be sent

Many of the notifications I receive come from the service period at night, when backup and cleanup jobs are running.

Maximum Number of Check Attempts for Service

  • This rule is set to 10 attempts for all “Temperature Zone” Services

With all of these rules combined, I should have the following setup:

  • Only receive CRIT or OK messages
  • Do not receive any messages at night (cleanup and backup time period)
  • Only receive messages after 10 check retries (10 minutes)
  • No flapping detection or flapping notifications
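
The combined effect of these rules can be sketched as a single decision function. This is only a simplified Python model of the setup listed above, not how Checkmk evaluates its rulesets; `should_notify`, `in_quiet_window` and the 22:00 to 06:00 window are hypothetical, since the actual time period is not given in this thread.

```python
from datetime import time

# Hypothetical quiet window; the real backup/cleanup period is not stated here.
QUIET_START = time(22, 0)
QUIET_END = time(6, 0)
MAX_ATTEMPTS = 10  # "Maximum number of check attempts" from the list above


def in_quiet_window(now: time) -> bool:
    """True if `now` falls into a window that wraps past midnight."""
    return now >= QUIET_START or now < QUIET_END


def should_notify(new_state: str, attempts: int, now: time) -> bool:
    """Decide whether a state change to `new_state` produces a mail."""
    if new_state not in ("CRIT", "OK"):
        return False  # WARN events are filtered out
    if in_quiet_window(now):
        return False  # nothing is sent during the nightly jobs
    if new_state == "CRIT" and attempts < MAX_ATTEMPTS:
        return False  # still a soft problem
    return True


print(should_notify("CRIT", 3, time(12, 0)))   # False: too few attempts
print(should_notify("CRIT", 10, time(23, 0)))  # False: quiet window
print(should_notify("CRIT", 10, time(12, 0)))  # True
```

For simplicity the sketch lets every OK event outside the quiet window through; it does not check whether a CRIT notification was sent beforehand.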

If this is not enough, I will probably disable the WARN values (if possible) for this server.
This should lead to only OK->CRIT and CRIT->OK events.

I will let this run for a few days and will report back afterwards.
Thank you everyone for your help.

I’m very sorry for being so late, I totally forgot about this thread.
I’m marking my last answer as solution, because it worked (kinda).

The notifications have reached acceptable numbers with these rules.

Thank you all for your help and suggestions!
