CheckMK 2.2 - Delay service notifications doesn't work

marczako · October 27, 2023, 9:12pm

Checkmk Raw Edition 2.2.0p12
Ubuntu 22.04.3 LTS (Jammy Jellyfish)

Hello,

I want to set up delayed notifications for services.
I tested this on one host in the folder “Passive”.

I have configured a rule for e-mail notifications, and it works.
I have configured a service notification delay of 5 minutes for the “Passive” folder, which contains this host.
In the service parameters on this specific host I can see “Delay service notifications Rule 1 in Passive 5 minutes”, and in the rules I can see that it matches the host and service.

When the service changes state, the notification is sent immediately. In the service events I can see:

Events of service:
Final notification result 38 m SERVICE NOTIFICATION RESULT EXIT_CODE (SUCCESS) Spooled mail to local mail transmission agent

User notification 38 m SERVICE NOTIFICATION NOTIFY (WARNING) Total CPU: 9.93% (warn/crit at 4.00%/99.00%)WARN

Core produced a notification 38 m SERVICE NOTIFICATION NOTIFY (WARNING) Total CPU: 9.93% (warn/crit at 4.00%/99.00%)WARN

Service alert 38 m SERVICE ALERT

What am I doing wrong, or what changed in 2.2?
I have another monitoring site on version 2.1, and there it works as it should…

jhouxatjvx · February 1, 2024, 6:57pm

I’m new to Checkmk, but there must surely be some bugs in the way the notification delay works.

We found that if you name a host with something other than the FQDN, the delay notifications rule does not apply at all. Notifications are instant.

After switching to FQDN hostnames, the notification delay starts working. However, the delay interval is erratic. When set to 5 minutes, the notifications trigger anywhere between 1 minute 39 seconds and 5 minutes after the state change, with most of them triggering between two and a half and four and a half minutes. It’s just seriously erratic.

I haven’t seen anyone else complain about the erratic timing. But given that we proved there’s a disparity between FQDNs and non-FQDNs, and given the OP’s report of a change in behavior between versions, it seems clear that fresh bugs were introduced recently.

MarsellusWallace · February 8, 2024, 1:25pm

Hi James (@jhouxatjvx),

which version and edition are you using? I’d like to try to reproduce this in order to have more information…

@marczako: you have not followed up since October. What is your current situation regarding this? If delayed notifications still do not seem to work, please report the version and edition you are currently using, too.

Best regards,
Marsellus W.

marczako · February 8, 2024, 7:56pm

I stopped using e-mail notifications in 2.2…

jhouxatjvx · February 9, 2024, 2:00pm

Raw edition. Also, if you search these forums, you will find that there are other posts with people complaining about notification behavior changing between 2.1 and 2.2. They had working systems that broke when they moved to 2.2.

MarsellusWallace · February 21, 2024, 8:07am

Hi guys,

apologies for the delay, we have a lot of work. I am now starting to reproduce this with the following setup:

CMK RAW (CRE) 2.2.0p12
two hosts, 1* shortname, 1* FQHN
two local check services each: 1* always OK, 1* dynamic state
delay for service notifications set to 5 mins

I will test the following for each of the two hosts:

service state changes to non-OK and recovers to OK in under 5 mins => no notification should get triggered
service changes to non-OK and stays in this state for a longer time => a notification should get triggered after 5 mins
service recovers to OK after a notification has been triggered => a notification should get triggered, because the delay applies only to PROBLEM notifications, not to RECOVERY notifications (I assume)
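
The three cases above can be written down as a tiny model of the expected outcomes. This is only a sketch of the test expectations, not Checkmk code: the function name `expected_notifications` and its parameters are made up for illustration, and the undelayed RECOVERY notification is the same assumption as stated in the third case.

```python
def expected_notifications(problem_minutes, delay_minutes=5, recovers=True):
    """Return the notifications expected for one OK -> non-OK (-> OK) cycle.

    problem_minutes: how long the service stays non-OK (hypothetical input).
    delay_minutes:   configured notification delay.
    recovers:        whether the service returns to OK afterwards.

    Each entry is (type, minute after the initial state change).
    """
    notifications = []
    # A PROBLEM notification is only expected if the problem outlasts the delay.
    problem_sent = problem_minutes >= delay_minutes
    if problem_sent:
        notifications.append(("PROBLEM", delay_minutes))
    # Assumption: RECOVERY is not delayed, and is only sent if a PROBLEM
    # notification went out before.
    if recovers and problem_sent:
        notifications.append(("RECOVERY", problem_minutes))
    return notifications


print(expected_notifications(3))                   # case 1: short flap
print(expected_notifications(10, recovers=False))  # case 2: lasting problem
print(expected_notifications(10))                  # case 3: problem, then recovery
```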

After those tests I will update to the latest RAW version (2.2.0p22) and repeat the tests, and finally update the site to the latest Enterprise Edition (CEE) and repeat them again…

I’ll get back to you with my findings!

BR,
Marsellus W.

MarsellusWallace · February 27, 2024, 5:26am

Moin guys,

we found the following:

In the RAW edition (or when using the Nagios core in the commercial editions), only the first notification (OK → non-OK) is delayed, for full Nagios compatibility.
When using the Checkmk Micro Core, all notifications get delayed.
Recovery notifications are never delayed in either edition.
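
As a rough illustration, these three findings can be expressed as a small decision function. This is a sketch of the described behavior only, not code from either core; the name `is_delayed` and the string values `"nagios"` and `"cmc"` are assumptions chosen for the example.

```python
def is_delayed(core, previous_state, new_state):
    """Say whether a notification for this state change would be delayed.

    core:   "nagios" (RAW edition / Nagios core) or "cmc" (Checkmk Micro Core).
    states: "OK", "WARN", "CRIT" or "UNKNOWN".
    """
    if new_state == "OK":
        # Recovery notifications are never delayed, with either core.
        return False
    if core == "cmc":
        # Micro Core: every problem notification is delayed.
        return True
    if core == "nagios":
        # Nagios core: only the initial OK -> non-OK notification is delayed.
        return previous_state == "OK"
    raise ValueError(f"unknown core: {core!r}")


for core in ("nagios", "cmc"):
    for change in (("OK", "WARN"), ("WARN", "CRIT"), ("CRIT", "OK")):
        print(core, change, is_delayed(core, *change))
```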

I was unable to reproduce any of your reported issues, neither in 2.2.0p12 nor in 2.2.0p22.

BR,
Marsellus W.

Niclas1 · April 10, 2024, 3:19pm

@MarsellusWallace I have the same problem with Checkmk Raw 2.2.0p24.
Did you try it in a distributed environment? (The hostname and the Checkmk hostname are not the same.)
“Maximum number of check attempts for service” works, but it is not the same thing.

MarsellusWallace · April 10, 2024, 4:23pm

Hi @Niclas1,

As I wrote, when using the Nagios core, only the first/initial notification (from OK to non-OK) is delayed. No other notifications are.

I do not know your complete configuration or your exact issue, but those were my findings, and maybe your delay issue is explained by this (and would therefore be expected behavior rather than a bug)…

Have a look at the events of the affected host and its services; usually the solution can be found there…

BR,
Marsellus W.

Niclas1 · April 10, 2024, 5:17pm

Thanks @MarsellusWallace

I apologize: since I updated from p17 to p24 it works (without any changes, never mind).
Maybe the rule was reapplied after I set it to a specific host and back…

But while testing today I noticed that the delay is also applied again if a service changes state. For example: OK->WARN (5 min.), then WARN->CRIT (a new 5 min.), where the first 5 min. had not yet elapsed when the service went into the CRIT state → the notification arrives around 9 min. after the first state change.
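
The arithmetic behind that example can be sketched as follows. This models only the behavior observed here (the delay timer restarting on every problem state change), not how the core is actually implemented; the function name is made up, and the 4 minute mark for WARN->CRIT is an assumption chosen to match the "around 9 min." observation.

```python
def first_notification_minute(problem_changes, delay_minutes=5):
    """Return the minute at which the first problem notification goes out.

    problem_changes: list of (minute, state) for non-OK state changes,
                     in time order, counted from the first state change.
    Assumption: each state change that arrives while the delay is still
    running restarts the delay from that moment.
    """
    due = None
    for minute, _state in problem_changes:
        if due is not None and minute >= due:
            # The pending notification was already sent before this change.
            break
        due = minute + delay_minutes
    return due


# OK->WARN at minute 0, WARN->CRIT at minute 4 (assumed): 4 + 5 = 9
print(first_notification_minute([(0, "WARN"), (4, "CRIT")]))
```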

Best regards,
Niclas