Notifications are sent out before "first notification delay" ends

CMK version: CRE 2.3.0p19
OS version: Ubuntu 20.04

I have a script to change the first notification delay for hosts and services (it uses the REST API for that). I’ve used it to set 30 minutes and verified that it works by looking at Setup > Services > Service monitoring rules > Delay service notifications. Yet the notifications are being sent after 24 minutes.
This is obviously not a big deal, as I could just set it to 36 minutes instead. I just find it puzzling how notifications can be sent before the configured amount of time has passed.

Examples from nagios logs:

[Sun Oct 27 02:56:23 2024] SERVICE ALERT: foo01;fail2ban_errors;WARNING;SOFT;1;check text
[Sun Oct 27 02:58:23 2024] SERVICE ALERT: foo01;fail2ban_errors;WARNING;SOFT;2;check text
[Sun Oct 27 03:00:23 2024] SERVICE ALERT: foo01;fail2ban_errors;WARNING;HARD;3;check text
[Sun Oct 27 03:24:23 2024] SERVICE NOTIFICATION: check-mk-notify;foo01;fail2ban_errors;WARNING;check-mk-notify;check text
[Fri Oct 25 00:26:02 2024] SERVICE ALERT: foo02;MySQL Instance mysql;UNKNOWN;SOFT;1;Item not found in monitoring data
[Fri Oct 25 00:28:05 2024] SERVICE ALERT: foo02;MySQL Instance mysql;UNKNOWN;SOFT;2;Item not found in monitoring data
[Fri Oct 25 00:29:58 2024] SERVICE ALERT: foo02;MySQL Instance mysql;UNKNOWN;HARD;3;Item not found in monitoring data
[Fri Oct 25 00:54:08 2024] SERVICE NOTIFICATION: check-mk-notify;foo02;MySQL Instance mysql;UNKNOWN;check-mk-notify;Item not found in monitoring data

As I understand it, without a delay, notifications should be sent the moment a check goes HARD, so if I set a delay, it should delay the notification from when a check went HARD. Looking at the first example, it became HARD at 03:00:23, so using a 30 minute first notification delay, the notification should be sent at about 03:30:23. It was sent at 03:24:23 though.

Is this a bug or am I missing something?

Not exactly a bug. The original Nagios documentation defines the option like this:

first_notification_delay:	This directive is used to define the number of "time units" to wait before sending out the first problem notification when this service enters a non-OK state. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will start sending out notifications immediately.

So I would say the delay is counted from the last time the service was in OK state, not from the moment it went HARD. That fits all of your logs: each notification comes 30 minutes after the last OK.
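To check this against the first example, here is a small Python sketch of the arithmetic. The timestamps are copied from the log lines above. The "implied last OK" value is only an inference (notification time minus the configured 30 minutes), since the last OK check itself does not appear in the posted log, and the names used here are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp format used in the nagios log lines, e.g. "Sun Oct 27 02:56:23 2024"
LOG_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse(ts: str) -> datetime:
    """Parse a nagios log timestamp into a naive datetime (no time zone)."""
    return datetime.strptime(ts, LOG_FORMAT)


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time from start to end, in minutes."""
    return (end - start).total_seconds() / 60


# Values taken from the foo01 / fail2ban_errors example
first_soft = parse("Sun Oct 27 02:56:23 2024")
hard = parse("Sun Oct 27 03:00:23 2024")
notified = parse("Sun Oct 27 03:24:23 2024")

# Configured delay, assuming the default interval_length of 60 seconds
first_notification_delay = timedelta(minutes=30)

since_hard = minutes_between(hard, notified)
since_first_soft = minutes_between(first_soft, notified)

# Inference only: if the delay is counted from the last OK state,
# the last OK check must have happened at about this time.
implied_last_ok = notified - first_notification_delay

print(f"minutes since HARD:       {since_hard}")
print(f"minutes since first SOFT: {since_first_soft}")
print(f"implied last OK:          {implied_last_ok}")
```

The notification arrives 24 minutes after the HARD state and 28 minutes after the first SOFT state. The implied last OK time is 02:54:23, two minutes before the first SOFT alert, which matches the two-minute spacing of the checks in the log.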

Right, that makes sense. Thanks!
