Retry Check Interval Issues

TheRambler · February 14, 2025, 3:58pm

CMK version: Checkmk Raw Edition 2.3.0p26
OS version: Ubuntu 24.04

When using Ping as the Host Check Command, the configured Retry Check Interval appears to be adhered to, but when using “Status of Service Check_MK” or “Status of CheckMk Agent” the Retry Interval is sporadically ignored.

I have seen this on our live system and also on a test system I set up just to investigate it.
E.g., with a Retry Interval of 30m and a Max Retry Count of 3, under Ping the host checks are carried out correctly every 30m. When using the Check_MK status, the first two host checks typically occur within a minute of each other; the last one sometimes waits the 30m and sometimes does not.

Some logs from my testing:
|Time|Event|Host|Service|State info|
** Using Ping as host check command, 5m retry interval **
|2025-02-12 15:29:46 - 2 d|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:34:49 - 2 d|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:39:52 - 2 d|HOST ALERT|TestVM||HARD (DOWN)|

|2025-02-12 15:52:15 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:57:18 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:02:21 - 47 h|HOST ALERT|TestVM||HARD (DOWN)|

** Using Status of Service: Check_MK as host check command, 5m retry interval **
|2025-02-12 16:19:48 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:20:04 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:25:05 - 47 h|HOST ALERT|TestVM||HARD (DOWN)|

|2025-02-12 17:11:13 - 46 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 17:11:51 - 46 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 17:16:52 - 46 h|HOST ALERT|TestVM||HARD (DOWN)|

** Using Status of CheckMk Agent as Host Check Command, 5m retry interval **
|2025-02-13 12:41:10 - 27 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 12:41:51 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 12:46:52 - 26 h|HOST ALERT|TestVM||HARD (DOWN)|

** Using Status of CheckMk Agent as Host Check Command, 30m retry interval **
|2025-02-13 12:56:01 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 13:26:02 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 13:56:03 - 25 h|HOST ALERT|TestVM||HARD (DOWN)|

** Using Status of Service: Check_MK as Host Check Command, 30m retry interval **
|2025-02-14 15:06:16 - 35 m|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-14 15:06:51 - 34 m|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-14 15:29:25 - 11 m|HOST ALERT|TestVM||HARD (DOWN)|
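
As a rough cross-check of the timings in these logs, the gaps between consecutive host alerts can be computed with a few lines of Python. This is only a sketch and not part of Checkmk: the timestamps are copied by hand from two of the tables above, and the names `runs`, `gaps_in_seconds` and `format_gap` are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Host alert timestamps copied by hand from the test logs above.
runs = {
    "Ping, 5m retry": [
        "2025-02-12 15:29:46",
        "2025-02-12 15:34:49",
        "2025-02-12 15:39:52",
    ],
    "Status of Service: Check_MK, 30m retry": [
        "2025-02-14 15:06:16",
        "2025-02-14 15:06:51",
        "2025-02-14 15:29:25",
    ],
}

def gaps_in_seconds(timestamps):
    """Return the gaps, in whole seconds, between consecutive timestamps."""
    parsed = [datetime.strptime(t, FMT) for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]

def format_gap(seconds):
    """Render a number of seconds as e.g. '5m3s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs}s"

for label, stamps in runs.items():
    print(label, [format_gap(g) for g in gaps_in_seconds(stamps)])
```

With Ping the two gaps are both close to the configured 5m, whereas in the 30m Check_MK run the first gap is well under a minute and the second is noticeably short of 30m.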

Admittedly, on my live system the 3rd check is far more sporadic than in my testing; only once in my testing was the 3rd check off schedule. There’s also an issue with Delay Notification, but I see that’s already been reported a few times.

Where this becomes a problem is that we’re receiving a lot of host down alerts, because the host checks complete their 3 attempts within the 30-minute retry interval of the Check_MK service. So essentially one missed Check_MK check (even when the second one succeeds) gives us false host downs.

andreas-doehler · February 15, 2025, 5:47am

This is very normal for Nagios core (RAW edition).
The core triggers a host check if an active check (Check_MK service) fails.

But I would really think about your configuration: a 30-minute retry interval makes no sense.

TheRambler · February 15, 2025, 2:36pm

Hi,

I did consider this, but the timings of the service checks relative to the host checks do not match up with this theory. Take the following examples:

Time	Event	Host	Service	State info	Summary
2025-02-13 12:22:41 - 2 d	SERVICE ALERT	SomeServer1	Check_MK	SOFT (CRITICAL)	Check_Mk Down
2025-02-13 12:22:42 - 2 d	HOST ALERT	SomeServer1		SOFT (DOWN)	Host check off back of Check_Mk status change
2025-02-13 12:23:48 - 2 d	HOST ALERT	SomeServer1		SOFT (DOWN)	Random 2nd host check 1m6s later
2025-02-13 12:52:43 - 2 d	HOST NOTIFICATION	SomeServer1		NOTIFY (DOWN)
2025-02-13 12:52:43 - 2 d	HOST ALERT	SomeServer1		HARD (DOWN)	3rd Host check 28m55s after last, or 30m1s after first host check
2025-02-13 12:52:45 - 2 d	HOST NOTIFICATION RESULT	SomeServer1		EXIT_CODE (SUCCESS)
2025-02-13 12:52:45 - 2 d	HOST NOTIFICATION	SomeServer1		NOTIFY (DOWN)
2025-02-13 12:52:45 - 2 d	SERVICE ALERT	SomeServer1	Check_MK	SOFT (OK)	Check_MK 2nd check 30m4s after last, all OK

2025-02-14 06:30:08 - 31 h	SERVICE ALERT	SomeServer2	Check_MK	SOFT (CRITICAL)	Check_Mk Down
2025-02-14 06:30:09 - 31 h	HOST ALERT	SomeServer2		SOFT (DOWN)	Host check off back of Check_Mk status change
2025-02-14 06:46:18 - 31 h	HOST ALERT	SomeServer2		SOFT (DOWN)	Random 2nd host check 16m9s later
2025-02-14 07:00:06 - 30 h	SERVICE ALERT	SomeServer2	Check_MK	SOFT (OK)	Check_Mk 2nd check 29m58s after last, all OK
2025-02-14 07:00:07 - 30 h	HOST NOTIFICATION	SomeServer2		NOTIFY (DOWN)
2025-02-14 07:00:07 - 30 h	HOST ALERT	SomeServer2		HARD (DOWN)	3rd host check 13m49s after last, or 29m58s after first host check

2025-02-15 12:03:05 - 106 m	SERVICE ALERT	SomeServer3	Check_MK	SOFT (CRITICAL)	Check_Mk Down
2025-02-15 12:03:06 - 106 m	HOST ALERT	SomeServer3		SOFT (DOWN)	Host check off back of Check_Mk status change
2025-02-15 12:28:39 - 81 m	HOST ALERT	SomeServer3		SOFT (DOWN)	Random host check 25m33s later
2025-02-15 12:33:39 - 76 m	SERVICE ALERT	SomeServer3	Check_MK	SOFT (OK)	Check_Mk 2nd check 30m34s after last, all OK
2025-02-15 12:33:40 - 76 m	HOST NOTIFICATION	SomeServer3		NOTIFY (DOWN)
2025-02-15 12:33:40 - 76 m	HOST ALERT	SomeServer3		HARD (DOWN)	3rd host check 5m1s after last, or 30m34s after 1st

2025-02-13 08:45:10 - 2.2 d	SERVICE ALERT	SomeServer4	Check_MK	SOFT (CRITICAL)	Check_Mk Down
2025-02-13 08:45:11 - 2.2 d	HOST ALERT	SomeServer4		SOFT (DOWN)	Host check off back of Check_Mk status change
2025-02-13 08:45:33 - 2.2 d	HOST ALERT	SomeServer4		SOFT (DOWN)	Random 2nd host check 22s later
2025-02-13 08:52:42 - 2.2 d	HOST NOTIFICATION	SomeServer4		NOTIFY (DOWN)
2025-02-13 08:52:42 - 2.2 d	HOST ALERT	SomeServer4		HARD (DOWN)	3rd host check 7m9s after last, or 7m31s after first
2025-02-13 08:52:44 - 2.2 d	HOST NOTIFICATION RESULT	SomeServer4		EXIT_CODE (SUCCESS)
2025-02-13 08:52:44 - 2.2 d	HOST NOTIFICATION	SomeServer4		NOTIFY (DOWN)
2025-02-13 09:15:09 - 2.2 d	SERVICE ALERT	SomeServer4	Check_MK	SOFT (OK)	Check_Mk 2nd check 29m59s after last, all OK

As you can see above, there is always a host check immediately following the Check_Mk service status change. There is then always a seemingly random 2nd check at a varying interval, without any Check_Mk checks surrounding it. The 3rd host check almost always runs ~30m after the first host check, but as can be seen in the final example this is not always the case.
So there definitely seems to be some issue with the 2nd host check, given that firstly it doesn’t have any Check_Mk service checks surrounding it and secondly it’s occurring at varying, wide-ranging intervals. The 3rd host check also seems to have an issue, albeit less frequently, given that in the final example it occurs less than 10 minutes after both of the first 2 checks and more than 5m away from any Check_Mk check.
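
To make that last point concrete, here is a small Python sketch that measures how far each host alert in the SomeServer4 example is from the nearest Check_MK service alert. It is only an illustration: the timestamps are copied by hand from the log above, the helper name `nearest_service_gap` is made up, and it can only see logged alerts (state changes), not every check execution.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (timestamp, kind) pairs copied by hand from the SomeServer4 example above.
events = [
    ("2025-02-13 08:45:10", "service"),  # Check_MK SOFT (CRITICAL)
    ("2025-02-13 08:45:11", "host"),     # SOFT (DOWN)
    ("2025-02-13 08:45:33", "host"),     # SOFT (DOWN)
    ("2025-02-13 08:52:42", "host"),     # HARD (DOWN)
    ("2025-02-13 09:15:09", "service"),  # Check_MK SOFT (OK)
]

def nearest_service_gap(events):
    """Map each host alert to the seconds separating it from the closest service alert."""
    service_times = [
        datetime.strptime(stamp, FMT) for stamp, kind in events if kind == "service"
    ]
    result = {}
    for stamp, kind in events:
        if kind != "host":
            continue
        host_time = datetime.strptime(stamp, FMT)
        result[stamp] = min(
            int(abs((host_time - s).total_seconds())) for s in service_times
        )
    return result

for stamp, gap in nearest_service_gap(events).items():
    print(f"{stamp}: {gap}s from the nearest Check_MK alert")
```

The first two host alerts sit within seconds of the Check_MK alert, while the third is 452s (7m32s) away from the closest one, which is the gap I can't explain by a service check triggering it.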

As for the 30m retry interval, I don’t feel it’s relevant to this subject and I’m not sure why you felt the need to comment on it, but to explain: this situation relates to our SNMP checks of server IPMIs. We have the standard check interval set to every 2 hours. We’re monitoring IPMI primarily for disk failures, and additionally for CRC errors, PSU errors, etc., so ultimately nothing we need to be notified of immediately. If a server has an issue that causes it to go offline or seriously degrades its performance, we’ll be notified within 15 minutes by our other monitoring. This monitoring is therefore very low priority, and as such I have set a 2hr standard interval with a 30m retry interval to conserve both system resources and bandwidth. Make sense to you now?

andreas-doehler · February 15, 2025, 4:28pm

If you really want to work with such large intervals, then it is better to use “Delay host/service notifications”. As I said, the automatic host check in case of a service problem cannot be controlled by any setting. This is classic Nagios behavior.