Retry Check Interval Issues

CMK version: Checkmk Raw Edition 2.3.0p26
OS version: Ubuntu 24.04

When using Ping as the Host Check Command, the configured Retry Check Interval appears to be adhered to, but when using “Status of Service Check_MK” or “Status of CheckMk Agent” the Retry Interval is sporadically ignored.

I experienced this on our live system and also on a test system I set up just to investigate it.
E.g., with a Retry Interval of 30m and a Max Retry Count of 3, under Ping the host checks are carried out correctly every 30m. With the Check_MK status as the host check, the first two host checks typically occur within a minute of each other; the last sometimes waits the 30m and sometimes does not.

Some logs from my testing:

**Using Ping as Host Check Command, 5m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-12 15:29:46 - 2 d|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:34:49 - 2 d|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:39:52 - 2 d|HOST ALERT|TestVM||HARD (DOWN)|
|2025-02-12 15:52:15 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:57:18 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:02:21 - 47 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of Service: Check_MK as Host Check Command, 5m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-12 16:19:48 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:20:04 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:25:05 - 47 h|HOST ALERT|TestVM||HARD (DOWN)|
|2025-02-12 17:11:13 - 46 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 17:11:51 - 46 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 17:16:52 - 46 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of CheckMk Agent as Host Check Command, 5m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-13 12:41:10 - 27 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 12:41:51 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 12:46:52 - 26 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of CheckMk Agent as Host Check Command, 30m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-13 12:56:01 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 13:26:02 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 13:56:03 - 25 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of Service: Check_MK as Host Check Command, 30m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-14 15:06:16 - 35 m|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-14 15:06:51 - 34 m|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-14 15:29:25 - 11 m|HOST ALERT|TestVM||HARD (DOWN)|
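
To make the pattern in these logs easier to see, here is a minimal Python sketch that computes the gaps between consecutive host alerts and checks them against the configured retry interval. The helper names (`gaps`, `honours_retry`) and the 10 second tolerance are my own assumptions for illustration, not anything Checkmk provides; the timestamps are copied from the test runs above.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def gaps(timestamps):
    """Return the time differences between consecutive alert timestamps."""
    times = [datetime.strptime(t, FMT) for t in timestamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]

def honours_retry(timestamps, retry, tolerance=timedelta(seconds=10)):
    """True if every gap is within `tolerance` of the retry interval.

    The tolerance is an arbitrary assumption to allow for check runtime.
    """
    return all(abs(gap - retry) <= tolerance for gap in gaps(timestamps))

# Timestamps taken from the test logs (first run of each scenario).
ping_5m = ["2025-02-12 15:29:46", "2025-02-12 15:34:49", "2025-02-12 15:39:52"]
check_mk_5m = ["2025-02-12 16:19:48", "2025-02-12 16:20:04", "2025-02-12 16:25:05"]
agent_30m = ["2025-02-13 12:56:01", "2025-02-13 13:26:02", "2025-02-13 13:56:03"]
check_mk_30m = ["2025-02-14 15:06:16", "2025-02-14 15:06:51", "2025-02-14 15:29:25"]

for name, stamps, retry in [
    ("ping, 5m", ping_5m, timedelta(minutes=5)),
    ("Check_MK, 5m", check_mk_5m, timedelta(minutes=5)),
    ("agent, 30m", agent_30m, timedelta(minutes=30)),
    ("Check_MK, 30m", check_mk_30m, timedelta(minutes=30)),
]:
    print(name, [str(g) for g in gaps(stamps)], honours_retry(stamps, retry))
```

With these inputs, the Ping run and the 30m agent run stay within the tolerance, while both Check_MK runs have a first gap of well under a minute.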

Admittedly, on my live system the 3rd check is far more sporadic than in my testing; only once in my testing was the 3rd check off schedule. There’s also an issue with Delay Notification, but I see that’s already been reported a few times.

This becomes a problem because we’re receiving a lot of host down alerts: the host checks complete their 3 attempts within the 30 minute retry interval of the Check_MK service, so a single missed Check_MK check (even when the second one succeeds) gives us a false host down.
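
As a rough illustration of that race, the sketch below takes the three host alerts from my last test run (Check_MK as host check, 30m retry) and compares the time it took to reach HARD with the service retry interval. It assumes the Check_MK service would be rechecked one retry interval after the first failure and that the first host alert roughly coincides with that failure; `time_to_hard` is a hypothetical helper of mine, not a Checkmk or Nagios function.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def time_to_hard(host_alerts, max_attempts=3):
    """Elapsed time from the first failed host check to the HARD state.

    Assumes each entry is one failed attempt and that the HARD state is
    reached on attempt number `max_attempts`.
    """
    times = [datetime.strptime(t, FMT) for t in host_alerts]
    if len(times) < max_attempts:
        raise ValueError("not enough failed attempts to reach HARD")
    return times[max_attempts - 1] - times[0]

# Host alerts from the 30m Check_MK test run.
host_alerts = ["2025-02-14 15:06:16", "2025-02-14 15:06:51", "2025-02-14 15:29:25"]
service_retry_interval = timedelta(minutes=30)

elapsed = time_to_hard(host_alerts)
print("time to HARD:", elapsed)
print("host goes HARD before the service is retried:", elapsed < service_retry_interval)
```

In this run the host reaches HARD (DOWN) after about 23 minutes, so the notification goes out before the service retry has had a chance to come back OK.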

This is perfectly normal for the Nagios core (Raw Edition).
The core triggers a host check if an active check (Check_MK service) fails.

But I would really reconsider your configuration; a 30 minute retry interval makes no sense.

Hi,

I did consider this, but the timings of the service checks relative to the host checks do not marry up with this theory. Take the following examples:

|Time|Event|Host|Service|State info|Summary|
|---|---|---|---|---|---|
|2025-02-13 12:22:41 - 2 d|SERVICE ALERT|SomeServer1|Check_MK|SOFT (CRITICAL)|Check_MK down|
|2025-02-13 12:22:42 - 2 d|HOST ALERT|SomeServer1||SOFT (DOWN)|Host check off back of Check_MK status change|
|2025-02-13 12:23:48 - 2 d|HOST ALERT|SomeServer1||SOFT (DOWN)|Random 2nd host check 1m6s later|
|2025-02-13 12:52:43 - 2 d|HOST NOTIFICATION|SomeServer1||NOTIFY (DOWN)||
|2025-02-13 12:52:43 - 2 d|HOST ALERT|SomeServer1||HARD (DOWN)|3rd host check 28m55s after last, or 30m1s after first host check|
|2025-02-13 12:52:45 - 2 d|HOST NOTIFICATION RESULT|SomeServer1||EXIT_CODE (SUCCESS)||
|2025-02-13 12:52:45 - 2 d|HOST NOTIFICATION|SomeServer1||NOTIFY (DOWN)||
|2025-02-13 12:52:45 - 2 d|SERVICE ALERT|SomeServer1|Check_MK|SOFT (OK)|Check_MK 2nd check 30m4s after last, all OK|
|2025-02-14 06:30:08 - 31 h|SERVICE ALERT|SomeServer2|Check_MK|SOFT (CRITICAL)|Check_MK down|
|2025-02-14 06:30:09 - 31 h|HOST ALERT|SomeServer2||SOFT (DOWN)|Host check off back of Check_MK status change|
|2025-02-14 06:46:18 - 31 h|HOST ALERT|SomeServer2||SOFT (DOWN)|Random 2nd host check 16m9s later|
|2025-02-14 07:00:06 - 30 h|SERVICE ALERT|SomeServer2|Check_MK|SOFT (OK)|Check_MK 2nd check 29m58s after last, all OK|
|2025-02-14 07:00:07 - 30 h|HOST NOTIFICATION|SomeServer2||NOTIFY (DOWN)||
|2025-02-14 07:00:07 - 30 h|HOST ALERT|SomeServer2||HARD (DOWN)|3rd host check 13m49s after last, or 29m58s after first host check|
|2025-02-15 12:03:05 - 106 m|SERVICE ALERT|SomeServer3|Check_MK|SOFT (CRITICAL)|Check_MK down|
|2025-02-15 12:03:06 - 106 m|HOST ALERT|SomeServer3||SOFT (DOWN)|Host check off back of Check_MK status change|
|2025-02-15 12:28:39 - 81 m|HOST ALERT|SomeServer3||SOFT (DOWN)|Random 2nd host check 25m33s later|
|2025-02-15 12:33:39 - 76 m|SERVICE ALERT|SomeServer3|Check_MK|SOFT (OK)|Check_MK 2nd check 30m34s after last, all OK|
|2025-02-15 12:33:40 - 76 m|HOST NOTIFICATION|SomeServer3||NOTIFY (DOWN)||
|2025-02-15 12:33:40 - 76 m|HOST ALERT|SomeServer3||HARD (DOWN)|3rd host check 5m1s after last, or 30m34s after first|
|2025-02-13 08:45:10 - 2.2 d|SERVICE ALERT|SomeServer4|Check_MK|SOFT (CRITICAL)|Check_MK down|
|2025-02-13 08:45:11 - 2.2 d|HOST ALERT|SomeServer4||SOFT (DOWN)|Host check off back of Check_MK status change|
|2025-02-13 08:45:33 - 2.2 d|HOST ALERT|SomeServer4||SOFT (DOWN)|Random 2nd host check 22s later|
|2025-02-13 08:52:42 - 2.2 d|HOST NOTIFICATION|SomeServer4||NOTIFY (DOWN)||
|2025-02-13 08:52:42 - 2.2 d|HOST ALERT|SomeServer4||HARD (DOWN)|3rd host check 7m9s after last, or 7m31s after first|
|2025-02-13 08:52:44 - 2.2 d|HOST NOTIFICATION RESULT|SomeServer4||EXIT_CODE (SUCCESS)||
|2025-02-13 08:52:44 - 2.2 d|HOST NOTIFICATION|SomeServer4||NOTIFY (DOWN)||
|2025-02-13 09:15:09 - 2.2 d|SERVICE ALERT|SomeServer4|Check_MK|SOFT (OK)|Check_MK 2nd check 29m59s after last, all OK|

As you can see above, there is always a host check immediately following the Check_MK service status change. There is then always a seemingly random 2nd check at varying intervals, without any Check_MK checks surrounding it. The 3rd host check almost always runs ~30m after the first host check, but as the final example shows, this is not always the case.
So there definitely seems to be some issue with the 2nd host check, given that firstly it doesn’t have any Check_MK service checks surrounding it and secondly it’s occurring at widely varying intervals. The 3rd host check does seem to have an issue too, albeit less frequently, given that in the final example it occurs less than 10 minutes after the first 2 checks and more than 5m away from any Check_MK check.
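
For anyone wanting to check this against their own logs, here is a small Python sketch that measures how far each host check sits from the nearest logged Check_MK service alert. It only sees the alerts that appear in the log, so it assumes every relevant service check shows up there, which may not hold if a check ran without changing state. The function name `nearest_gap_seconds` is my own; the timestamps are from the SomeServer3 and SomeServer4 examples above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def parse(ts):
    return datetime.strptime(ts, FMT)

def nearest_gap_seconds(host_checks, service_checks):
    """For each host check, the distance in seconds to the closest service alert."""
    services = [parse(s) for s in service_checks]
    return [
        min(abs((parse(h) - s).total_seconds()) for s in services)
        for h in host_checks
    ]

# Host alerts and Check_MK service alerts copied from the log excerpts.
examples = {
    "SomeServer3": {
        "service": ["2025-02-15 12:03:05", "2025-02-15 12:33:39"],
        "host": ["2025-02-15 12:03:06", "2025-02-15 12:28:39", "2025-02-15 12:33:40"],
    },
    "SomeServer4": {
        "service": ["2025-02-13 08:45:10", "2025-02-13 09:15:09"],
        "host": ["2025-02-13 08:45:11", "2025-02-13 08:45:33", "2025-02-13 08:52:42"],
    },
}

for host, data in examples.items():
    print(host, nearest_gap_seconds(data["host"], data["service"]))
```

A distance of a second or so suggests the host check was triggered by the service result; the larger values are the checks that do not line up with any logged service check.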

As for the 30m retry interval: whilst I don’t feel it’s relevant to this subject, and I’m not sure why you felt the need to comment on it, to explain, this situation relates to our SNMP checks of server IPMIs. We have the standard check interval set to every 2 hours. We’re monitoring IPMI primarily for disk failures and additionally for CRC errors, PSU errors, etc.; ultimately nothing we need to be notified of immediately. If a server has an issue that causes it to go offline or seriously degrades its performance, we’ll be notified within 15 minutes by our other monitoring. This monitoring is therefore very low priority, so I have set a 2hr standard interval with a 30m retry interval to conserve both system resources and bandwidth. Make sense to you now?

If you really want to work with such large intervals, then it is better to use “Delay host/service notifications”. As I said, the automatic host check in case of a service problem cannot be controlled by any setting. This is classic Nagios behavior.
