CMK version: Checkmk Raw Edition 2.3.0p26
OS version: Ubuntu 24.04
When using Ping as the Host Check Command, the configured Retry Check Interval appears to be adhered to, but when using “Status of Service Check_MK” or “Status of CheckMk Agent” the Retry Interval is sporadically ignored.
I have seen this on our live system and also on a test system I set up specifically to investigate it.
For example, with a Retry Interval of 30m and a Max Retry Count of 3, the host checks under Ping are carried out correctly every 30m. With the Check_MK status as the host check, the first two host checks typically occur within a minute of each other, and the last one sometimes waits the full 30m and sometimes does not.
Some logs from my testing:
**Using Ping as Host Check Command, 5m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-12 15:29:46 - 2 d|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:34:49 - 2 d|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:39:52 - 2 d|HOST ALERT|TestVM||HARD (DOWN)|
|2025-02-12 15:52:15 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 15:57:18 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:02:21 - 47 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of Service: Check_MK as Host Check Command, 5m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-12 16:19:48 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:20:04 - 47 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 16:25:05 - 47 h|HOST ALERT|TestVM||HARD (DOWN)|
|2025-02-12 17:11:13 - 46 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 17:11:51 - 46 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-12 17:16:52 - 46 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of CheckMk Agent as Host Check Command, 5m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-13 12:41:10 - 27 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 12:41:51 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 12:46:52 - 26 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of CheckMk Agent as Host Check Command, 30m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-13 12:56:01 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 13:26:02 - 26 h|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-13 13:56:03 - 25 h|HOST ALERT|TestVM||HARD (DOWN)|

**Using Status of Service: Check_MK as Host Check Command, 30m retry interval**

|Time|Event|Host|Service|State info|
|---|---|---|---|---|
|2025-02-14 15:06:16 - 35 m|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-14 15:06:51 - 34 m|HOST ALERT|TestVM||SOFT (DOWN)|
|2025-02-14 15:29:25 - 11 m|HOST ALERT|TestVM||HARD (DOWN)|
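
To make the gaps easier to see, here is a minimal Python sketch that computes the time between consecutive host alerts and flags any retry that came earlier than the configured interval. The timestamps are copied from the last log above (Status of Service: Check_MK, 30m retry interval); the function names and the 60-second tolerance are my own assumptions for illustration, not anything Checkmk provides.

```python
from datetime import datetime

# Timestamps copied from the "Status of Service: Check_MK, 30m retry interval" log.
alerts = [
    ("2025-02-14 15:06:16", "SOFT"),
    ("2025-02-14 15:06:51", "SOFT"),
    ("2025-02-14 15:29:25", "HARD"),
]

RETRY_INTERVAL_S = 30 * 60  # configured Retry Interval (30m)
TOLERANCE_S = 60            # hypothetical slack for scheduling jitter


def gaps_seconds(rows):
    """Return the seconds elapsed between consecutive alert timestamps."""
    times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts, _ in rows]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def early_retries(rows, interval_s=RETRY_INTERVAL_S, tolerance_s=TOLERANCE_S):
    """Return the gaps that are shorter than the retry interval minus the tolerance."""
    return [g for g in gaps_seconds(rows) if g < interval_s - tolerance_s]


print(gaps_seconds(alerts))   # [35, 1354]
print(early_retries(alerts))  # [35, 1354]
```

For this log, both retries arrive early: 35 seconds and roughly 22.5 minutes after the previous attempt, rather than 30 minutes.
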
Admittedly, on my live system the timing of the 3rd check is far more erratic than in my testing; only once in my testing was the 3rd check off schedule. There’s also an issue with Delay Notification, but I see that’s already been reported a few times.
This becomes a problem because we’re receiving a lot of host down alerts: the host checks complete their 3 attempts within the 30-minute retry interval of the Check_MK service, so a single missed Check_MK check (even when the next one succeeds) gives us a false host down.
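
The effect can be sketched with a simplified model. This assumes the host check simply reads the most recent result of the Check_MK service, which is my understanding of the behaviour and not a statement about Checkmk internals. The function name, the times (seconds from the first failure), and the data are all hypothetical.

```python
from bisect import bisect_right


def host_goes_hard_down(service_results, host_check_times, max_attempts=3):
    """Return True if the host reaches HARD DOWN in this simplified model.

    service_results: list of (time_s, ok) for the Check_MK service, sorted by time.
    host_check_times: times at which the host check runs.
    Assumption: each host check sees the latest service result at or before its time.
    """
    times = [t for t, _ in service_results]
    failures = 0
    for t in host_check_times:
        idx = bisect_right(times, t) - 1
        if idx < 0:
            continue  # no service result yet, nothing to evaluate
        ok = service_results[idx][1]
        failures = 0 if ok else failures + 1
        if failures >= max_attempts:
            return True
    return False


# One missed Check_MK check at t=0, next service check succeeds 30 minutes later.
service = [(0, False), (1800, True)]

# Retries bunched up as in my logs: all three attempts see the same stale failure.
print(host_goes_hard_down(service, [10, 45, 1399]))    # True

# Retries spaced 30 minutes apart: the second attempt already sees the recovery.
print(host_goes_hard_down(service, [10, 1810, 3610]))  # False
```

Under this model, any three host check attempts that all fall before the next Check_MK service result will end in HARD (DOWN), regardless of whether the service would have recovered on its next run.
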