HTTP code 503 immediately notified without check attempts

CMK version:
2.2.0p19
OS version:
Appliance
Error message:
We monitor several websites, and when a website has some kind of issue we don’t want to be notified immediately. CheckMK should check 3 times first. If the HTTP code is still not 200 after that, we want to be notified.
When I test with a 504, it checks 3 times before alerting.
When I test with a 503, it sends a notification right away, which I don’t want.

Setting up a notification delay doesn’t work either.

Please help me with this behaviour.

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

Hi Dave,

you can set the “Maximum number of check attempts for service” to 3 for these services.

I know, but that isn’t working.
I still get notified right away.

You can check if the service has the correct configuration when you click the service and search in the table for “Current check attempt”.
I assume that the rule does not match the service correctly.

It worked for a short while.
When I tested again after a few hours, the issue was back.

My settings:

  • Maximum number of check attempts for service: 3, Based on Services HTTPS .*
  • Notify when critical
  • Check: Just a HTTPS URI, nothing fancy
  • Service normal /retry check interval: 60/60s

When I trigger a 503 error on the website that I monitor, I get a notification directly after the first check.
When I trigger a 504 error on that same website, I get a notification after 3 check attempts.
Why is 503 so special in CheckMK?

I updated the HTTPS services so they try to match “Strings to expect in server response: 200”.
I changed the trigger back to a 503 error, and guess what: it didn’t notify me right away, but only after 3 checks!
I don’t understand why.

When I tested again a few minutes later, I got notified as soon as the first error hit again... grrr

Why is the behaviour so inconsistent? Does anyone have a clue?

Here you can see it working as I expect:

But when I try it a few minutes later:

Is there some kind of mechanism that notifies immediately if a second error occurs within a certain time period after the first one?


I think I know why it’s not working as I expect, but I cannot find proof.
Somewhere in CheckMK there may be a rule that checks how long ago the last error was triggered. If that is within a certain period, the notification is sent sooner.
I thought about the flapping settings, but I have disabled those on this host and it still notifies on the first error.

For this problem I would look inside the core log file at the specific time.
I would hope that a few more entries can be found there.

I enabled all kinds of logging, and when I trigger an error and do a tail on my /var/log/*.log | grep student-20, I get the following a couple of times.

2024-01-29 11:44:30 [7] [notification helper 4084] service "student-20.xxxx;HTTPS Homepage": postponing, notifications are disabled, but periodic notifications are enabled

I checked my configuration and this is correct: when a service has an issue and is not acknowledged, we want a recurring notification that is sent every 5 minutes (until acknowledged).
I disabled that setting just for testing purposes and triggered an issue again.

When the first CRIT state was found, no notification was sent. None was sent after 3 check attempts either, and then I saw:

2024-01-29 11:51:37 [7] [notification helper 17114] service "student-20.xxxxx;HTTPS Homepage": postponing, delayed notification.

That is also expected, because I had a rule to delay the notification by 5 minutes.

When I disabled that “Delay Notification” rule and triggered a CRIT, I received a Telegram message, but only one and not one every x minutes, which is what we want. So I enabled “Periodic notifications during service problems” and triggered a CRIT.
Suddenly I got notified after 1 check attempt. In the logs I see:

2024-01-29 14:14:59 [7] [alert helper 15834] not sending alert of type CHECKRESULT about service "student-20.XXXXX;HTTPS Homepage": there are no alert handlers defined
2024-01-29 14:14:59 [7] [alert helper 15834] not sending alert of type STATECHANGE about service "student-20.XXXXX;HTTPS Homepage": there are no alert handlers defined
2024-01-29 14:14:59 [7] [core 15788] released SerialToken{Request[student-20.XXXXX],3122} => SerialTokenFactory{3122:10}
2024-01-29 14:14:59 [7] [core 15788] scheduling service "student-20.XXXXX;HTTPS Homepage" at 2024-01-29 14:15:59 with commandline [/omd/sites/poort80hs/lib/nagios/plugins/check_http --ssl -t 60 --onredirect=follow -e 200,302,301 --sni -I 'student-20.XXXXX' -H 'student-20.XXXXX']
2024-01-29 14:14:59 [7] [core 15788] [generic pool scheduler] scheduling service "student-20.XXXXX;HTTPS Homepage" at 2024-01-29 14:15:59
2024-01-29 14:14:59 [7] [alert helper 15834] not sending alert of type CHECKRESULT about host "student-20.XXXXX": there are no alert handlers defined
2024-01-29 14:14:59 [7] [alert helper 15834] not sending alert of type CHECKRESULT about host "student-20.XXXXX": there are no alert handlers defined
2024-01-29 14:15:00 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": sending PROBLEM notification to its contacts
2024-01-29 14:15:00 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": spooling notification to rule based notifications
HOSTNAME=student-20.XXXXX
HOSTALIAS=student-20.XXXXX
2024-01-29 14:15:00 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": next notification in 5 minutes
HOSTNAME=student-20.XXXXX
HOSTALIAS=student-20.XXXXX
2024-01-29 14:15:01 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:02 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:03 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:04 [7] [core 15788] [livestatus external] spooled command 'LOG;SERVICE NOTIFICATION:xxxxxxxxxx
.......
2024-01-29 14:15:06 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:07 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:08 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:09 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:10 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:11 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:12 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:13 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:14 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:15 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:16 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:17 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:18 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:19 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:20 [7] [alert helper 15834] not sending alert of type CHECKRESULT about host "student-20.XXXXX": there are no alert handlers defined
2024-01-29 14:15:20 [7] [alert helper 15834] not sending alert of type CHECKRESULT about host "student-20.XXXXX": there are no alert handlers defined
2024-01-29 14:15:20 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification
2024-01-29 14:15:21 [7] [notification helper 16890] service "student-20.XXXXX;HTTPS Homepage": postponing, periodic notification

In my opinion it is unexpected that I receive a message here, because not all check attempts had been made yet, only 1 of the 3.
I made sure the site was OK again, waited a few minutes and triggered an error again.

When CheckMK noticed that CRIT, I was notified immediately, and that should not happen.
I checked the notification analysis and saw that my first notification at 14:15 was SERVICENOTIFICATIONNUMBER 1.
The one I triggered last (more than 5 minutes later) was SERVICENOTIFICATIONNUMBER 2.
I have no clue why that is, because the service had already been OK for more than 5 minutes.

What more can I do to accomplish this:

  • When a service goes down, check 3 times.
  • When it is still down, notify.
  • Wait x minutes and, if it is still not acknowledged, notify again.

Sorry, I think I was not clear about which log file I meant. What you posted is the cmc.log. Please also look inside “~/var/check_mk/core/history”; there you see every check attempt and what happens.

This is an example with 5 check attempts from the core log.

[1706557435] SERVICE ALERT: CO2;CO2 board co2;WARNING;SOFT;1;CO2 level is slightly too high at 1001ppm (threshold from plugin)
[1706557435] SERVICE ALERT: CO2;CO2 level (ppm);WARNING;SOFT;1;CO2/ventilation control with Watterott CO2-Ampel, thresholds taken from sensor board., Co 2 ppm: 1001.00 (warn/crit at 1000.00/1200.00)(!)
[1706557496] SERVICE ALERT: CO2;CO2 board co2;WARNING;SOFT;2;CO2 level is slightly too high at 1008ppm (threshold from plugin)
[1706557496] SERVICE ALERT: CO2;CO2 level (ppm);WARNING;SOFT;2;CO2/ventilation control with Watterott CO2-Ampel, thresholds taken from sensor board., Co 2 ppm: 1008.00 (warn/crit at 1000.00/1200.00)(!)
[1706557556] SERVICE ALERT: CO2;CO2 board co2;WARNING;SOFT;3;CO2 level is slightly too high at 1013ppm (threshold from plugin)
[1706557556] SERVICE ALERT: CO2;CO2 level (ppm);WARNING;SOFT;3;CO2/ventilation control with Watterott CO2-Ampel, thresholds taken from sensor board., Co 2 ppm: 1013.00 (warn/crit at 1000.00/1200.00)(!)
[1706557616] SERVICE ALERT: CO2;CO2 board co2;WARNING;SOFT;4;CO2 level is slightly too high at 1019ppm (threshold from plugin)
[1706557616] SERVICE ALERT: CO2;CO2 level (ppm);WARNING;SOFT;4;CO2/ventilation control with Watterott CO2-Ampel, thresholds taken from sensor board., Co 2 ppm: 1019.00 (warn/crit at 1000.00/1200.00)(!)
[1706557676] SERVICE ALERT: CO2;CO2 board co2;WARNING;HARD;5;CO2 level is slightly too high at 1029ppm (threshold from plugin)
[1706557676] SERVICE ALERT: CO2;CO2 level (ppm);WARNING;HARD;5;CO2/ventilation control with Watterott CO2-Ampel, thresholds taken from sensor board., Co 2 ppm: 1029.00 (warn/crit at 1000.00/1200.00)(!)
[1706557677] SERVICE NOTIFICATION: check-mk-notify;CO2;CO2 board co2;WARNING;check-mk-notify;CO2 level is slightly too high at 1029ppm (threshold from plugin);;
[1706557677] SERVICE NOTIFICATION: check-mk-notify;CO2;CO2 level (ppm);WARNING;check-mk-notify;CO2/ventilation control with Watterott CO2-Ampel, thresholds taken from sensor board., Co 2 ppm: 1029.00 (warn/crit at 1000.00/1200.00)(!);;
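
The history lines above are regular enough to check mechanically. The following is a minimal Python sketch, assuming the semicolon-separated field order seen in these lines (host; service; state; SOFT/HARD; attempt; output). That order is inferred from the excerpt, not from documentation, and the names `parse_alert` and `attempts_before_hard` are made up for illustration; they are not part of Checkmk.

```python
import re
from typing import NamedTuple, Optional

# Field order is inferred from the SERVICE ALERT lines quoted in this thread,
# not taken from official documentation.
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: (.*)$")


class Alert(NamedTuple):
    timestamp: int
    host: str
    service: str
    state: str
    state_type: str  # "SOFT" or "HARD"
    attempt: int
    output: str


def parse_alert(line: str) -> Optional[Alert]:
    """Return an Alert for a SERVICE ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    # Split at most 5 times so semicolons inside the plugin output survive.
    fields = match.group(2).split(";", 5)
    if len(fields) != 6:
        return None
    host, service, state, state_type, attempt, output = fields
    return Alert(int(match.group(1)), host, service, state, state_type, int(attempt), output)


def attempts_before_hard(lines, host, service):
    """Attempt number at which the service first reached a HARD state, or None."""
    for line in lines:
        alert = parse_alert(line)
        if alert and (alert.host, alert.service) == (host, service) and alert.state_type == "HARD":
            return alert.attempt
    return None


sample = [
    "[1706557435] SERVICE ALERT: CO2;CO2 board co2;WARNING;SOFT;1;CO2 level is slightly too high at 1001ppm (threshold from plugin)",
    "[1706557496] SERVICE ALERT: CO2;CO2 board co2;WARNING;SOFT;2;CO2 level is slightly too high at 1008ppm (threshold from plugin)",
    "[1706557676] SERVICE ALERT: CO2;CO2 board co2;WARNING;HARD;5;CO2 level is slightly too high at 1029ppm (threshold from plugin)",
    "[1706557677] SERVICE NOTIFICATION: check-mk-notify;CO2;CO2 board co2;WARNING;check-mk-notify;CO2 level is slightly too high at 1029ppm (threshold from plugin);;",
]
print(attempts_before_hard(sample, "CO2", "CO2 board co2"))  # 5
```

To use it on a real site, pass the lines of ~/var/check_mk/core/history instead of `sample`. A SERVICE NOTIFICATION line that directly follows a SOFT;1 alert would then stand out as a notification sent before the HARD state was reached.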

Andreas,
this morning I triggered an error again and here is my history:

[1706607839] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;SOFT;1;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706607900] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;SOFT;2;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706607960] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;HARD;3;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706607960] SERVICE NOTIFICATION: check-mk-notify;student-20.XXXXX;HTTPS Homepage;CRITICAL;check-mk-notify;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable;;
[1706607965] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706607965] SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706608125] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION RESULT: dave.greebe;student-20.XXXXX;HTTPS Homepage;OK;check_mk_telegram-notify.sh;{"ok":true,....... }

After 10 minutes I got re-notified, which is good; that is something I have set up (Periodic notification…).

I brought the site up again and within a few minutes I triggered a CRIT again.

Right after the first check, I got this:

[1706608747] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;SOFT;1;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706608748] SERVICE NOTIFICATION: check-mk-notify;student-20.XXXXX;HTTPS Homepage;CRITICAL;check-mk-notify;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable;;
[1706608752] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706608752] SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706608752] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION RESULT: dave.greebe;student-20.XXXXX;HTTPS Homepage;OK;check_mk_telegram-notify.sh;{"ok":true,"result....}

Okay, this might be because of the 10-minute periodic notification.

So I waited 13 minutes before I triggered it again. On the service details page I noticed: Service Notification Number 3. I searched for that, and in this post someone else mentioned the same thing: [Check_mk (english)] Should service notification number be reset to 0 when changing state?

When CheckMK saw that my page was in CRIT state:

[1706609893] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;SOFT;1;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706609894] SERVICE NOTIFICATION: check-mk-notify;student-20.XXXXX;HTTPS Homepage;CRITICAL;check-mk-notify;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable;;
[1706609899] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706609899] SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706609899] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION RESULT: dave.greebe;student-20.XXXXX;HTTPS Homepage;OK;check_mk_telegram-notify.sh;{"ok":true....}

A notification immediately, and not after 3 check attempts.

Again, while my service was OK (site up), I disabled the rule regarding Periodic Notification…
I checked the “Service notification number” and it was 1.

I waited a few minutes and triggered a CRIT again.
Now 3 check attempts were made and I got notified after the third:

[1706610619] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;SOFT;1;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706610679] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;SOFT;2;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706610739] SERVICE ALERT: student-20.XXXXX;HTTPS Homepage;CRITICAL;HARD;3;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706610740] SERVICE NOTIFICATION: check-mk-notify;student-20.XXXXX;HTTPS Homepage;CRITICAL;check-mk-notify;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable;;
[1706610745] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706610745] SERVICE NOTIFICATION: dave.greebe;student-20.XXXXX;HTTPS Homepage;CRITICAL;check_mk_telegram-notify.sh;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable
[1706610745] EXTERNAL COMMAND: LOG;SERVICE NOTIFICATION RESULT: dave.greebe;student-20.XXXXX;HTTPS Homepage;OK;check_mk_telegram-notify.sh;{"ok":tru....}

When I checked the “Service notification number” again, it was back at 1, so I wonder whether this is a misconfiguration or a bug.

After letting this rest for a few days I tested it again:

Periodic Notifications: 5 minutes
Check Attempts: 3
Check interval: 60s/60s

Trigger a crit on website:
9:00: OK
9:01: CRIT
9:02: CRIT
9:03: CRIT
9:03: Notification
9:08: Notification
9:13: Notification
9:18: Notification
9:20: OK → but site back
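
The notification times in this timeline follow from the three settings above. Here is a minimal sketch of that arithmetic, assuming the first notification goes out with the check that turns the state HARD and that periodic notifications then repeat at a fixed interval until recovery. The function name and its parameters are illustrative only and are not Checkmk settings.

```python
from datetime import datetime, timedelta


def notification_times(first_crit, recovery, max_attempts, check_interval, periodic_interval):
    """List the times at which notifications would be expected.

    Assumes the check that turns the state HARD is attempt `max_attempts`,
    and that repeats stop once the service has recovered.
    """
    hard_at = first_crit + (max_attempts - 1) * check_interval
    times = []
    current = hard_at
    while current < recovery:
        times.append(current)
        current += periodic_interval
    return times


day = datetime(2024, 1, 1)  # arbitrary date; only the time of day matters here
times = notification_times(
    first_crit=day.replace(hour=9, minute=1),
    recovery=day.replace(hour=9, minute=20),
    max_attempts=3,
    check_interval=timedelta(seconds=60),
    periodic_interval=timedelta(minutes=5),
)
print([t.strftime("%H:%M") for t in times])  # ['09:03', '09:08', '09:13', '09:18']
print(len(times))  # 4
```

Four notifications went out before the recovery at 9:20, which matches the four notification entries in the timeline.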

When I check the page, I see that my Service Notification Number is 4.
That cannot be correct, because it should be reset once the service is OK.

I checked this number a couple of times; it is now 30 minutes later and the Service Notification Number is still 4.

I reported this as a bug to feedback@checkmk.com

Solution found, based on the info in this post:

I had my rule “Notified events for services” set to “Service goes into critical state” only.
When my service went critical and, after a few checks, back to OK, the CMC was not notified about the recovery, so the SERVICENOTIFICATIONNUMBER stayed at x and was never reset.

I updated the rule “Notified events for services” and added “Service goes into OK state”.
I triggered a CRIT, let the service go back to OK after a few checks, and now the SERVICENOTIFICATIONNUMBER was reset to 1.
I triggered an error again, and this time it made 3 check attempts before sending the notification.
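
The behaviour described in this thread can be illustrated with a small toy model. This is only a sketch of the symptoms reported here, not Checkmk's actual implementation: the class `ToyService`, its fields, and the rule "a counter left over from an earlier problem plus periodic notifications makes the next problem notify on the first attempt" are assumptions invented to mirror what was observed.

```python
class ToyService:
    """Toy model of the symptoms in this thread; NOT how the Checkmk core is written."""

    def __init__(self, max_attempts=3, periodic=True, notify_on_ok=False):
        self.max_attempts = max_attempts
        self.periodic = periodic          # "Periodic notifications during service problems"
        self.notify_on_ok = notify_on_ok  # "Service goes into OK state" is a notified event
        self.attempt = 0
        self.stale_counter = False        # True while an earlier problem's counter was never reset

    def check(self, state):
        """Feed one check result ("OK" or "CRIT"); return True if a notification goes out."""
        if state == "OK":
            self.attempt = 0
            if self.notify_on_ok:
                # Assumption: only a notified recovery resets the counter.
                self.stale_counter = False
            return False
        self.attempt += 1
        # Assumption: a leftover counter makes the new problem look like an ongoing one,
        # so the periodic logic fires without waiting for the HARD state.
        if self.attempt == 1 and self.periodic and self.stale_counter:
            return True
        if self.attempt == self.max_attempts:
            self.stale_counter = True
            return True
        return False


def attempts_until_notified(service):
    """Count CRIT results until the first notification (gives up after 10)."""
    for n in range(1, 11):
        if service.check("CRIT"):
            return n
    return None


def two_outages(periodic, notify_on_ok):
    """Run outage, recovery, outage; return attempts needed for each notification."""
    svc = ToyService(periodic=periodic, notify_on_ok=notify_on_ok)
    first = attempts_until_notified(svc)
    svc.check("OK")
    second = attempts_until_notified(svc)
    return first, second


print(two_outages(periodic=True, notify_on_ok=False))   # (3, 1)  the reported problem
print(two_outages(periodic=False, notify_on_ok=False))  # (3, 3)  periodic rule disabled
print(two_outages(periodic=True, notify_on_ok=True))    # (3, 3)  after adding the OK event
```

The three printed cases correspond to the three situations tested in this thread: the original setup, the test with periodic notifications disabled, and the final setup with “Service goes into OK state” added.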

Finally solved my issue.
