Host down but host notification result shows as OK?

We use the PagerDuty integration to send events to PagerDuty. One of our hosts went down and is still down, but the last event notification sent to PagerDuty said the host state was OK. Does anyone have any idea why?

Here is a screenshot showing the notification history.

Thanks for any help!

The last entry in this table is only the “Host notification result”, with the state OK.
What matters now is the time of your last PagerDuty entry and whether that time corresponds to any log messages.
You can also check the log file “notify.log” inside “~/var/log/” for more information.
The PagerDuty notification produces more than the one line you see in your screenshot.

I still do not see why the host was marked up. The PagerDuty logs and notify.log both agree that the host was marked as down at 5:00:24 AM and then marked as up at 5:31:29 AM. The host was down the entire time and showed as down in the Checkmk WebUI.

Here is the notify.log for that host and a screenshot from pagerduty.

2020-04-04 05:00:24 Got raw notification (example.hostname) context with 42 variables
2020-04-04 05:00:24 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:00:24 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:00:30 Got spool file 72ec65b0 (example.hostname) for local delivery via pagerduty-agent
2020-04-04 05:00:30 Output: Event processed. Incident Key: (‘event_source=host;host_name=example.hostname’, [0])
2020-04-04 05:02:58 Got raw notification (example.hostname;CPU utilization) context with 71 variables
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:02:58 Got raw notification (example.hostname;Disk IO SUMMARY) context with 71 variables
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:02:58 Got raw notification (example.hostname;Memory used) context with 71 variables
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:02:58 Got raw notification (example.hostname;Overall state) context with 71 variables
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:02:58 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:31:22 Got raw notification (example.hostname) context with 42 variables
2020-04-04 05:31:22 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:31:22 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:31:29 Got spool file 68c1085a (example.hostname) for local delivery via pagerduty-agent
2020-04-04 05:31:29 Output: Event processed. Incident Key: (‘event_source=host;host_name=example.hostname’, [0])
2020-04-04 05:32:03 Got raw notification (example.hostname;CPU utilization) context with 71 variables
2020-04-04 05:32:03 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:32:03 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:32:03 Got raw notification (example.hostname;Disk IO SUMMARY) context with 71 variables
2020-04-04 05:32:03 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:32:03 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:32:03 Got raw notification (example.hostname;Memory used) context with 71 variables
2020-04-04 05:32:03 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:32:03 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:33:05 Got raw notification (example.hostname;Overall state) context with 71 variables
2020-04-04 05:33:05 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:33:05 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:44:00 Got raw notification (example.hostname) context with 42 variables
2020-04-04 05:44:00 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:44:00 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:44:07 Got spool file 8a7ac482 (example.hostname) for local delivery via pagerduty-agent
2020-04-04 05:44:07 Output: Event processed. Incident Key: (‘event_source=host;host_name=example.hostname’, [0])
2020-04-04 05:46:32 Got raw notification (example.hostname;CPU utilization) context with 71 variables
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:46:32 Got raw notification (example.hostname;Disk IO SUMMARY) context with 71 variables
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:46:32 Got raw notification (example.hostname;Memory used) context with 71 variables
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 05:46:32 Got raw notification (example.hostname;Overall state) context with 71 variables
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 05:46:32 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
2020-04-04 21:41:52 Got raw notification (example.hostname) context with 42 variables
2020-04-04 21:41:52 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (others hosts)
2020-04-04 21:41:52 → does not match: The host’s name ‘example.hostname’ is not on the list of allowed hosts (another host)
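To make the host-level deliveries easier to line up against the PagerDuty timeline, here is a minimal sketch that pulls the "Got spool file" lines for one host out of notify.log. It assumes the line format shown in the excerpt above; the function name `deliveries` and the regular expression are my own and not part of Checkmk. It only lists when something was handed to the plugin, since these lines do not show the notification type.

```python
import re

# Assumption: spool lines look exactly like the ones in the excerpt above.
# Host-level entries have only the host name in parentheses; entries that
# contain "host;service" are deliberately not matched.
SPOOL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"Got spool file (?P<spool>\S+) \((?P<host>[^;)]+)\) "
    r"for local delivery via (?P<plugin>\S+)$"
)


def deliveries(lines, host):
    """Return (timestamp, spool_id, plugin) for each host-level delivery."""
    found = []
    for line in lines:
        m = SPOOL_RE.match(line.strip())
        if m and m.group("host") == host:
            found.append((m.group("ts"), m.group("spool"), m.group("plugin")))
    return found


# Sample lines copied from the log excerpt in this thread.
SAMPLE = [
    "2020-04-04 05:00:24 Got raw notification (example.hostname) context with 42 variables",
    "2020-04-04 05:00:30 Got spool file 72ec65b0 (example.hostname) for local delivery via pagerduty-agent",
    "2020-04-04 05:31:29 Got spool file 68c1085a (example.hostname) for local delivery via pagerduty-agent",
    "2020-04-04 05:44:07 Got spool file 8a7ac482 (example.hostname) for local delivery via pagerduty-agent",
]

for ts, spool, plugin in deliveries(SAMPLE, "example.hostname"):
    print(ts, spool, plugin)
```

To run it against the real file, pass an open file object instead of `SAMPLE`, for example `deliveries(open(path), "example.hostname")`, where `path` points to the notify.log mentioned earlier in the thread.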

Please read @andreas-doehler’s reply again.

The last entry in your table is not the host notification itself but the result of running the notification script. OK means that the notification was sent successfully.
The second entry in that list is the actual host notification of type DOWN.

I posted some additional information in my previous post, but here are the exact messages with the host name redacted, showing the events received: one DOWN and one UP. The host was down and remained down the entire time.

So if the last message is supposed to be a DOWN result as you said then why did it send an UP at 5:31AM?


I am just trying to figure out why it sent an UP result when the host was still down, as that doesn’t seem correct to me.

Thanks!
