It seems that if the host changes to Down before the PING service changes to Critical, the notification email is not sent

CMK version:
OMD - Open Monitoring Distribution Version 2.1.0p15.cre

OS version:
Linux version 3.10.0-1160.76.1.el7.x86_64 (mockbuild@x86-vm-41.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jul 26 14:15:37 UTC 2022

Hello everyone,
I am using the raw edition of Checkmk to monitor my hosts, and I’ve been struggling with an issue with the email notifications. I am simply using the default PING service (when a new host is added, the default PING service is enabled for it). I don’t know how the host Down state and the PING service Critical state relate to each other, so I ran some tests by enabling and disabling the LAN on my test PC. I noticed that if the PING service changes to Critical before the host changes to Down, the email is sent. Conversely, if the host changes to Down before the PING service changes to Critical, no notification email is sent. It seems to be a race between the host state and the PING service state. I am using all the default settings and didn’t change anything, so the host check interval is 1 minute. As the screenshot shows, if the service state changes to Critical, the email is sent to my user “ming.chen”, but if only the host state changes to Down and no service state changes to Critical, no email is sent to “ming.chen”. I am very confused about how Checkmk works. Can someone please help?


The host state is also a normal ping in the background.
For the host state notification I would recommend not using this “PING” service but the host state itself. In your log you can see that it also sends mail for the host state.

Here I would recommend adjusting the settings “Maximum number of check attempts for service” and “Maximum number of check attempts for hosts”.
The default value of 1 is not very good for a production environment.
The value for hosts is normally smaller than the value for services to prevent the race condition in the host down situation.

Here I would inspect the notification rules or the contact assignment rules.
If your user is a contact only for the service and not for the host, then the observed behavior is perfectly fine, as the default notification rule only sends mail if you are a contact for the object.

Thanks for the reply. Currently I have only set up the notification for the service event types “ok -> crit” and “crit -> ok” and have not set any host event type, because I want to receive only the service state change emails rather than the host state change emails. Is this the correct way to set it up?

As for “Maximum number of check attempts for hosts” and “Maximum number of check attempts for service”, I have tried these many times with different combinations of values. Whether the host value is smaller than, bigger than, or equal to the service value, no emails are sent. The email works only when I do not set any value for these two items…

No. If the system detects a host state change, no further service state changes are notified.
The host state change takes precedence over the service state change.
Only for Down state :wink:
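
The precedence rule described here can be pictured with a small model. This is only a sketch of the behaviour as stated in this thread, not Checkmk's actual code; the function name `notifications_generated` and the event tuples are made up for illustration. Whether a generated notification actually reaches a user still depends on the notification rules and contact assignment.

```python
def notifications_generated(events):
    """Return the events that lead to a notification, in order.

    `events` is a list of (kind, state) tuples in the order in which the
    hard state changes happen, with kind being "host" or "service".
    Simplified assumption: while the host is DOWN, service state changes
    are not notified; host state changes always are.
    """
    host_down = False
    generated = []
    for kind, state in events:
        if kind == "host":
            host_down = state == "DOWN"
            generated.append((kind, state))
        elif not host_down:
            generated.append((kind, state))
    return generated


# Service goes critical first: both events produce a notification.
print(notifications_generated([("service", "CRIT"), ("host", "DOWN")]))
# Host goes down first: the later service change is suppressed.
print(notifications_generated([("host", "DOWN"), ("service", "CRIT")]))
```

This matches the two test results in the first post: with only service notifications configured, a mail arrives only in the first ordering.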

I would say the host value should be smaller than the service value under all circumstances. This is only true for the raw edition. The enterprise edition with Smart Ping is a different thing.

Thanks. So what should I set so that the host changes to Down only after the PING service has changed to Critical, so that the service notification email is sent before the host goes down?

Don’t try to do this - let the host down mail also be sent to you and that’s it.

So I should set up the notification as below for both host and service event types, and I should then receive both the host up/down changes and the service “ok <-> critical” changes, correct?

These changes you will only receive if the host is not down.
That is the reason why the check attempts for host should be lower than the service check attempts. Then there will be no mix between host down and service critical.
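
A rough way to see why the lower host value wins the race is to compare how long each object needs to reach its hard state. The sketch below is an approximation under stated assumptions: host and service both fail for the first time at the same moment, and both are rechecked every 60 seconds (the retry interval is an assumption, the thread only mentions a 1-minute check interval). The function name is hypothetical.

```python
def seconds_until_hard_state(max_attempts: int, retry_interval_s: int) -> int:
    """Time from the first failed check until the hard state, in seconds.

    The first failed check counts as attempt 1; every further attempt is
    assumed to happen one retry interval later.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return (max_attempts - 1) * retry_interval_s


host_delay = seconds_until_hard_state(max_attempts=2, retry_interval_s=60)
service_delay = seconds_until_hard_state(max_attempts=3, retry_interval_s=60)
print(host_delay, service_delay)  # 60 120
```

With these numbers the host reaches its hard Down state a full interval before the service could reach hard Critical, so only the host notification is generated.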

I followed your suggestion and set the host value to 2 and the service value to 3, then tried disabling and enabling the LAN on the test PC. I noticed that the mail is sometimes sent to checkmk.slsh@xxxxx (the fallback email address I set up in the global settings; it is an invalid address because I don’t want to receive any fallback email) and sometimes to ming.chen, which is the address where I expect to receive the email. Can you please advise why that is? Thank you!


If you don’t want to receive any fallback mail, I would not set a fallback address at all.

As for the notification that was not delivered, I would take a look at the “analysis” button for this notification.
This can give some insights why it is not delivered to your user.

Thanks for your reply, Andreas. I ran some tests today; the results are below.
1. Set the max host value to 2 and the max service value to 3, then disabled the LAN: no email is sent, regardless of whether the PING service changes to Crit before or after the host changes to Down.
2. Left the max host/service values unset (default), then disabled the LAN: if the PING service changes to Crit before the host changes to Down, the email is sent. If it changes afterwards, no email is sent.

In conclusion, the email is sent only when the max host/service values are not set and the service changes to Crit before the host changes to Down. This is what I have observed.

ping service changes to Crit before Host changes to down, email can be sent


BTW, do we need to add rules for the PING and host check parameters, or can we simply rely on the default PING service that comes along when the host is created?

Hi Andreas, if I set the max host value lower than the max service value, I think this makes the host go down before the service goes to Critical, right? I tried that and noticed that only the host up/down email is sent, but the service ok/critical email is not. Is there any way to have both emails sent?

Yes

correct - that is what should be achieved

Not really - if a host is down then all the services are normally in a virtual “unknown” state.

Thanks for your reply, Andreas. Besides the raw edition, we are also using the enterprise edition, which we purchased … As you mentioned before, the enterprise edition uses Smart Ping, which is different from the raw edition. My current setup is: I use the PING service under Setup > Services > HTTP, TCP, Email, … > Check hosts with PING (ICMP Echo Request), set the max service value to 5, do not set the max host value, and set the Smart Ping value to 400 sec (as 400 is slightly greater than (5+1)*60, the host should go down slightly after the PING service goes to Critical). This setting seems to work fine: I receive the service ok/critical email (I did not configure host up/down emails). Is this setting correct, or is there any problem with it? Thanks.
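
The arithmetic behind the 400-second value can be written out as follows. This only reproduces the estimate from the post; the 60-second interval is assumed from the 1-minute default mentioned earlier in the thread, and whether `(attempts + 1) * interval` matches the actual scheduling has not been verified here. The constant names are made up for illustration.

```python
CHECK_INTERVAL_S = 60        # assumed: 1-minute check interval
MAX_SERVICE_ATTEMPTS = 5     # value from the post
SMART_PING_TIMEOUT_S = 400   # value from the post

# Estimate used in the post: (attempts + 1) intervals until the service
# has reached its critical state.
service_critical_after_s = (MAX_SERVICE_ATTEMPTS + 1) * CHECK_INTERVAL_S
margin_s = SMART_PING_TIMEOUT_S - service_critical_after_s

print(service_critical_after_s, margin_s)  # 360 40
```

Under these assumptions the host is declared down about 40 seconds after the service has gone critical, which is why the service mail still gets through.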




If only the host up/down email can be sent, how can we receive the service ok/critical emails when the host has services that we need to monitor and for which we expect email notifications?

If the host is up you will receive the service notifications.

Thanks, Andreas. One more question about the notification configuration: how does it work? Does it send an email only for the first rule that matches, or for all the rules that match?

Hi Andreas, can you please advise what the issue could be if I set both the max host and max service values to the same number, say 3, and set up both host and service email notifications? Will there be any problem, such as emails not being sent? I tested it and it seems to work fine: I receive the host down/up email, but not the PING service ok/crit email (the same as when I set the max host value lower than the service value). Thanks.

I am worried that if I only receive the host up/down email but not the service ok/critical email, I will miss the notification when the PING packet loss reaches the critical level, say 90%, or the RTA is over 500 ms, while the host is not dead.

The second one - all rules are evaluated and at the end all the resulting notifications are sent. I have systems with over 40 rules here.
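
The "all rules are evaluated" behaviour can be sketched like this. It is a simplified model and not the real rule engine: the `Rule` class, the `evaluate_rules` function and the contact `ops.team` are hypothetical, and the sketch assumes that a contact who is selected by several rules for the same method still gets only one notification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    description: str
    matches: Callable[[dict], bool]
    contacts: List[str]
    method: str = "mail"


def evaluate_rules(rules, event):
    """Walk through every rule and collect the resulting notifications."""
    resulting = {}
    for rule in rules:  # no early exit: every rule is looked at
        if rule.matches(event):
            for contact in rule.contacts:
                # Same contact and method from several rules collapses to one entry.
                resulting[(contact, rule.method)] = rule.description
    return sorted(resulting)


rules = [
    Rule("service problems", lambda e: e["what"] == "service", ["ming.chen"]),
    Rule("host problems", lambda e: e["what"] == "host", ["ming.chen"]),
    Rule("everything to ops", lambda e: True, ["ming.chen", "ops.team"]),
]
print(evaluate_rules(rules, {"what": "service", "state": "CRIT"}))
# [('ming.chen', 'mail'), ('ops.team', 'mail')]
```

The notifications are only sent after the whole list has been processed, so the order of the rules does not decide whether a later matching rule is considered.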

There doesn’t need to be an issue. The same value means nothing without information about the check interval. The 3 only means that you need 3 check attempts to reach a hard state. That’s all. After the hard state is reached, a notification event is generated and the notification rules are processed.
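
As an illustration of what "3 check attempts to reach a hard state" means, here is a minimal counter. It is a sketch of the general soft/hard state idea, not the monitoring core's implementation; the class name `AttemptCounter` is made up, and treating a recovery from a hard problem as a notification event is an assumption of this model.

```python
class AttemptCounter:
    """Counts consecutive failed checks until the hard state is reached."""

    def __init__(self, max_attempts: int):
        self.max_attempts = max_attempts
        self.failed_attempts = 0
        self.hard_problem = False

    def record(self, ok: bool) -> bool:
        """Feed one check result; return True if a notification event is generated."""
        if ok:
            recovered = self.hard_problem
            self.failed_attempts = 0
            self.hard_problem = False
            return recovered  # only a recovery from a hard problem is an event
        if self.hard_problem:
            return False  # already in the hard state, nothing new to report
        self.failed_attempts += 1
        if self.failed_attempts >= self.max_attempts:
            self.hard_problem = True
            return True
        return False


counter = AttemptCounter(max_attempts=3)
results = [counter.record(ok) for ok in [False, False, False, False, True]]
print(results)  # [False, False, True, False, True]
```

How long the three attempts take in wall-clock time depends entirely on the check and retry intervals, which is why the number alone says little.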

It depends. What is your host check command and what is your check interval?

Hi Andreas, below are the host check command and the check interval. I believe they are the defaults. Is there any issue?