Every 60 seconds I get 40 to 50 hosts flapping between OK and DOWN states, which causes a lot of false positives.
Is there any configuration in Checkmk that could be the root cause of this?
I don’t want to turn off flap detection for those hosts. Also, Checkmk says they are flapping, but in reality they’re always UP, so it’s not a switch/firewall/access point problem.
Do you mean that these hosts are shown as DOWN for a few seconds and then UP again?
Can you please show some lines from the host events of such a host?
Yeah, they show as DOWN, and if I refresh again they’re gone, and vice versa.
Is there a latency setting or some other configuration that would make sense here? Some of these hosts are at pretty distant locations, so the latency is not that good. Could that be the cause?
Unfortunately, I can’t show you any information, as it is confidential.
You can remove the hostnames or IPs from the log; what’s important is why you get a log entry. Without this information, nobody can say why it happens in your case.
Problem found. Why is the deadline of your Smart PING set to 5 seconds?
It only sends one ping every 6 seconds.
The default deadline is 15 seconds, which means 2.5 normal intervals of 6 seconds.
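To see why a 5-second deadline with a 6-second ping interval flaps even on a perfect network, here is a simplified Python model. It assumes (my assumption, not Checkmk’s actual Smart PING implementation) that a host counts as UP only while the most recent reply is younger than the deadline. `host_state`, `down_seconds` and the `lost` parameter are hypothetical helpers for illustration.

```python
# Simplified model of a ping-with-deadline host check.
# ASSUMPTION: this is NOT Checkmk's Smart PING implementation; it only models
# "the host is UP while the latest reply is younger than `deadline` seconds".


def host_state(reply_times, now, deadline):
    """Return 'UP' if a reply arrived less than `deadline` seconds before `now`."""
    last = max((t for t in reply_times if t <= now), default=None)
    if last is None or now - last >= deadline:
        return "DOWN"
    return "UP"


def down_seconds(interval, deadline, duration, lost=()):
    """Count the whole seconds in [0, duration) that the model reports as DOWN.

    One ping is sent every `interval` seconds and answered instantly, except
    for the pings whose index (0, 1, 2, ...) is listed in `lost`.
    """
    replies = [t for i, t in enumerate(range(0, duration, interval)) if i not in lost]
    return sum(1 for s in range(duration) if host_state(replies, s, deadline) == "DOWN")


print(down_seconds(6, 5, 60))                # 10: 5 s deadline, no packet loss
print(down_seconds(6, 15, 60))               # 0: 15 s default, no packet loss
print(down_seconds(6, 15, 60, lost={1}))     # 0: one lost reply is tolerated
print(down_seconds(6, 15, 60, lost={1, 2}))  # 3: two lost replies in a row
print(15 / 6)                                # 2.5: deadline in ping intervals
```

In this model a deadline shorter than the ping interval reports DOWN for one second out of every six even with zero packet loss, while the 15-second default only goes DOWN after two consecutive replies are missing.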
I know; I was trying to tweak it to see if it helps, but I also had it at the 15-second deadline once and it still didn’t work. Look at this example of one host going up and down, up and down, within seconds.
I was running a ping while refreshing Checkmk and got no losses, but Checkmk shows the host as DOWN.
Can you show your Smart PING settings? Also, your second screenshot looks very suspicious.
For the default settings, they should look like this.
Why does it look suspicious?
The screenshot shows a host that is always flapping with the default settings, which are 2.5 normal intervals of 6 seconds.
I don’t have any configuration aside from that.
These are the settings I have for all hosts. They are the default ones.
Changing from Smart PING to normal PING made a lot of my hosts go stale, and the flapping problem still persisted.
To me, this situation looks like a networking problem. Maybe some devices are blocking or delaying Smart PINGs. Maybe your round-trip time is also not the best, which would fit your
Can you enable the normal PING check and take a look at your round-trip times? Maybe in your case it’s a good idea to check these hosts much less frequently. We also have some hosts at locations with bad connections, so we raised their check interval to 2 or 3 minutes and also set the round-trip alert much higher (sometimes ICMP takes over 1.5 seconds to return).
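Raising the round-trip alert threshold is what keeps a slow but healthy remote site from alerting. A minimal Python sketch of that idea, assuming the check compares the average round-trip time (RTA) of the replies against warn/crit thresholds; `evaluate_rta`, the sample RTTs and both threshold pairs are hypothetical examples, not Checkmk defaults.

```python
from statistics import mean

# Illustrative only: evaluates the average round-trip time (RTA) of the replies
# against warn/crit thresholds. All threshold values below are hypothetical
# examples, not Checkmk defaults.


def evaluate_rta(rtts_ms, warn_ms, crit_ms):
    """Return (state, average RTT in ms); no replies at all counts as CRIT."""
    if not rtts_ms:
        return "CRIT", None
    rta = mean(rtts_ms)
    if rta >= crit_ms:
        return "CRIT", rta
    if rta >= warn_ms:
        return "WARN", rta
    return "OK", rta


# Five replies from a hypothetical distant site, averaging 1400 ms.
remote_site = [1200, 1350, 1600, 1450, 1400]

print(evaluate_rta(remote_site, warn_ms=200, crit_ms=500))    # tight thresholds
print(evaluate_rta(remote_site, warn_ms=1500, crit_ms=2500))  # raised thresholds
```

The same replies are CRIT with the tight thresholds and OK with the raised ones, even though nothing about the host itself changed.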
If you still want to use the Smart PING function, I would recommend drastically increasing the values @andreas-doehler mentioned.
OK, I think it worked; I see no movement in the “down hosts” tab.
What I do notice now is that I have a lot of stale hosts after switching to normal PING.
Stale means they haven’t answered within 1.5 times your normal check interval (default: 1 minute). Maybe you should investigate your hosts further with this in mind. What is the average round-trip time for these hosts?
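The staleness rule is simple arithmetic. A minimal Python sketch, assuming the 1.5 factor and 1-minute default from the post; `stale_after` and `is_stale` are hypothetical helper names, and treating the boundary as "strictly older than" is an assumption. It also shows that raising a host’s check interval, as suggested earlier, pushes the staleness threshold out with it.

```python
# Sketch of the staleness rule: a result counts as stale once it is older than
# staleness_factor x check_interval. The 1.5 factor comes from the post;
# treating the boundary as "strictly older" is an assumption.


def stale_after(check_interval_s, staleness_factor=1.5):
    """Seconds without a fresh result after which the object turns stale."""
    return staleness_factor * check_interval_s


def is_stale(age_s, check_interval_s, staleness_factor=1.5):
    """True if the last result is older than the staleness threshold."""
    return age_s > stale_after(check_interval_s, staleness_factor)


for interval in (60, 120, 180):
    print(f"interval {interval} s -> stale after {stale_after(interval):.0f} s")
```

With the default 60-second interval a host turns stale after 90 seconds without a fresh result; at a 3-minute interval that grows to 270 seconds.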
Under 1 ms for hosts far away? That doesn’t sound realistic. But if so, your problem seems more likely to be related to Smart PING than to ICMP traffic.
No, no…
What happened was that the flapping was solved by changing to normal PING, but after that a lot of hosts went stale, even the ones near me.