A lot of Flapping hosts every 60 seconds

Hello,

I’m going straight to the point.

Every 60 seconds I get 40 to 50 hosts flapping between the OK and DOWN states, which causes a lot of false positives.
Is there any configuration in checkmk that could potentially be the root cause for this?
I don’t want to turn off flapping detection for those hosts. Also, checkmk says they are flapping, but in reality they’re always UP, so it’s not a switch/firewall/access point problem.

Thank you in advance

Hello,
Did you look at your site statistics/load?

You mean that these hosts are shown as down for a few seconds and then they are up again?
Can you please show some lines from the host events of such a host?

Yeah, they show as down, and if I refresh again they’re gone, and vice versa.
Is there a latency setting or some other configuration that would explain this? Some of these hosts are at fairly remote locations, so their latency is not great. Could that be the cause?
Unfortunately, I can’t show you any information, as it is confidential.

You can remove the hostnames or IPs from the log. What matters is why you get a log entry. Without this information, nobody can say what is happening in your case.

Example from my host event log.


If you remove the hostname, no confidential information is shown.

Here! I left the numbers so you can see the same hosts flapping

Problem found. Why is the deadline of your Smart Ping set to 5 seconds?
It only sends one ping every 6 seconds.
The default deadline is 15 seconds, which means 2.5 normal intervals of 6 seconds.
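
To make the arithmetic concrete, here is a minimal sketch of why a 5-second deadline cannot work with a 6-second send interval. It is a simplified model, not Checkmk's actual implementation: it assumes one probe per interval and that every probe is answered immediately. The function names are made up for illustration.

```python
def intervals_covered(interval_s: float, timeout_s: float) -> float:
    """How many probe intervals fit into the deadline (timeout)."""
    return timeout_s / interval_s


def gap_per_cycle(interval_s: float, timeout_s: float) -> float:
    """Seconds per probe cycle in which the last reply is older than the deadline.

    Assumption: one probe every interval_s seconds, each answered instantly.
    """
    return max(0.0, interval_s - timeout_s)


# 6 s is the send interval from this thread; 5 s and 15 s are the two deadlines discussed.
for timeout in (5, 15):
    ratio = intervals_covered(6, timeout)
    gap = gap_per_cycle(6, timeout)
    print(f"deadline={timeout}s covers {ratio:.2f} intervals, "
          f"{gap:.0f}s per cycle without a fresh reply")
```

In this model the 5-second deadline expires about one second before the next reply can arrive, so even a perfectly healthy host would look down once per cycle. The 15-second deadline tolerates two lost probes in a row.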

I know, that was because I was tweaking it to see if it would help. I had it at the 15-second deadline before, and it still didn’t work. Look at this example of one host going up and down, up and down, within seconds.
I was running a ping while refreshing checkmk and saw no packet loss, but checkmk still showed the host as down.

Can you show your Smart Ping settings? Also, your second screenshot looks very suspicious.
For the default settings, the correct configuration must look like this.


Why does it look suspicious?
The screenshot shows a host that is always flapping with the default settings, i.e. a deadline of 2.5 normal intervals of 6 seconds.
I don’t have any configuration aside from that.



You receive a ping packet exactly at the moment of the timeout. To me, that means your send interval is the same as your timeout interval.

If you think the whole configuration is correct, what happens if you switch this host to a classic ping as its host check?


These are the settings I have for all hosts. They are the default ones.
Changing from Smart Ping to normal ping made a lot of my hosts go stale, and the flapping problem still persisted.

Hi @PedroPereira,

To me this situation looks like a networking problem. Maybe some devices are blocking or delaying the Smart Pings. Maybe your round trip time is also not the best, would fit your

Can you enable the normal ping check and take a look at your round trip times? In your case it may be a good idea to check these hosts less often overall. We also have some hosts at locations with a bad connection, and therefore we set the check interval to 2 or 3 minutes and also set the round trip alert much higher (sometimes over 1.5 sec. for ICMP to return).

If you still want to use the Smart Ping function, I would recommend drastically raising the values @andreas-doehler mentioned.

Ok, I think it worked, I see no movement in the “down hosts” tab.
What I do notice now is that I have a lot of stale hosts after switching to normal ping.

Stale means they don’t answer within 1.5 times your normal check interval (default 1 minute). Maybe you should investigate your hosts further with this in mind. What is the average round trip time for these hosts?
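
As a rough illustration of that rule, the sketch below computes the staleness cutoff from the check interval. The factor of 1.5 and the 1-minute default are the values from this post. The function names are hypothetical and this is not Checkmk's own code.

```python
def stale_after_seconds(check_interval_s: float = 60.0, factor: float = 1.5) -> float:
    """Age of the last check result beyond which a host counts as stale."""
    return check_interval_s * factor


def is_stale(result_age_s: float, check_interval_s: float = 60.0,
             factor: float = 1.5) -> bool:
    """True if the last result is older than factor * check interval."""
    return result_age_s > stale_after_seconds(check_interval_s, factor)


# With the 1-minute default, a result older than 90 s is considered stale.
print(stale_after_seconds())   # 90.0
print(is_stale(75))            # False
print(is_stale(120))           # True
```

So with the default interval, a host whose last check result is more than 90 seconds old would be marked stale, regardless of how fast it answers an individual ping.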

Some are below 1 ms. I think the max value I encounter by far was 20 ms

Under 1ms for hosts far away from the location? Doesn’t sound realistic. But if so, your problem seems more likely to be related to SMARTPing instead of ICMP traffic.

No No…
What happened was that the flapping was solved by changing to normal ping, but after that a lot of hosts turned stale, even the ones that are near me.

These hosts in the picture are UP and not stale, as far as I can tell.

The spider web icon means the hosts are stale, which means they’re not receiving data. At least that’s what checkmk says.