FALSE ALARMS: does anyone NOT set "Maximum number of check attempts" to > 1?

aa777888 · May 18, 2022, 11:12am

I’m new to CheckMk. Over the last few weeks I’ve migrated my NMS from PRTG to CheckMk. I’m monitoring a relatively sophisticated SOHO environment with a wide variety of networking equipment, NAS, IoT and hosts. 25 hosts and 408 services total.

And the number of false alarms is absolutely crazy!

Most are a single bounce. Down at check X. Up at check X+1. Mostly simple stuff: ping down, ping up. Host check down, host check up. HTTP service down, HTTP service up. It’s maddening!

Now, before you tell me to clean up my act, I have certainly already done this. I fixed some Linux and Windows process problems on a few hosts that had logwatch complaining. I found a Wi-Fi configuration problem that was affecting some Pi-based IoT devices. And I have learned about a shaky SNMP implementation on QNAP NAS. And I bulk my email notifications so that these bounces only appear as a single email.

But these service and host check bounces remain.

Surely the rest of you must be suffering through this? How do you handle false alarms? It would seem that the only solution is to use the “Maximum number of check attempts” service and host rules to filter these events out.

andreas-doehler · May 18, 2022, 11:53am

Short answer: no. If I have systems with such behavior, then most of the time there is a deeper problem in the monitoring system or the infrastructure.
I have seen such problems when the CMK environment was running inside Docker containers, as these containers have big problems with dynamic and very quick load changes.
Sometimes it is a network issue, but that should not be the problem in your small environment.

In my systems, the “Maximum number of check attempts” is used to keep unwanted mails to a minimum after a site restart or with something like an unstable WAN connection.
In general, you need to define how long a system or service may keep failing before it should be considered faulty/down; that time divided by the check interval gives you the right number of check attempts.
Here I have a range from 1 to 20 on some systems.
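
The rule of thumb above (tolerated failure time divided by the check interval) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Checkmk code: the function name `max_check_attempts` and its parameters are made up for this example, rounding up and never going below 1 are assumptions, and it assumes retries run at the same interval as normal checks.

```python
import math


def max_check_attempts(tolerated_outage_s: float, check_interval_s: float) -> int:
    """Estimate a value for "Maximum number of check attempts".

    Hypothetical helper: divides the time a host or service may keep
    failing before it counts as down by the check interval.
    Rounding up and the floor of 1 are assumptions of this sketch.
    """
    if tolerated_outage_s < 0:
        raise ValueError("tolerated_outage_s must not be negative")
    if check_interval_s <= 0:
        raise ValueError("check_interval_s must be positive")
    # At least one attempt is always needed to detect a failure at all.
    return max(1, math.ceil(tolerated_outage_s / check_interval_s))


# Tolerate 5 minutes of failures with a 60-second check interval.
print(max_check_attempts(300, 60))   # 5
# Alert on the first failed check.
print(max_check_attempts(0, 60))     # 1
# Tolerate 20 minutes with a 60-second check interval.
print(max_check_attempts(1200, 60))  # 20
```

With a one-minute interval, the 1 to 20 range mentioned above would correspond to tolerating anywhere from no delay up to about 20 minutes of consecutive failures before a problem is reported.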

aa777888 · May 18, 2022, 3:17pm

CheckMk is not running in a container. It’s an Ubuntu Focal Hyper-V image on a very fast Windows machine.

The problems are mostly among printers, IoT devices, and certain hosts that are actually in Docker containers running on a NAS. Most of the network equipment is rock solid, but the WAPs occasionally false alarm.

These HAVE to be false alarms. I say this because I’m still running PRTG in parallel on these problem children to act as a benchmark. PRTG is also running on a Hyper-V machine so same conditions as for CheckMk. PRTG never reports a service or host down while CheckMk does. And, if I go to check on something reported down, it is happily up and operating.

That said, I have no idea how the service monitoring code works in PRTG. But it works a lot better out-of-the-box for false alarms than CheckMk does.

I’m committed to CheckMk. But I’d really like to know why so many tests erroneously fail. At this point I’d say it’s primarily ping (both host checks and explicit checks) and HTTP-related checks. Occasionally there is an SNMP failure, too, but those are rarer.

system · May 18, 2023, 3:18pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.