CheckMK loses its mind when my firewall reboots

I’m using CheckMK RAW edition, and it’s going great. The only issue I have is that whenever I reboot the firewall (routine maintenance) basically every single service I’m polling on the local LAN goes stale, which is what I mean by “losing its mind”.

It makes sense that services reachable via the firewall go stale, but I have no idea why all these services on the local LAN go stale too (the CheckMK server is on the LAN). Can someone please explain what’s going on, and whether there’s a fix for this? And in case you’re wondering - I’m not using the firewall as a DNS server. I’m using the local Windows Domain Controller and DNS server for that, and all the things I poll resolve locally.

What I do at the moment is reboot the VM running CheckMK (in Docker) every time I have an Internet outage or a planned reboot of the Internet router. If I don’t, I have to go to hundreds of individual services, right click and choose “Reschedule check”, or else they never recover. It’s just easier to reboot the whole VM - which is kind of annoying.
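As a possible stopgap short of rebooting the VM, the “Reschedule check” clicks could in principle be scripted through Livestatus. The following is only a rough sketch under several assumptions: the site name `mysite` and the socket path are placeholders, the staleness threshold of 1.5 is assumed to match the site's setting, and passive services are assumed to be refreshed by forcing the host's `Check_MK` service rather than the passive service itself. It has not been tested against the failure described here and may well not clear it.

```python
import json
import socket
import time

# Assumption: Livestatus UNIX socket of a site named "mysite" (placeholder).
LIVE_SOCKET = "/omd/sites/mysite/tmp/run/live"

# Ask for all services the core considers stale (threshold assumed to be 1.5).
STALE_QUERY = (
    "GET services\n"
    "Columns: host_name description check_type\n"
    "Filter: staleness >= 1.5\n"
    "OutputFormat: json\n\n"
)


def build_reschedule_commands(rows, now=None):
    """Turn [host, description, check_type] rows into Livestatus commands.

    Active services (check_type 0) are rescheduled directly. Passive ones
    are mapped to the host's "Check_MK" service, which is assumed to be
    the active check that feeds them.
    """
    now = int(time.time()) if now is None else int(now)
    targets = set()
    for host, description, check_type in rows:
        service = description if check_type == 0 else "Check_MK"
        targets.add((host, service))
    return [
        f"COMMAND [{now}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{now}\n"
        for host, service in sorted(targets)
    ]


def livestatus(request, path=LIVE_SOCKET):
    """Send one request to the Livestatus socket and return the raw reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(request.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def reschedule_stale(path=LIVE_SOCKET):
    """Query stale services and force a recheck; returns the commands sent."""
    rows = json.loads(livestatus(STALE_QUERY, path) or "[]")
    commands = build_reschedule_commands(rows)
    for command in commands:
        livestatus(command, path)
    return commands
```

Calling `reschedule_stale()` as the site user inside the container would send one forced check per affected service or host. If services stay stale after that, it would at least suggest the core itself is stuck rather than merely behind schedule.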

Thanks.

We had basically the same thing happen when we shut down a Windows server. We are running the Enterprise edition, version 2.4.0p14. When we took the server down, Checkmk, like you said, lost its mind. No clue why. We lost everything and had to restore from a snapshot taken about 12 hours earlier, so we lost data. We took the server down a second time to decommission it, and that worked, so we are thinking there was some type of check running against the server we took down and Checkmk did not know how to recover.

Yes it definitely seems that CheckMK does not fail gracefully. 90% of polled hosts are at the local head office (where CheckMK server is), with 10% being at a remote site reachable via VPN (through firewall). When the firewall goes down, the dozen or so hosts at the remote site time out as you’d expect, but then EVERYTHING goes stale (roughly 100 at the local head office).

So it seems that if enough hosts (or certain types of hosts) time out at the same time, this triggers an issue that CheckMK just can’t handle or recover from without taking the whole server down.

I don’t recall this issue in older (v2.2, v2.3) versions of CheckMK. Our VPN is pretty stable as we have redundant links with BGP failover + sub-second BFD to the remote site, and I’m not testing for this fault condition with every update to CheckMK. So really I have no idea when this issue crept in. As a wild guess I’d say in 2.4, but like I said it really is a total guess - it may well have been earlier, and I never noticed.

Hopefully more people can find this post and relate their stories, as devs might not be aware of how widespread it is.

That cannot be correct. You can of course overload your Checkmk server, but your failure mode sounds different. Unless you can see the compute resources being overloaded. Otherwise Checkmk handles lots of down hosts differently than what you describe.
Do the logs reveal anything useful?
Also: Why would you run Checkmk in a container on a VM? Why not install Checkmk natively on the OS?

In the end, this could either be weird behavior from your firewall during reboot that sends Checkmk on the fritz, or it might even be Docker-related. But both things are very hard to pinpoint.

Thanks for taking an interest in this issue.

Our backup vendor charges licences per VM backed up. So minimising the # of VMs minimises costs. A Docker-based setup ensures that application dependency issues are a thing of the past, as each Docker application container has its own independent runtime. Multiple apps can use the same TCP/UDP ports with the way you can easily map ports in the Docker compose file.

When I wanted to run CheckMK, I already had a Debian Linux VM set up as a Docker host with ample spare compute/memory/storage capacity, so the Docker version was a natural fit.

One great benefit though is that I have NGINX Proxy Manager set up on the same host (jc21/nginx-proxy-manager:latest), and I use that for Let’s Encrypt certificates with a wildcard cert, which can be used for all the Docker containers on that Linux VM. And Docker makes it easy, as the proxy can refer to each container by name, so the proxy configuration is very simple.

So in short I use Docker for cost savings, consistency, easy administration, no runtime conflicts, and simple, valid SSL certificates.

In terms of compute resources, one core is often maxed out, but there are always other cores which are idle. RAM is not an issue - I have given CheckMK a lot of RAM.

What logs in particular should I be looking at? There seem to be quite a lot, going by “Checkmk on the command line - Understanding and using commands”.

For core problems I would first take a look at the cmc or nagios log, depending on which core your Checkmk edition uses.
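To narrow a large core log down to the minutes around a firewall reboot, a small filter can help. This is a minimal sketch that assumes the Nagios core log format, where each line starts with a Unix timestamp in square brackets; the cmc log uses a different layout and would need its own pattern. The path mentioned in the note below is an assumption about a typical site layout, not something confirmed in this thread.

```python
import re

# Nagios core log lines look like: "[1700000000] SERVICE ALERT: ..."
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")


def entries_in_window(lines, start, end):
    """Return (timestamp, message) pairs with start <= timestamp <= end.

    Lines that do not match the "[epoch] message" layout are skipped.
    """
    result = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue
        timestamp = int(match.group(1))
        if start <= timestamp <= end:
            result.append((timestamp, match.group(2)))
    return result
```

For example, with the reboot time as a Unix timestamp `t`, opening the site's `var/nagios/nagios.log` (assumed location) and passing the file object along with `t - 300` and `t + 900` would return everything the core logged from five minutes before to fifteen minutes after the reboot.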
