CMK version: 2.2.0p23 OS version: Ubuntu 22.04.5 LTS
I’m using Checkmk 2.2.0p23 (Enterprise Edition) with distributed monitoring and several remote sites (all on the same Checkmk version).
My goal is to get a notification when a remote site becomes unreachable (e.g. the site server is shut down or the network is interrupted).
Since there is no native host created for site status, I attempted to follow community guidance that suggested creating a dummy host on the central site to act as the “status host” for the remote site.
What I did:
Created a host on the central site
Set its IP address to 127.0.0.1 (as it doesn’t represent a real device), or alternatively to no IP address
Assigned the label: site:<remote-site-id> or cmk/site:<remote-site-id>
Set this host as the status host in the remote site’s connection settings
Performed service discovery on the dummy host
Problem:
The dummy host shows up and has a ping check, but no site status service is created
No other services show up, and no notifications are triggered when the site goes down
All I want is to monitor the availability of remote sites from the central site and receive a notification when one goes offline.
Does anyone have a working example or clear steps for this?
On the central site, create a host for the remote site & set it to be monitored from the central site
Give that remote site host the IP address under which the central server can reach it
If you’re only interested in reachability (not services), set “Checkmk agent / API integrations” to “No API integrations, no Checkmk agent”
Basically treat it like any other host. Just think about the question: which site should do the ping? That’s the site the “remote site host” must be monitored from.
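If you have many remote sites, the host from the steps above could also be created through the REST API instead of clicking through the GUI. The sketch below only builds the request without sending it. The server name, site IDs, automation user, secret, host name and IP are placeholders, and the endpoint path and attribute names (`ipaddress`, `site`, `tag_agent` set to `no-agent`) reflect my understanding of the 2.2 REST API, so treat them as assumptions and check them against the API documentation on your own site first.

```python
import json
import urllib.request


def build_create_host_request(
    server: str,
    site: str,
    username: str,
    secret: str,
    host_name: str,
    ip_address: str,
    monitored_by: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a ping-only host.

    All argument values are placeholders; the endpoint and attribute names
    are assumptions based on the Checkmk 2.2 REST API.
    """
    url = (
        f"https://{server}/{site}/check_mk/api/1.0"
        "/domain-types/host_config/collections/all"
    )
    body = {
        "folder": "/",
        "host_name": host_name,
        "attributes": {
            # Address under which the *central* site can reach the remote site.
            "ipaddress": ip_address,
            # Site that does the ping: the central site in this setup.
            "site": monitored_by,
            # Intended to match "No API integrations, no Checkmk agent".
            "tag_agent": "no-agent",
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # Automation user authentication: "Bearer <user> <secret>".
            "Authorization": f"Bearer {username} {secret}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_create_host_request(
        "central.example.com", "central", "automation", "SECRET",
        "remote-site-1", "10.20.0.5", "central",
    )
    # urllib.request.urlopen(req) would send it; the change must then
    # still be activated like any other configuration change.
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to review (or dry-run) the payload before anything touches the live configuration.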
Personally I monitor all remote sites from the central one with regular hosts, including agent installation. The remote sites do not monitor themselves. I see no value in the latter.
Am I right in assuming that this setup only works if the central site can reach the remote site directly?
The Checkmk servers at my remote sites only have local IPs (in a different private network from my central site). The Livestatus connection between the central and remote sites goes through a public firewall IP address at each remote site, which then NATs the traffic to the local Checkmk server at that site.
Using the public IP address (NAT/firewall IP) results in false positives, because the firewall responds to ping even if the remote Checkmk server is down.
Maybe there is a way to monitor the remote site using Livestatus instead of ping? Or am I still missing something here?
If you want to do the normal ping check, then sure, the remote site must be directly reachable from central. However, there are at least two ways to go about it in situations such as yours:
Method 1: Make the agent reachable
Configure the remote network’s firewall to forward the agent port 6556 to the remote site’s host, just like the Livestatus port (6557) was forwarded
On central, configure the remote site’s host with the remote firewall’s public IP & to use the regular Checkmk agent
On central, create a rule of type “Host check command”, match it to the remote site’s host, and choose “Use the status of the Checkmk Agent” as the method of checking the host’s reachability
Method 2: Implement VPNs
Run VPN tunnels from each remote site to the central site (I use WireGuard for this; OpenVPN also works well & might be a tad easier to set up)
Use the VPN tunnel’s remote IP address in the remote site host’s configuration on the central site.
Method 1 changes how Checkmk determines a host’s state. Instead of pinging the host directly, it relies on the status of the “Checkmk agent” service that’s created when the remote host has an agent installed & registered. This still requires a connection from the remote site to the central site (port 8000, the “agent receiver” port, plus a couple of others during registration). One advantage is that it’s rather easy to implement, especially on the remote firewall, as you only need to forward one additional port.
Method 2 has the advantage that, once it’s set up, you can run any kind of traffic between the central & remote sites without having to battle the firewalls in between.
No matter which way you go, you need some kind of direct connectivity for best results. Sure, you could also whip up some kind of nasty indirect scheme such as:
Rent a cheap VM somewhere reachable from both ends
On the remote site, implement a timer that uploads a stamp file to the newly rented VM
On the central site, implement a local check that downloads the stamp file, compares its timestamp against the current time, and sets the service state accordingly if the difference becomes too large
This will work, although it’s quite a hassle to set up & maintain (think not just of yourself, but also of the colleagues to whom you’ll have to explain this wacky kind of setup).
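To make the stamp-file scheme a bit more concrete, here is a minimal sketch of the central-side local check. Everything specific in it is an assumption: the stamp URL on the rented VM, the service name, the 300/900-second thresholds, and that the remote timer writes a plain Unix timestamp into the file. Because it compares the remote timestamp with the central clock, it also assumes both machines are NTP-synchronised.

```python
#!/usr/bin/env python3
"""Hypothetical Checkmk local check for the stamp-file heartbeat scheme."""
import time
import urllib.request

# Placeholder URL of the stamp file on the rented VM (assumption).
STAMP_URL = "https://relay.example.com/stamps/remote1.txt"
WARN_SECONDS = 300  # assumed thresholds; tune to the remote timer's interval
CRIT_SECONDS = 900
SERVICE_NAME = "Remote_site_remote1_heartbeat"  # hypothetical service name


def evaluate(stamp: float, now: float,
             warn: int = WARN_SECONDS, crit: int = CRIT_SECONDS) -> str:
    """Return one local-check line: <state> <service> <metrics> <details>."""
    age = int(now - stamp)
    if age >= crit:
        state, label = 2, "CRIT"
    elif age >= warn:
        state, label = 1, "WARN"
    else:
        state, label = 0, "OK"
    # Metric format: name=value;warn;crit
    return f"{state} {SERVICE_NAME} age={age};{warn};{crit} {label} - last stamp {age}s ago"


def fetch_stamp(url: str, timeout: float = 10.0) -> float:
    """Download the stamp file and parse its content as a Unix timestamp."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return float(resp.read().decode().strip())


def main() -> None:
    try:
        stamp = fetch_stamp(STAMP_URL)
    except Exception as exc:  # network error, HTTP error, unparsable content
        print(f"2 {SERVICE_NAME} - CRIT - could not read stamp file: {exc}")
        return
    print(evaluate(stamp, time.time()))


if __name__ == "__main__":
    main()
```

Placed (executable) in the agent’s local check directory of a host monitored by the central site, on Linux typically `/usr/lib/check_mk_agent/local/`, its output line should show up as a service after discovery, and a missing or stale stamp turns it WARN or CRIT, which is what the notification would hang off.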
PS: the “Host check command” rule doesn’t offer a method that uses the status of the Livestatus connection. It does, however, have a “TCP Connect” method. You might want to try connecting to the Livestatus port that’s already forwarded, though I cannot say whether that’ll work well or at all; I haven’t tried it with Livestatus yet.
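Before setting up such a rule, you could test from the central server whether a TCP connection to the forwarded port behaves the way you want. This sketch only checks that a TCP handshake completes; it does not speak the Livestatus protocol. The firewall IP is a placeholder, and 6557 is Livestatus’s default TCP port, so substitute whatever port you actually forwarded. Whether this avoids the false positives seen with ping depends on the firewall really forwarding the connection to the Checkmk server rather than answering it itself.

```python
import socket


def tcp_connect_ok(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


if __name__ == "__main__":
    # Placeholder public firewall IP (documentation range) and Livestatus port.
    if tcp_connect_ok("203.0.113.10", 6557):
        print("port reachable")
    else:
        print("port NOT reachable")
```

Shutting down the remote Checkmk server (or just stopping its site) while running this from central should show whether the check flips as expected.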
I suspected something like this, but would have struggled to implement it without your guidance. The part about choosing “Agent Status” as the host check method was especially new to me.
Definitely worthy of being added to the official documentation for Distributed Monitoring
I will give Method 1 a try and let you know how it goes.