Checkmk HA-cluster

Hey,
so just to explain our infrastructure:
We have multiple sites with virtualization hosts, VMs, servers, etc.
The Checkmk server runs in site 1 and monitors all servers of the other sites via VPN and external IP.
Working fine, except when that site goes offline. Then we no longer have monitoring for all sites.
Everything is administered and configured from site 1.

I would like to implement a cluster of Checkmk servers, but the distributed monitoring feature isn’t much use here, because if site 1 is down there is no way to check the state of site 1 from site 2.

Is there any way to set up something like an HA cluster of Checkmk servers across multiple independent sites?

Thanks in advance

The appliance has HA functionality builtin.

But without the appliance, with only Checkmk Enterprise on a Linux VM, it is not possible?


Possible, yes, but you’ll have to build everything yourself.

Hence most people just use VMware HA if needed.

However, if you build it yourself, you can use the old tutorial by Simon Meggle @simonm, at least as an inspiration. I assume he has his reasons for taking it down, though; Pacemaker and DRBD have probably changed quite a bit since he first published it 10 years ago :smiley:
So use with caution: Nagios/OMD-Cluster mit Pacemaker/DRBD - Teil1 - Simon Meggle


Cross monitoring of sites does not work? That seems odd. Can’t you monitor the server of site 1 from site 2?

@n3m0 Do yourself a favour and invest the time in something other than clustering OMD/CMK on your own.
After I published these blog articles, I ran this setup successfully for 1.5 years. But it required a lot of understanding to operate the cluster reliably. Managing the cluster resources can be tricky, and once you have had a DRBD split brain you know what I mean…
Like Gerd recommended, better use existing solutions like vSphere HA. Then your monitoring runs on top of that HA layer.


Why do you want to do that? I’m not sure what you want to achieve in general.

Our “main site” does not monitor any hosts, it’s just there for configuration. It’s generally a bad idea to have monitored hosts on the main site.

As you have VMs, set up one VM in distributed monitoring for each site, including site 1. This is the purpose of distributed monitoring. If your main server is down, monitoring still works, and people on each site can use the local site(s).

Checkmk does not have any HA features, and the appliance does not have real HA functionality either; it has a very basic active/passive setup.


That does not give you HA. Your VM will still be down while it’s being powered on on the other node. It also does not cover any issues inside the VM.

If you mean vSphere FT, that is a terrible idea for something as CPU and disk intensive as Checkmk.

I did not mean FT :slight_smile: but actually VMware HA. Sure, it has its limitations, but I suppose - depending on your point of view - the simplicity outweighs the potential issues you might run into with a Pacemaker/DRBD setup. (We’ve had customers with a DRBD setup that would routinely somehow create hundreds of FS errors on the /omd partition, which Red Hat support couldn’t solve… so they went back to VMware HA, where the vmdk sync is more stable.)


In a VMware scenario it’s easier as you don’t sync any data, you rely on shared storage. We actually even tried that with Checkmk (having a SAN available), but the main issue with Checkmk (at least in larger installations) is the number of open files that are constantly read and written, which makes it hard to keep disk latency down.

The same problem exists with pacemaker/drbd - or was even worse IIRC.

In a smaller setup this might not be so relevant, but we ended up on physical servers with SSD drives to be able to cope, and are running active/passive - knock on wood, but this setup has worked for over 5 years. We just rely on rsync and keepalived for distributed monitoring.


I could, but just for HA I would need a second Checkmk server on site 2 that only monitors Checkmk server 1 on site 1? Or would you suggest monitoring every system from two independent Checkmk servers on two different sites?
Seems odd to me tbh.

The problem with vSphere HA is that the sites are physically separated, so we can’t get HA working properly due to several problems (e.g. no shared storage, …).
So I have to think of another solution to keep track of whether there is a problem with the site running the Checkmk server.
Thanks!

So if I’m monitoring multiple sites via distributed monitoring, I have multiple sites. All fine so far, but in any case I have one site which acts as the central or main Checkmk server. All other sites are only remote instances. If the main Checkmk server is down for any reason, I can’t check its state via the remote instances, because they only display the stuff registered on their own site.

To summarize, I need a way to view the status of all other Checkmk servers, and of the hosts registered on them, from any one of the X Checkmk servers, no matter which one is down. That is my main need.

So you have to, as we have already suggested, buy the virtual appliance or run your master node in active/passive (but that’s not supported and you have to set it up yourself).

We run 40 sites or so in distributed monitoring, and our main site goes down about once every three years. Unless you run your main site on some crappy hardware or virtualization layer, your main site won’t go down. Most of the issues with a site going down relate to various memory leaks, Checkmk helper usage, etc.; your config node won’t have these problems.

Also bear in mind that the appliance or your own HA won’t cover the majority of the cases in which a Checkmk site goes down, for example issues with Livestatus, memory leaks, etc.

Also bear in mind that your slave sites will still send notifications and alerts via email or any notification integration like Slack (unless you proxy them through the main site), even if your config node is down.


I have two more approaches:

  1. Create a read-only site in the other location that connects to all remote sites via Livestatus, without configuration replication. That way you have visibility until you have restored your central site.
  2. Run regular backups of your central site, transfer them to an off-site server, and restore them there (without starting the site), automated to run every X hours. That way you have a “lukewarm standby” that you only need to start in case the central site fails.
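
To give an idea of what the first approach boils down to, here is a minimal Python sketch that asks each remote site for its host states over Livestatus. It is not a replacement for a real read-only Checkmk site, just an illustration. Assumptions: the host names in `SITES` are made up, Livestatus is reachable via TCP on port 6557 on each remote site, and the connection is plain TCP. If Livestatus encryption is enabled on your sites, the socket would additionally need to be wrapped with TLS.

```python
import json
import socket

# Hypothetical remote sites: (label, host, Livestatus TCP port). Replace with your own.
SITES = [
    ("site2", "cmk-site2.example.com", 6557),
    ("site3", "cmk-site3.example.com", 6557),
]

HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def build_query(table, columns):
    """Build a Livestatus query that returns the given columns as JSON rows."""
    lines = [
        f"GET {table}",
        "Columns: " + " ".join(columns),
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def query_site(host, port, query, timeout=5.0):
    """Send one query over plain TCP (assumed unencrypted) and parse the JSON reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8"))
        # Close the sending side so the server knows the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def summarize(rows):
    """Count hosts per state from rows of [name, state]."""
    counts = {}
    for _name, state in rows:
        label = HOST_STATES.get(state, "UNKNOWN")
        counts[label] = counts.get(label, 0) + 1
    return counts


def main():
    query = build_query("hosts", ["name", "state"])
    for label, host, port in SITES:
        try:
            print(label, summarize(query_site(host, port, query)))
        except (OSError, ValueError) as exc:
            # Connection problems or an unparsable reply: report the site as unreachable.
            print(label, "unreachable:", exc)
```

Calling `main()` from a machine in the second location prints one summary line per remote site, or an "unreachable" line if a site cannot be queried, so you still see something while the central site is being restored.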