Availability / Clustering / Load Balancing

tjbrumfield · March 18, 2024, 8:01pm

We currently run a single Checkmk instance in a Kubernetes cluster, and it crashes frequently. We are considering rebuilding it on a VM instead of in a container to see whether that improves availability.

With other applications, we’d start up multiple servers behind a load balancer for availability. Is it possible to do this with Checkmk?

The documentation talks about Distributed Monitoring with Livestatus, but that seems to be just manually splitting the monitoring up into multiple sites on multiple servers. If one of those sites or servers went down, we would lose monitoring for everything it covers. Is it possible to cluster or load balance an entire Checkmk site across multiple servers to increase availability?

elias.voelker · March 18, 2024, 9:31pm

Hi TJ,

That is possible. Checkmk itself is agnostic as to whether it runs in a cluster or not; the clustering happens at the OS level.

Here is some documentation on how to do this using the Checkmk appliance:

But you can achieve the same thing with the tools your hypervisor or cloud provider of choice gives you.

HTH
Elias

Anders · March 18, 2024, 10:01pm

I’d recommend reading previous posts here, as the topic has been discussed multiple times and you will find all your answers there.

tjbrumfield · March 19, 2024, 8:34pm

I read through that documentation before posting.

That documentation is for physical appliances and it says it is not recommended to cluster two virtual machines.

Is there an actual recommended strategy for high availability without physical appliances?

andreas-doehler · March 19, 2024, 8:36pm

Easy solution → have a high availability virtualization platform, that’s all.
Or do you only need the high availability for a planned maintenance failover?

elias.voelker · March 19, 2024, 9:11pm

Yes. Use the tools your hypervisor or cloud provider of choice gives you. The possibilities are both endless and vendor-specific, so we couldn’t document all of them.

tjbrumfield · March 20, 2024, 3:14pm

The hypervisor provides some level of availability: if the underlying hardware has a problem, the VM can be moved with vMotion or a similar feature.

But a single VM can still have software problems inside the guest and lose availability even though the hardware is fine. Clustering could help with that, but the documentation says that clustering is not recommended for virtual machines with Checkmk.
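
To illustrate the gap: a hypervisor watches the hardware, so a failure inside the guest would have to be caught by some external probe. Below is a minimal sketch of that idea, not anything taken from the Checkmk documentation. The function names, the threshold of three consecutive failures, and the URL in the usage note are all hypothetical, and what "failing over" actually means is left to whatever the platform offers.

```python
from urllib.request import urlopen
from urllib.error import URLError


def site_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` answers with a 2xx/3xx status.

    Hypothetical helper: `url` is assumed to be the monitoring site's web
    address. Any connection error, timeout, HTTP error, or malformed URL
    counts as "down".
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, OSError, ValueError):
        return False


def should_fail_over(history: list[bool], threshold: int = 3) -> bool:
    """Decide on failover from a list of past probe results (oldest first).

    Returns True only when the last `threshold` probes all failed, so a
    single blip does not trigger a failover. The threshold is an assumption.
    """
    if len(history) < threshold:
        return False
    return not any(history[-threshold:])
```

A scheduler outside the VM could call `site_is_up("https://monitoring.example.com/mysite/")` (placeholder URL) once a minute, append the result to a list, and act when `should_fail_over` returns True. This only detects the failure; it does not solve the question of where the standby instance gets its data from.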