We have one CheckMK instance right now in a Kubernetes cluster and it is frequently crashing. We are looking at rebuilding it on a VM as opposed to in a container to see if it helps with availability.
With other applications, we’d start up multiple servers behind a load balancer for availability. Is it possible to do this with CheckMK?
The documentation talks about Distributed Monitoring with LiveStatus, but that seems to be just manually splitting up the monitoring into multiple sites on multiple servers. If one of those sites or servers were to go down, we would still lose monitoring for the hosts it covers. Is it possible to cluster or load balance the entire CheckMK site across multiple servers to increase availability?
Easy solution → run it on a high-availability virtualization platform, that’s all.
Or do you only need the high availability for a planned maintenance failover?
Yes. Use the tools your hypervisor or cloud provider of choice gives you. The possibilities are both endless and vendor-specific, so we couldn’t document all of them.
The hypervisor provides some level of availability: if the underlying hardware has a problem, the VM can be moved to another host with vMotion or a similar feature.
But a single VM can still have software problems inside the guest and lose availability that way. Clustering could help with that, but the documentation says that clustering is not recommended for virtual machines with CheckMK.
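To make the gap concrete: hypervisor HA reacts to host or hardware failure, but something still has to notice when the software inside the VM stops answering. Below is a minimal sketch in Python of such a probe. It assumes the site has Livestatus enabled over TCP on port 6557 (the usual default when it is switched on); the function names and the hostnames in the usage note are hypothetical and only for illustration. It checks reachability only and does not perform any failover itself.

```python
import socket


def site_is_reachable(host: str, port: int = 6557, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 6557 is an assumption: it is the usual Livestatus TCP port,
    and only works if Livestatus over TCP is enabled for the site.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def first_reachable(hosts, probe=site_is_reachable):
    """Return the first host for which probe(host) is true, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None
```

For example, first_reachable(["checkmk-a.example.com", "checkmk-b.example.com"]) would return the first of the two (hypothetical) servers that accepts a connection, or None if neither does. A check like this only tells you that the port answers, not that monitoring is healthy, so treat it as a starting point for an external watchdog rather than a replacement for real clustering.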