Application Cluster for checkmk itself

I was surprised not to find any application-level solution for setting up a cluster of checkmk itself. IMHO this should be available for a product aimed at business-critical use. So I don’t want to monitor a cluster, I want checkmk to be a cluster “itself”.

We currently have the following setup:
Here we are happy:
Management server that just syncs config changes to the distributed boxes. Here we can afford downtime.
Frontend servers for the end users. These boxes only open a livestatus connection. Apart from notifications and reports they are redundant. Here we are also fine.

Here we are unhappy:
With our distributed monitoring. In case of downtime we lose at least the performance data (when we restore from an older backup). Even worse: during the rebuild of such a machine, which can take up to an hour even though it is about 95% automated, the checks are simply unavailable.
Now we would like a solution on the application layer that lets us use two checkmk instances with different IPs and without any DNS switch.
We would simply like to sync these “backend” instances and use them in parallel, so that there is no downtime for the users, just like we can already do with the frontend systems.

So we would simply like the following: if distributed instance1 is gone, just use distributed instance2.
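To illustrate what I mean by a solution on the application layer, here is a minimal Python sketch (only an illustration of how a frontend could fall back, not something checkmk ships): it sends a livestatus query to instance1 and, if that instance is unreachable, asks instance2 instead. It assumes both instances monitor the same hosts and expose livestatus via TCP on the usual port 6557; the IP addresses are placeholders.

```python
import socket

# Livestatus TCP endpoints of the two distributed instances
# (placeholder IPs; 6557 is the usual port for livestatus exposed via TCP).
BACKENDS = [("10.0.1.10", 6557), ("10.0.1.11", 6557)]

# A plain livestatus query: table "hosts", two columns, JSON output.
QUERY = b"GET hosts\nColumns: name state\nOutputFormat: json\n\n"


def query_livestatus(query: bytes) -> bytes:
    """Return the answer of the first backend that responds."""
    last_error = None
    for address in BACKENDS:
        try:
            with socket.create_connection(address, timeout=5) as sock:
                sock.sendall(query)
                sock.shutdown(socket.SHUT_WR)  # tell livestatus the query is complete
                chunks = []
                while True:
                    data = sock.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
                return b"".join(chunks)
        except OSError as exc:
            last_error = exc  # instance unreachable, try the next one
    raise RuntimeError(f"no livestatus backend reachable: {last_error}")


print(query_livestatus(QUERY).decode())
```

The fallback itself is trivial; the part that is missing is keeping the two backend instances in sync so that both can answer the same queries.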

I know about Corosync and DRBD. They were cool tools in the past, but in my case I don’t want to use them any longer.

Maybe have a look at the checkmk appliance. It also uses Corosync and DRBD, but you don’t need to care about that; everything is configured in a nice web GUI.

regards

Michael


The price for the (virtual) appliance is quite low compared to the time needed to get a Corosync/DRBD cluster running by yourself.

The best way for us would be to run it in an OpenShift environment.

I know there is a Docker image, but just one instance makes the whole environment less available.

For the distributed sites, you could run them as pods in a Kubernetes environment. Expose them via an Ingress (or a Route in OpenShift) and use a persistent volume for config & data storage.
Then you don’t have to worry about different IPs, as the Ingress plus Service inside Kubernetes will always direct traffic to the correct pod. There will still be downtime, though, as during an update or restart both pods cannot run at the same time while accessing (or rather writing to) the volume.
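As a rough sketch of what one such site could look like: the official checkmk/check-mk-raw image serves its web UI on port 5000 and keeps the site data below /omd/sites, so a Deployment with a single replica, a Recreate strategy (to avoid two pods writing to the volume during an update) and a PersistentVolumeClaim is roughly all that is needed. This sketch creates it with the Python Kubernetes client; the namespace monitoring, the label app: cmk-site1 and the claim name cmk-site1-data are only assumptions, and the Service plus Ingress/Route are left out.

```python
# Minimal sketch: one checkmk site as a single-replica Deployment.
# Assumes an existing PersistentVolumeClaim "cmk-site1-data" in namespace "monitoring".
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in the cluster

labels = {"app": "cmk-site1"}

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="cmk-site1", labels=labels),
    spec=client.V1DeploymentSpec(
        replicas=1,
        # "Recreate" makes sure the old pod is gone before the new one mounts the volume
        strategy=client.V1DeploymentStrategy(type="Recreate"),
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="checkmk",
                        image="checkmk/check-mk-raw:latest",
                        ports=[client.V1ContainerPort(container_port=5000)],  # web UI
                        volume_mounts=[
                            client.V1VolumeMount(name="sites", mount_path="/omd/sites")
                        ],
                    )
                ],
                volumes=[
                    client.V1Volume(
                        name="sites",
                        persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                            claim_name="cmk-site1-data"
                        ),
                    )
                ],
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="monitoring", body=deployment)
```

A Service selecting app: cmk-site1 plus an Ingress (or Route) on top of it then gives the frontends a stable name that is independent of the pod IP.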