Master/Slave? Backup VM? Live Status? I need a resilient backup

I want to configure a 'backup' Checkmk instance in case any issues arise with the production one, so we can at least see the live status of hosts.

I've read a few forum posts about a read-only Livestatus instance and even a master/slave setup.

I can't seem to find any official documentation on these.

What's the best solution for this? Would it simply be to create a new VM and restore a Checkmk backup to this server? Or to set up the aforementioned master/slave or read-only Livestatus scenarios?

How do I configure the following which sounds like what I need here?

“For a Slave site use every time a “blank” site and connect this site to the Master.”
Distributed Monitoring - Slave Server Role, Wato Folder and Criticallity problem - Troubleshooting - Checkmk Community


Hello,
the appliance may solve this problem for you with its integrated cluster feature.
ralf

Is this what I'm looking for? Configuring a distributed monitoring 'slave' pointing at the blank new site?

Is there any documentation on setting this up?
i.e. do I configure this connection on the main site and push the config to the blank new site?
"Use Livestatus" - does this mean it will therefore be a read-only backup server?

The appliance is not an option at this time.

Distributed monitoring - Scaling and distributing Checkmk

Would this be the best option for a resilient 'backup': using Livestatus and a read-only, distributed remote instance of Checkmk?
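For what it's worth, Livestatus itself is purely a query protocol: a viewer connected over it can read status but never change anything. A minimal sketch of a query, assuming a local site and the default socket path (both assumptions, not details from this thread):

```shell
# Run as the site user on a Checkmk server. Livestatus only answers
# GET-style queries over this socket, which is why anything built on it
# (a status viewer, a standby GUI) is inherently read-only.
echo -e "GET hosts\nColumns: name state\n" | unixcat ~/tmp/run/live
```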

Looking more into this, it seems distributed monitoring isn't the right solution. But maybe a simple backup and restore to the backup server/installation of Checkmk would suffice?

Exactly, not for your needs I think.
You can achieve your goal with the help of livedump or cmcdump.
Here, Distributed monitoring - Scaling and distributing Checkmk,
is some more information on how to implement it.
In the end you get a second site that is really read-only and shows, after a crash of the main monitoring site, the last status of that system.
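From memory, the livedump flow from that article looks roughly like this; the command options, file names and transfer step are assumptions, so check the linked documentation before relying on them:

```shell
# On the active site: export the host/service definitions once (repeat
# whenever hosts change), and dump the current check states regularly.
livedump -C > config.cfg     # configuration for the standby site
livedump > state.dump        # current host/service states

# Ship both files to the standby site, e.g. via scp (hostname made up):
scp config.cfg state.dump standby.example.com:/tmp/

# On the standby, the config is loaded as object definitions and the
# state dump is fed in as passive check results - so the standby only
# displays the last known status and never actively checks anything.
```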

This doesn’t sound like it’s something I can use for this either to be honest.

What I really want is to have a second replicated site on a different server in case of failure.

Is there any setup that can do this replication?

The only thing I can think about is to literally create a backup and then restore it onto the other (backup) server. But this would then have to be carried out regularly.


The question here is: what do you really want?
If it should work as a real failover site then you need to build something like the appliance setup with mirrored storage (DRBD or something else).
I have some systems with a normal heartbeat/corosync setup that work as an active-active cluster. There, every site is a cluster resource and you can move it freely between the cluster nodes.

If you want to do this with only one site then the virtual appliance is the easiest way, as @rprengel mentioned. The time you need to set up such a self-built cluster costs far more than the appliance does.

So we did have a virtual appliance, but my colleague said there are issues with this in Azure. We are moving to Azure.

High level - is it possible with the Checkmk Cloud Edition to set up clustering for replication, or is this only possible with the physical appliances?

How do the agents communicate with the checkmk server?

For example, if we had a replicated server in Azure and we started using the backup server (which has a different IP address), would that still be able to bring in live data from the hosts? Or would that break because the server would have a different IP address?

This has nothing to do with the CMK edition directly - more with the OS under it.
I don't know if the CMK team is working on a way to deploy the appliance inside Azure.
It would also be problematic to set up DRBD between Azure and on-premise, I think.

If you use TLS encryption then it should be no problem for normal agent hosts.

Personally, I would not build such a setup with a "backup" instance inside any cloud provider. An instance inside a cloud provider I use to monitor that cloud infrastructure, and that's it.

I guess it’s something I need to talk to infrastructure about then as I don’t fully understand how replication and backups and potential downtime works in Azure.

But in theory, with the TLS encryption you mentioned above, the agents would still be able to talk to the Checkmk instance on a backup server even if the IP address is different, and therefore still return live data on pull?

If it is the same instance, yes - as it is if you use something like DRBD or any other replication method. For a different instance you need to register the agent to both instances.
I have one system working this way - the second CMK server is only a "hot standby" and has only the configuration, RRD data and autochecks replicated. If the primary monitoring server goes offline, you need to manually change the monitoring instance inside the main folder of the config to the second instance and push the config to this machine.
That's a three-machine setup - web frontend, worker1 and worker2.
But such a setup is a little bit unique, as it is the only one among my over 150 CMK systems.
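For the "register the agent to both instances" step with the TLS agent controller (Checkmk 2.1+), a hedged sketch - the host name, server names, site name and registration user are all made-up examples:

```shell
# Register the same host with the primary site...
cmk-agent-ctl register --hostname myhost \
    --server primary.example.com --site prod \
    --user agent_registration

# ...and again with the standby site, so the encrypted agent will answer
# pulls from either server (the command prompts for the password).
cmk-agent-ctl register --hostname myhost \
    --server standby.example.com --site prod \
    --user agent_registration
```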

We are thinking about using high availability sets in Azure.

So my next question would be:
If the service went down in this scenario, what happens to the alerts that are triggered during that time? In other words, do they get lost, or does Checkmk re-check everything after coming back online?
Is there a cache of information in the agent that would then be sent to Checkmk when the service is available again?

The same question is covered in this thread: Clustering CheckMk - #3 by Anders


This sounds like a possible solution. Is there any documentation on configuring keepalived and rsync, as it could be a solution for us?

@Anders OK, so I've had a read about keepalived and rsync.

From my basic understanding, keepalived is used to transfer the same IP address from the master to the backup Linux server should a failure occur. This means that all agents will still be communicating with essentially the same server and will be able to deliver live data again as if it were the master.

rsync is then used, I presume, to periodically send a copy of all the Checkmk directories to the backup server.

Is that how this is working for you? I presume, therefore, that a full copy of the directories will hold all the configuration and host data, so that when the backup server is started, Checkmk is in the same state?

Would there not be files locked while the main site is up that couldn't be copied?
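To make the idea concrete, a hedged sketch of such a pair - the interface name, virtual IP, site name and schedule are all assumptions, not details from this thread:

```
# /etc/keepalived/keepalived.conf on the master; the backup server runs
# the same block with "state BACKUP" and a lower priority. Agents always
# talk to the virtual IP, so a failover keeps the server address stable.
vrrp_instance CMK_VIP {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    virtual_ipaddress {
        192.0.2.10/24
    }
}
```

Replication could then be a cron job on the master running something like `rsync -a --delete /omd/sites/prod/ backup-host:/omd/sites/prod/` every few minutes - though, as asked above, lock files, the live socket and half-written RRDs on a running site can make a naive copy inconsistent.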

Can’t you just move the IP from the old instance to the new one?

This is what I do, also because all my hosts have a firewall rule (ufw allow from [checkmkIP] to any port 6556 proto tcp) that allows connections on port 6556 only from my Checkmk server.

For disaster recovery I’d just create backups with omd backup and restore them to a new instance if ever needed…
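A minimal sketch of that disaster-recovery flow - the site name and paths are examples, and the commands run as root:

```shell
# On the production server: snapshot the whole site into one archive
# (configuration, historic metrics, everything under /omd/sites/prod).
omd backup prod /var/backups/prod.tar.gz

# Copy the archive off the box, then on the replacement server:
# restore recreates the site from the archive, and start brings it up.
omd restore /var/backups/prod.tar.gz
omd start prod
```

To keep the archive reasonably fresh this would have to be scheduled (e.g. from cron), which matches the "carried out regularly" point earlier in the thread.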
