Ideas for HA / DR

Good day!

I’m currently running CEE 1.6.0p9 on the virtual appliances. I have one master node and two slaves. One slave sits in my secondary data center and monitors everything there, while the other slave monitors everything in my primary data center along with all of my field offices’ equipment. I’m trying to figure out the best way to set up either high availability or disaster recovery for this environment. I did look at the clustering capabilities of the appliances, but I’m not sure whether clustering requires a layer 2 connection between the appliances, and I don’t see that anywhere in the documentation. What I was thinking of was simply replicating the master server to our secondary data center daily with Veeam and, in the event of a disaster, bringing the replica online and changing the monitoring source for all of my field equipment to the slave in my secondary data center. It’s not elegant, but it should work.

I’m posting here to see if maybe someone has a solution that would require less work to be done when disaster strikes.

Thanks in advance.

You need two separate layer 2 connections between the appliances to create an HA active/passive cluster: https://checkmk.com/cms_appliance_usage.html#Failover%20cluster

Under the hood, a Corosync/Pacemaker cluster with DRBD is set up. DRBD is especially picky when it comes to latency.

If you go with the cold standby solution, just make sure that all your monitored hosts accept both the primary IP and your cold standby IP as a request source (check the Checkmk agent and SNMP agent configuration and/or local firewalls).
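To verify this ahead of time, something like the following could be run from the cold standby's IP once the replica has been brought up for a test. It is only a rough sketch: it assumes the agents listen on the default TCP port 6556 and send data as soon as a connection opens, and the function names are made up for illustration. It does not cover SNMP.

```python
import socket

AGENT_PORT = 6556  # default TCP port of the Checkmk agent (assumed unchanged)


def agent_reachable(host, port=AGENT_PORT, timeout=3.0):
    """Return True if host:port accepts a TCP connection and sends any data.

    A host that restricts request sources (for example via xinetd's
    only_from) may accept the connection and close it right away, so an
    empty read is treated as "not reachable from this source".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return len(conn.recv(64)) > 0
    except OSError:
        # Covers refused connections, timeouts and resets.
        return False


def check_hosts(hosts, port=AGENT_PORT, timeout=3.0):
    """Map each host name to the result of agent_reachable()."""
    return {host: agent_reachable(host, port, timeout) for host in hosts}
```

Calling `check_hosts(["host-a", "host-b"])` (placeholder names) from the standby returns a dict of host to True/False, and every False entry is a host whose agent config or firewall still needs the standby IP added.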

OK, so this option is out of the question, since I only have a layer 3 connection between the data centers. Is it possible to connect two “master” servers, if you will, to the same distributed monitoring slave? Or would I be stuck with just replicating the master to my secondary data center and then updating the site that does the monitoring in WATO?

Distributed WATO allows only one central site for configuration as everything on the remote site gets overwritten.

You can connect any number of sites to view monitoring data from a single remote site, but that is not something you need in your situation.

A cold standby, by whatever means, would be the solution for you.
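If you want less manual work when disaster strikes, the step of moving the field equipment to the other slave could be prepared as a script. The sketch below only builds the requests and does not send anything. It assumes the 1.6 Web API (`webapi.py` with the `edit_host` action and an automation user) and that a host's monitoring site is set through the `site` attribute; please check both against the Web API documentation for your version. The function name, URL, user, and site ID are placeholders.

```python
import json
from urllib.parse import urlencode


def build_edit_host_request(base_url, username, secret, hostname, new_site):
    """Build (url, form_body) for a Web API call that re-homes one host.

    base_url is the central site's URL, e.g. "https://server/mysite"
    (placeholder). Assumption: the "site" host attribute selects the
    site that monitors the host.
    """
    query = urlencode({
        "action": "edit_host",
        "_username": username,
        "_secret": secret,
    })
    payload = {"hostname": hostname, "attributes": {"site": new_site}}
    body = urlencode({"request": json.dumps(payload)})
    return f"{base_url}/check_mk/webapi.py?{query}", body


def build_failover_requests(base_url, username, secret, hostnames, new_site):
    """One request per field host, all pointing at the surviving slave."""
    return [
        build_edit_host_request(base_url, username, secret, name, new_site)
        for name in hostnames
    ]
```

Each pair would then be sent as an HTTP POST to the replica once it is online, followed by an activation of the changes. For a handful of hosts, doing the same change by hand in WATO is probably just as quick.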

And in this case, the cold standby would be the offline replica of my central site. So I’m looking at a design like this, then.
