Master/Slave - backup/recovery/resiliency

Hi,

I am just putting together my CheckMK solution, and was wondering how people are managing their CheckMK implementations out in the real world?

1/ How are people creating HA with a non-appliance implementation?
2/ What is your backup and recovery strategy for Masters and Slaves?
3/ Following on from point 1, are people using a Cloud VM as a Master to work around having to think about and configure HA, i.e. just taking a Master backup and then doing a recovery after any failure?

Really appreciate any replies.

1 - The master runs inside an HA environment such as VMware.
2 - For the master, back up the whole site, as all the config data is there. For slaves, the more important data is the event and history data. If you already have a good backup system, include all your machines in the normal backup you use for your other systems.
3 - “Cloud” is not the important thing. What matters is that the master is a VM, as it normally does no checking and only provides the web front end and central configuration.
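As a rough illustration of the whole-site backup in point 2, here is a minimal Python sketch that wraps the `omd backup` command. The site name, target directory, and file naming scheme are made-up placeholders, and it assumes the script runs as root on the Checkmk server with `omd` on the PATH. Treat it as a starting point, not a tested backup procedure.

```python
import subprocess
from datetime import date


def backup_command(site, target_dir):
    """Build the `omd backup` command line for one site.

    The archive name (site plus today's date) is only an illustrative
    convention, not something Checkmk prescribes.
    """
    target = f"{target_dir}/{site}-{date.today().isoformat()}.tar.gz"
    return ["omd", "backup", site, target]


def run_backup(site, target_dir):
    """Run the backup and raise CalledProcessError if omd exits non-zero."""
    subprocess.run(backup_command(site, target_dir), check=True)


# Hypothetical usage on the master (site name and path are assumptions):
# run_backup("central", "/var/backups/checkmk")
```

The resulting archive can then be picked up by whatever backup system already covers your other machines.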

Thanks Andreas. I appreciate the time taken for your reply.

If I should lose the Master, for whatever reason, is it possible to promote a Slave to be a Master? Or, as a more general question, if you lost the console functionality of the Master, how would you go about recovering visibility of the whole estate while the Master was being fixed?

No, a slave cannot be promoted to a master.

Visibility can be restored with a quickly created new site that has only the Livestatus connections to your existing slaves. You cannot do config work there, but you can see the status of all slaves.
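To give an idea of what such a status-only view relies on, here is a minimal Python sketch that queries one slave directly over Livestatus. It assumes Livestatus over TCP is enabled on the slave, that it listens on port 6557 (the usual default), and that the connection is plain TCP without encryption. The host name in the usage comment is a placeholder, and the helper names are made up for this example.

```python
import json
import socket


def build_query(table, columns):
    """Return a Livestatus query string that asks for JSON output."""
    lines = [
        f"GET {table}",
        "Columns: " + " ".join(columns),
        "OutputFormat: json",
    ]
    # A blank line terminates a Livestatus query.
    return "\n".join(lines) + "\n\n"


def query_site(host, port, query, timeout=5.0):
    """Send one query to a Livestatus TCP socket and return the decoded rows."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8"))
        # Close our sending side so the server knows the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


# Hypothetical usage (host name is an assumption):
# rows = query_site("slave1.example.com", 6557, build_query("hosts", ["name", "state"]))
# for name, state in rows:
#     print(name, state)
```

A status-only site does the same thing through its configured site connections, so in practice you would set those up in the new site instead of scripting the queries yourself.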

So, in theory, I could have a full Master with Console (Livestatus) and WATO, plus a Livestatus-only instance as a Console? Is there a limit on the number of central sites that can poll the slaves via Livestatus?
