Best Practices Backup/Restore checkmk

Hello,
I have a general question about backup/restore of a distributed checkmk monitoring environment. We currently have one master and 4 slaves, probably 6 slaves in the medium term. These sites run on VMware systems with RHEL7. I have defined a local backup job on each site, and additionally a VMware snapshot of the respective VM is created once a week. Do you have experience with restoring VMware snapshots? Is there anything to consider? Must/should checkmk always be stopped when a VMware snapshot is created?
In case of content problems that cannot easily be fixed, is it better to restore a VMware snapshot or a backup that was created by checkmk itself?

I am grateful for any tips and experiences regarding backup/restore of checkmk environments.

Best regards
Christian

If my monitoring servers run inside VMware or Hyper-V, I use the normal backup and restore mechanism of the infrastructure. That means these machines are also included in the normal backup schedule (daily, weekly and so on).
If you use something like Veeam for backups, your monitoring slave is back up in under 5 minutes from the backup. That is the biggest advantage over a “normal” site backup and restore from the tar archive.
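
For comparison, the "normal" site backup to a tar archive can also be scripted. This is only a minimal sketch, assuming the `omd` command is on the PATH and the script runs with sufficient privileges; the function names, the site name and the target directory are made-up placeholders, and the date-based archive name is just one possible convention.

```python
import subprocess
from datetime import date
from pathlib import PurePosixPath
from typing import List


def build_backup_command(site: str, target_dir: str, day: date) -> List[str]:
    """Build the `omd backup SITE ARCHIVE` call for one site."""
    # Archive name "<site>-<YYYY-MM-DD>.tar.gz" is an assumption, not a checkmk rule.
    archive = PurePosixPath(target_dir) / f"{site}-{day.isoformat()}.tar.gz"
    return ["omd", "backup", site, str(archive)]


def run_backup(site: str, target_dir: str) -> None:
    """Run the backup; raises CalledProcessError if omd exits non-zero."""
    subprocess.run(build_backup_command(site, target_dir, date.today()), check=True)
```

Calling `run_backup("mysite", "/var/backups/checkmk")` from a daily cron job would then write one archive per day. The VM-level backup still wins on restore time, as described above.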

Hello @andreas-doehler,
does that mean you can simply snapshot and restore a running site? Sometimes applications don’t like it when they are backed up during operation or when a VM snapshot is created.

Anyway, very good news that this is no problem with checkmk.

CMK has no database or anything like that. The only data lost is the same data that is also lost with the normal site backup (the /tmp directory).
But that is not important, as most of the time you only make a backup once a day anyway.
What is very handy is the Veeam method of creating a standby VM on another host and updating it once per hour or so. That’s a cluster with a possible data loss of one hour. Not bad for doing nothing :slight_smile:
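
To make the "one hour" figure concrete: the data lost after a failover is whatever was collected since the last update of the standby VM, so the worst case equals the update interval. A small sketch of that arithmetic, with hypothetical function names and example timestamps:

```python
from datetime import datetime, timedelta


def worst_case_loss(sync_interval: timedelta) -> timedelta:
    """If the primary fails just before the next sync, a full interval is lost."""
    return sync_interval


def actual_loss(last_sync: datetime, failure: datetime) -> timedelta:
    """Data collected between the last standby update and the failure is lost."""
    if failure < last_sync:
        raise ValueError("failure time must not be earlier than the last sync")
    return failure - last_sync


# Example: standby updated at 10:00, primary fails at 10:40.
lost = actual_loss(datetime(2020, 1, 31, 10, 0), datetime(2020, 1, 31, 10, 40))
```

With an hourly update, `worst_case_loss(timedelta(hours=1))` is one hour, compared to up to 24 hours with a single daily backup.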

Hi @andreas-doehler
thanks for the information. One more concrete question, then I think I’m done :smiley:
For example, suppose I restore a slave site from a VM snapshot and, in the meantime, changes have been made within checkmk. How does the site react when it is started again? Does the master simply push the missing changes back to the site?
And the same question if the master needs to be restored but the sites have newer changes than the master snapshot: what does that look like?
I’m planning some backup/restore tests in my test environment, but maybe you can answer these questions before I start.

Thanks,
Christian

You need to create a dummy rule or make some other change that marks your restored site as “dirty”; then the push will occur. If you make no changes after the restore, the site will run with the old config until something is changed that needs a config push.

This would be “bad”, as the old config from the restored master overwrites the newer slave config.
For the master, more frequent snapshots, like Veeam can do, would be good.
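
The two restore cases can be illustrated with a toy model. This is not checkmk code, only a sketch of the behaviour described above: the class, the function and the version numbers are invented for illustration, and "dirty" stands for "the master has pending changes for this site".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    name: str
    config_version: int
    dirty: bool = False  # True when the master has pending changes for this site


def activate_changes(master_version: int, sites: List[Site]) -> List[str]:
    """Push the master's config to every dirty site; clean sites are left alone."""
    pushed = []
    for site in sites:
        if site.dirty:
            site.config_version = master_version
            site.dirty = False
            pushed.append(site.name)
    return pushed


# Case 1: slave restored from an old snapshot (version 3), master is at version 5.
slave = Site("slave1", config_version=3)
no_push = activate_changes(5, [slave])           # nothing pending, slave stays at 3
slave.dirty = True                               # e.g. after adding a dummy rule
after_dummy_rule = activate_changes(5, [slave])  # slave is brought up to version 5

# Case 2: master restored to version 4 while the slave already runs version 5.
newer_slave = Site("slave2", config_version=5, dirty=True)
activate_changes(4, [newer_slave])               # slave falls back to version 4
```

Case 2 is the reason why a recent snapshot of the master matters more than a recent snapshot of a slave: a restored slave can be brought up to date by the next push, while a restored master pushes its older state to the slaves.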

Okay, thank you so much!