Check_MK Distributed Monitoring - Notifications crashed - backup, restore, push configuration to remote site - python script problem

CMK version: 2.0.0p34 (slave), 2.0.0p24 (master)
OS version: Ubuntu 22.04 LTS (slave), CentOS 7 (master)

Error message:

Hi, we have a problem with distributed monitoring. Our slave host died due to a hardware failure, and we restored the virtual machine from an array snapshot copy. Everything seemed fine and all configuration files were restored, but our alerts are not reaching the master host, so there are no e-mail notifications. We figured out that our site is broken, because when we create a fresh, blank site on the same machine, connect it to distributed monitoring and add a single host, notifications work perfectly. I think we have tried everything, with no success. We have even copied the raw files from one site to the other with all their dependencies, but as soon as we edit a host, everything disappears.

So we basically created a new machine with the same initial server, Checkmk and network configuration. Our CMK master has all the rules and hosts. After connecting the new site, all hosts are visible (without services) and we cannot log in to the remote site.


The question is: how can we push those changes to the slave host without running into the 110 s timeout?

We even restored from an omd backup, but it carries the Python errors over with it.
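For reference, the backup/restore path we tried looks roughly like this. This is only a sketch; the site name `mysite` and the archive path are placeholders, and the Checkmk version installed on the target machine must match the version recorded in the backup:

```shell
# On the old (snapshot-restored) machine: stop the site and take a full backup.
# Run as root; "mysite" is a placeholder site name.
omd stop mysite
omd backup mysite /tmp/mysite.tar.gz

# Copy the archive to the replacement machine, then restore and start it there.
omd restore /tmp/mysite.tar.gz
omd start mysite
```

In our case this restored the site faithfully, including whatever was already broken, which is why the Python errors came along with it.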

Is there any other way to salvage the situation, or do we need to manually configure the hosts, services and agents all over again?

The remote site had almost 7k services. I even added RAM and CPU to both machines, but we still cannot save our configuration.

Below I attach what we could find:

Sincerely,

Hi @blazej.s!
I read over your description and I think this is way out of scope for the forum, just speaking from a time and complexity point of view. Without diagnostic data or a support session, this will be next to impossible to troubleshoot. Maybe one of the gurus in here has ideas anyway.
But I recommend you open a support ticket, so a support engineer can look into this matter in more depth and due time.

Hi @robin.gierse and thanks for the reply.

After some more attempts we solved our problem in a pretty good way.

The solution in distributed monitoring is to create a new virtual machine, or a second site with a different livestatus port. The next step is to add a new client and a new distributed connection, leaving the existing one as it is. After successfully connecting the new site (B), go to Setup → Hosts, open the properties of the folder holding the old site's (A) configuration, and change "Monitored on site" from A to B.
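The site-creation part of this can be sketched on the command line. This is a minimal sketch, assuming the new site runs on the same machine as the old one; the site name `siteB` and port `6558` are placeholders you would adapt:

```shell
# Create the replacement site on the same machine (run as root).
omd create siteB

# Expose livestatus on a TCP port different from the old site's,
# so both sites can run in parallel during the migration.
omd config siteB set LIVESTATUS_TCP on
omd config siteB set LIVESTATUS_TCP_PORT 6558

omd start siteB
```

The new site is then added on the master as an additional distributed monitoring connection, pointing at the new livestatus port, while the old connection is left untouched.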


Next, we activate all the pending changes in our distributed monitoring so that everything is consistent. When the monitored site is changed on the folder, every host is moved from site A to site B, with only the host status or a PING check being displayed, depending on the configuration. At this step, the rules are not yet applied.

And this is where things get tricky. All the hosts are missing their services and rules, even though they are still present in the master configuration. As a test, pick a host that you remember had some services, then run a service discovery on it; after a few seconds a list of services that are already monitored, disabled, or new appears. The trick is to add, remove or disable one of those services so that the data is refreshed and the rules are assigned. Then go to Activate changes, and every rule that was applied on site A is carried over to site B for that host.

That was our test. If it works, go to Setup → Hosts and run a bulk discovery on folder B. Our environment has 362 hosts with about 6k services, and the bulk discovery finished after 7 hours. After some time, go to Setup → Background jobs, find the discovery job and check that its status is finished.
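If the GUI bulk discovery keeps timing out, the same rediscovery can, as far as I know, also be driven from the remote site's own command line with the `cmk` tool. A minimal sketch, assuming you are the site user on the remote site and `hosts.txt` (a placeholder file) lists one host name per line:

```shell
# Run as the site user on the remote site (e.g. after: su - siteB).
# cmk -II drops the existing services of each host and rediscovers them.
while read -r host; do
    cmk -II "$host"
done < hosts.txt

# Recompile the configuration and reload the Microcore.
# On the Raw Edition (Nagios core), use "cmk -R" instead.
cmk -O
```

This bypasses the web request timeout entirely, since the discovery runs locally on the remote site rather than through an activation from the master.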


Then we have to repeat the tricky step, because nothing will show up under our pending changes. So select a single host, run service discovery, add/remove/disable a service, go to the changes and activate them. If a bulk run was started and it finished, it covered all the hosts in folder B and its subfolders, even if that information is not displayed.

Relax: if the bulk discovery completed, even with errors, it just means that CMK could not run discovery on some hosts because their services are already being monitored, or because of the Python script errors.

Then, as if by a magic wand, all hosts have their services and rules properly moved from site A to site B. The only disadvantage is that we lose all historical monitoring data, but we do not have to configure everything from scratch or reinstall the OS.

One last thing: in distributed monitoring, both site A and site B need to have the Enable replication → Push configuration to this site option enabled.

When we migrated, we did not change any folders or rules, so everything was moved over 1:1.


If anyone has any questions, please feel free to ask. I hope what I have described here will be helpful to someone and save many hours or days of repair time.

So far I have not deleted folder A; if I do, I will update this thread to confirm whether everything still works fine.

Sincerely,

I am uncertain if I got everything, but this whole situation sounds very weird.
I have a very strong feeling there are problems way underneath the effects you see.
I am talking misconfigurations, misunderstandings and maybe even modifications under the hood.
Without a live view on the environment it will be impossible to give definitive statements though.
But I am quite certain that the workarounds you describe should not be necessary to replace a failed Checkmk server. Far from it: if there was no tampering under the hood, that replacement is very straightforward.

Anyway, I am glad you could fix your situation.