We have a distributed monitoring setup.
We have been monitoring more than 17,000 hosts across 11 slave sites.
One of those sites got messed up (I don't know what happened).
Changes no longer activate on that site. The message is: another restart in process. Aborting.
I tried running cmk -R, cmk -O and cmk -U, but the issue is still the same.
Later I noticed the number of hosts: hosts from other sites are showing up as part of this site.
I'm not sure how to remove them.
It should have only 800 hosts, but it shows more than 8,000, including hosts that are monitored in other sites.
When a “cmk -R” does not help anymore, you can try to stop the whole site (omd stop), kill all hanging processes (or even reboot the host), and start the site again.
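In case it helps, the stop / kill / start sequence could look roughly like the sketch below. It is only a sketch under a few assumptions: it is run as root on the affected host, "mysite" is a placeholder for your site name, and the run helper and DRY_RUN switch are made up for this example so that nothing is executed unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: run as root on the host of the affected site.
# "mysite" is a placeholder; DRY_RUN=1 (the default) only prints the commands.
SITE="${SITE:-mysite}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_site() {
    # $1 = site name (the site user has the same name as the site)
    run omd stop "$1"
    # List whatever is still running as the site user after the stop.
    run pgrep -u "$1" -l
    # Send SIGTERM to those leftover processes.
    run pkill -u "$1"
    run omd start "$1"
}

restart_site "$SITE"
```

Run it once as-is to see the commands, then again with DRY_RUN=0 SITE=yoursite if they look right. If processes survive the pkill, a reboot of the host is the simpler option.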
If a site is really messed up and cannot be repaired, for whatever reason: a remote (slave) site in a Distributed WATO setup is essentially dumb. You can always remove it (omd rm), create a new site (omd create) and reconnect the central site to it. You will just lose the historic monitoring data.
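A rough sketch of the remove-and-recreate route, again run as root on the remote host. "mysite" is a placeholder, the run helper and DRY_RUN switch are made up for this example, and the LIVESTATUS_TCP line is an assumption that the central site reaches this site via Livestatus over TCP; adjust or drop it to match your setup.

```shell
#!/bin/sh
# Sketch only: run as root on the remote host.
# WARNING: this deletes the site, including its historic monitoring data.
# "mysite" is a placeholder; DRY_RUN=1 (the default) only prints the commands.
SITE="${SITE:-mysite}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recreate_site() {
    # $1 = site name
    run omd stop "$1"
    run omd rm "$1"       # may prompt for confirmation
    run omd create "$1"
    # Assumption: the central site connects via Livestatus over TCP.
    run omd config "$1" set LIVESTATUS_TCP on
    run omd start "$1"
}

recreate_site "$SITE"
```

After that, the connection is set up again from the central site's distributed monitoring configuration, and the next activation pushes the configuration to the fresh site.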
For this problem I would check the file “~/etc/check_mk/conf.d/distributed_wato.mk”.
Inside you should find the name of your slave site.
This name should also be found inside the “~/etc/check_mk/multisite.d/sites.mk”
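A quick way to compare the two files could look like the sketch below, run as the site user on the affected site. It assumes the site ID appears in both files as a quoted string, which I have not verified for every version; the site_in_file helper is made up for this example, and "mysite" is only a fallback placeholder in case OMD_SITE is not set.

```shell
#!/bin/sh
# Sketch only: run as the site user on the affected site.
# Assumption: the site ID appears in both files as a quoted string.

# Hypothetical helper: succeeds if the quoted site ID occurs in the file.
# $1 = site ID, $2 = file to search
site_in_file() {
    grep -q -E "[\"']$1[\"']" "$2"
}

SITE_ID="${OMD_SITE:-mysite}"             # set automatically for site users
BASE="${OMD_ROOT:-$HOME}/etc/check_mk"

for f in "$BASE/conf.d/distributed_wato.mk" "$BASE/multisite.d/sites.mk"; do
    if [ ! -f "$f" ]; then
        echo "missing: $f"
    elif site_in_file "$SITE_ID" "$f"; then
        echo "ok: $SITE_ID found in $f"
    else
        echo "CHECK: $SITE_ID not found in $f"
    fi
done
```

If the first file names a different site than the one you are logged in to, that would be a plausible reason for the site showing hosts that belong elsewhere.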