Slave site host list got messed up

Hi There,

We have a distributed monitoring setup.
We monitor more than 17,000 hosts across 11 slave sites.
One of the sites got messed up (I don't know what happened).
Changes are not activating. It says: another restart in process. Aborting.
I tried running cmk -R, cmk -O and cmk -U, but the issue is still the same.
Later I noticed the number of hosts. Hosts from other sites are showing up as part of this site.
I am not sure how to remove them.

It should have only 800 hosts, but it is showing more than 8,000, including hosts that are monitored on other sites.

How can I fix this issue?
Any help would be appreciated.

Thanks,
Shivdev

When a “cmk -R” does not help anymore, you can try to stop the whole site (omd stop), kill all hanging processes (or even reboot the host), and start the site again.
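A rough sketch of that sequence, assuming it is run as the site user (after `su - <sitename>`), where `omd` acts on the current site. The function name `clean_restart` is made up for illustration, and the sketch only lists leftover processes instead of killing them, so you can look at them first.

```shell
# Sketch only: stop the site, show what is still running, start it again.
# Assumes the current user is the site user.
clean_restart() {
    omd stop
    # Processes still owned by the site user after the stop. Your own login
    # shell will show up here too; anything else may be a hanging process.
    ps -u "$(id -un)" -o pid=,args=
    omd start
}
```

Call `clean_restart` once and check the process list between the stop and start output. If something monitoring-related is still listed, stop the site again and kill it by PID before starting.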

If a site is really messed up and cannot be repaired, for whatever reason: a remote (slave) site in a Distributed WATO setup is essentially dumb. You can always remove it (omd rm), create a new site (omd create) and reconnect the central site to it. You will just lose the historic monitoring data.

For this problem I would check the file “~/etc/check_mk/conf.d/distributed_wato.mk”.
Inside it you should find the name of your slave site.
This name should also be found inside “~/etc/check_mk/multisite.d/sites.mk”.
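A small sketch of that check. The helper name `check_wato_site_files` is hypothetical, and it assumes the two paths above sit under the site's home directory. If sites.mk is not present on the site being checked, that part is skipped.

```shell
# Hypothetical helper: check that the slave-site marker file exists and that
# the site name appears in both files mentioned above.
# $1 = site home directory, $2 = expected site name.
check_wato_site_files() {
    site_home=$1
    site_name=$2
    wato_mk="$site_home/etc/check_mk/conf.d/distributed_wato.mk"
    sites_mk="$site_home/etc/check_mk/multisite.d/sites.mk"

    if [ ! -f "$wato_mk" ]; then
        echo "missing: $wato_mk"
        return 1
    fi
    # Look for a line like: distributed_wato_site = 'sitename'
    if ! grep -q "distributed_wato_site *= *['\"]$site_name['\"]" "$wato_mk"; then
        echo "site name '$site_name' not set in $wato_mk"
        return 1
    fi
    # Only checked when the file exists on this site.
    if [ -f "$sites_mk" ] && ! grep -q "$site_name" "$sites_mk"; then
        echo "site name '$site_name' not found in $sites_mk"
        return 1
    fi
    echo "ok"
}
```

As the site user this could be called as `check_wato_site_files "$OMD_ROOT" "$OMD_SITE"`, assuming those two environment variables are set in your site shell.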

Looks like that file is missing

Would it be just as simple as adding the file back with the site name?

distributed_wato_site = 'sitename'
is_wato_slave_site = True

Well, that worked :)
I added the file and changed the site name,
followed by cmk -R, cmk -O and omd restart.
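For anyone finding this later, the fix could be scripted roughly like this. `restore_wato_marker` is a made-up helper name, and the two lines it writes are the ones quoted earlier in this thread. It refuses to overwrite an existing file.

```shell
# Hypothetical helper: recreate the slave-site marker file if it is missing.
# $1 = site home directory, $2 = site name as known to the central site.
restore_wato_marker() {
    target="$1/etc/check_mk/conf.d/distributed_wato.mk"
    if [ -e "$target" ]; then
        echo "exists, leaving untouched: $target"
        return 0
    fi
    printf "distributed_wato_site = '%s'\nis_wato_slave_site = True\n" "$2" > "$target"
    echo "written: $target"
}

# Afterwards, as the site user, the steps from the post above:
#   cmk -R
#   cmk -O
#   omd restart
```

The site name must match the one the central site uses for this slave, otherwise the host list will not be filtered correctly.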

If I had read your comment earlier, I would have suggested this solution too :)
