Automatic service discovery does not remove vanished services on cluster hosts

moritz · January 31, 2023, 7:04pm

Bad news: It’s currently not possible. Good news: We consider this a bug.

This is somewhat on purpose, but I don’t think all implications were thought through when this was implemented.
There is no service discovery on a cluster, really. The cluster’s services are a configured subset of all of its nodes’ services. Accordingly, the “periodic service discovery” is only run on nodes. When the “Check_MK Discovery” service of a cluster detects changes, a discovery on all its nodes is scheduled.

Now, the data is not fetched from all nodes at precisely the same time, but from whenever each node was last checked. If the dataset of node A was fetched before a service moved nodes, and the dataset of node B is fetched after that, the service’s data is either seen twice (if the service moved from A to B) or not at all (if the service moved from B to A).
During discovery, vanished clustered services are therefore not discarded, because otherwise we might lose services that are merely moving from one node to another.
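
To make the timing problem concrete, here is a toy simulation in Python. It is not Checkmk code; the function names and timestamps are made up purely for illustration. It only models "which node's snapshot contains the service" when the snapshots are taken at different times.

```python
# Toy model, not Checkmk code: all names and timestamps are hypothetical.

def node_hosting(service_moves_at, from_node, to_node, t):
    """Return the node that runs the service at time t."""
    return from_node if t < service_moves_at else to_node


def cluster_view(fetch_times, service_moves_at, from_node, to_node):
    """Return the nodes whose (non-simultaneous) snapshots contain the service."""
    return sorted(
        node
        for node, fetched_at in fetch_times.items()
        if node_hosting(service_moves_at, from_node, to_node, fetched_at) == node
    )


# Node A's data was fetched at t=10, node B's at t=30; the service moves at t=20.
fetch_times = {"A": 10, "B": 30}

seen_twice = cluster_view(fetch_times, 20, from_node="A", to_node="B")
not_seen = cluster_view(fetch_times, 20, from_node="B", to_node="A")

print(seen_twice)  # ['A', 'B']
print(not_seen)    # []
```

In the second case the service is running fine, yet neither snapshot contains it. Discarding it at that moment would remove a healthy service from the cluster.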

As a result, we see the behaviour that you are describing, which is not good. AFAIK there’s no way to solve this in an automated manner.
I discussed this with a colleague today, and we think the problem this creates might be just as big as, or even bigger than, the one it solves.

The only idea we currently have is to make this choice configurable: Either vanished clustered services are dropped from the node (at the risk of losing services during a fail-over) or they are kept (with the consequence you are experiencing).
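
Roughly, the trade-off would look like the sketch below. Again, this is not Checkmk code: the option name `drop_vanished_clustered` and the function are invented here just to show the two behaviours side by side.

```python
# Hypothetical sketch, not Checkmk code: the option and function names are invented.

def services_after_discovery(known, found, drop_vanished_clustered):
    """Return the clustered services a node keeps after a discovery run.

    known: clustered services currently configured on the node
    found: clustered services present in the freshly fetched data
    """
    if drop_vanished_clustered:
        # Vanished services are removed; a service in mid fail-over may be lost.
        return set(found)
    # Behaviour described in this thread: vanished services are kept.
    return set(known) | set(found)


known = {"db", "web"}
found = {"web"}  # "db" was not in this node's data

dropped = sorted(services_after_discovery(known, found, drop_vanished_clustered=True))
kept = sorted(services_after_discovery(known, found, drop_vanished_clustered=False))

print(dropped)  # ['web']
print(kept)     # ['db', 'web']
```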
What do you think?