Automatic service discovery does not remove vanished services on cluster hosts

ttrafelet · May 10, 2022, 1:53pm

Hi all

We are running a distributed monitoring setup with CME 2.0.0p21.

To monitor a simple DNS solution using two nodes, we wrote a simple local check that checks whether domains are resolvable. The check outputs one service for each domain. Then, we created a cluster host on Checkmk and assigned the DNS services to it. This part works flawlessly. We get the clustered services on the cluster host and no longer see them on the nodes.
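For readers who want to reproduce the setup, a local check of this kind could look roughly like the sketch below. It is not the script used in this thread: the domain list, the service naming scheme ("DNS <domain>") and the helper `local_check_lines` are assumptions for illustration. The only Checkmk-specific part is the local check output format, one line per service with status, service name, metrics (or `-`) and status text.

```python
#!/usr/bin/env python3
"""Sketch of a Checkmk local check that reports one service per domain."""
import socket

# Hypothetical list of domains to test; replace with your own.
DOMAINS = ["example.com", "example.org"]


def local_check_lines(domains, resolver=socket.gethostbyname):
    """Return one local check line per domain.

    Line format: <status> "<service name>" <metrics or -> <status text>
    Status 0 is OK, status 2 is CRIT.
    """
    lines = []
    for domain in domains:
        try:
            address = resolver(domain)
            lines.append(f'0 "DNS {domain}" - {domain} resolves to {address}')
        except OSError as error:
            # socket.gaierror is a subclass of OSError
            lines.append(f'2 "DNS {domain}" - {domain} is not resolvable: {error}')
    return lines


if __name__ == "__main__":
    print("\n".join(local_check_lines(DOMAINS)))
```

Removing a domain from the list makes its service vanish on the node, which is the situation the rest of this thread is about.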

The next step we wanted to take was to automate the discovery of these DNS services, including automatic activation of changes. However, we are unable to get this configuration fully working.

The discovery rule looks as follows (it is the only one matching both nodes and the cluster object):

Currently, this rule has the following effect:

Discovery is run every 5 minutes
New services are automatically added
Vanished services are not automatically removed ← This is our issue.

Based on the configuration above, we would have expected vanished services to be removed automatically as well. This is not the case, though. The output of Check_MK Discovery looks as follows:

While troubleshooting this issue, we already tried the following, to no avail:

Removing the nodes from the discovery rule, leaving only the cluster itself
Removing the cluster from the discovery rule, leaving only the nodes
Using dedicated deny-/allowlists for adding/removing services

Have any of you got this working?
Are we missing something?

Kind regards
Thierry

mrei117 · January 10, 2023, 10:17am

I have the same problem. Does anyone have a solution? I don't know why, but periodic service discovery doesn't work for vanished services. I have tried several different methods.

ttrafelet · January 10, 2023, 10:32am

Completely forgot about this post…
We are still facing this issue with 2.1.0p14. So this seems not to be an issue with 2.0 specifically.

moritz · January 31, 2023, 7:04pm

Bad news: It’s currently not possible. Good news: We consider this a bug.

This is somewhat on purpose, but I don’t think all implications were thought through when this was implemented.
There is no service discovery on a cluster, really. The cluster's services are a configured subset of all its nodes' services. Accordingly, the “periodic service discovery” is only run on nodes. When the “Check_MK Discovery” service of a cluster detects changes, a discovery on all its nodes is scheduled.

Now, the data is not fetched from all nodes at precisely the same time; each node's dataset dates from whenever that node was last checked. If the dataset of node A was fetched before a service moved nodes, and the dataset of node B was fetched after that, the service's data is either seen twice (if the service moved from A to B) or not at all (if the service moved from B to A).
During discovery, vanished clustered services are therefore not discarded, because otherwise we might lose services that are merely moving from one node to another.
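The timing problem can be shown with a toy model. This is only an illustration of the reasoning above, not Checkmk code; the function `combined_view` and the service name are made up for the example. Node A's dataset is taken before the service moves, node B's dataset after.

```python
from collections import Counter

SERVICE = "DNS example.com"  # hypothetical clustered service

# Where the service lives at a given moment: node name -> services on it.
SERVICE_ON_A = {"A": {SERVICE}, "B": set()}
SERVICE_ON_B = {"A": set(), "B": {SERVICE}}


def combined_view(state_when_a_fetched, state_when_b_fetched):
    """Count how often each service shows up across both node datasets.

    Node A's dataset reflects the state at A's fetch time,
    node B's dataset reflects the state at B's (later) fetch time.
    """
    seen = Counter()
    seen.update(state_when_a_fetched["A"])
    seen.update(state_when_b_fetched["B"])
    return seen


# Service moves from A to B between the two fetches: seen on both nodes.
moved_a_to_b = combined_view(SERVICE_ON_A, SERVICE_ON_B)
# Service moves from B to A between the two fetches: seen on neither node.
moved_b_to_a = combined_view(SERVICE_ON_B, SERVICE_ON_A)

print("A -> B:", moved_a_to_b[SERVICE], "sighting(s)")
print("B -> A:", moved_b_to_a[SERVICE], "sighting(s)")
```

In the second case the service looks vanished even though it is running the whole time, which is why dropping it at that moment would be wrong.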

As a result, we see the behaviour that you are describing, which is not good. AFAIK there’s no way to solve this in an automated manner.
I discussed this with a colleague today, and we think the resulting problem might be just as big as, or even bigger than, the one solved here.

The only idea we currently have is to make this choice configurable: Either vanished clustered services are dropped from the node (at the risk of losing services during a fail-over) or they are kept (with the consequence you are experiencing).
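As a rough sketch of what such a switch would decide (the function and the option name `remove_vanished_clustered` are hypothetical and do not reflect how Checkmk implements or names this):

```python
def services_to_remove(vanished, clustered, remove_vanished_clustered):
    """Pick which vanished services a node's discovery may drop.

    vanished:  services that the node no longer reports
    clustered: services of that node that are assigned to a cluster
    remove_vanished_clustered: the hypothetical new option
    """
    vanished = set(vanished)
    if remove_vanished_clustered:
        # Drop everything that vanished, accepting that a service caught
        # in the middle of a fail-over may be removed by mistake.
        return vanished
    # Keep vanished clustered services, so they pile up until someone
    # removes them manually.
    return vanished - set(clustered)


vanished = {"DNS old.example.com", "Filesystem /tmp"}
clustered = {"DNS old.example.com", "DNS example.com"}

print(sorted(services_to_remove(vanished, clustered, False)))
print(sorted(services_to_remove(vanished, clustered, True)))
```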
What do you think?

ttrafelet · February 2, 2023, 1:02pm

Ah, I see. Now that you have explained it, it seems obvious why it behaves this way.
Actually, now that I think about it, it might be quite possible there is no best solution for this case.

I think your proposal to make this configurable would alleviate some (if not most) of the cases.
For us at least, this would solve our issue.

Just brainstorming here: Would it be possible to somehow make an automatic removal of vanished services on clusters dependent on whether all the nodes have been checked at least some X times?
This would allow extending your proposal by an option like “Only remove a vanished service if no nodes reported it during X check intervals” (where X could be configurable).
Or would this make no sense anyway?

(There are still edge cases, though. For example a service that is rapidly switching between the nodes, for whatever reason.)
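To make the idea concrete, here is a small sketch of such a counter. It is pure brainstorming code under my own assumptions: the class `VanishedTracker`, its `threshold` parameter (the X above) and the notion of one `observe` call per check interval are all invented for the example.

```python
class VanishedTracker:
    """Release a service for removal only after it has been missing
    on every node for `threshold` consecutive check intervals."""

    def __init__(self, known_services, threshold):
        self.threshold = threshold
        # Number of consecutive intervals in which no node reported the service.
        self.missing_rounds = {service: 0 for service in known_services}

    def observe(self, reported_by_any_node):
        """Record one check interval and return the services that may go."""
        removable = set()
        for service in list(self.missing_rounds):
            if service in reported_by_any_node:
                # Seen on some node: reset the counter.
                self.missing_rounds[service] = 0
                continue
            self.missing_rounds[service] += 1
            if self.missing_rounds[service] >= self.threshold:
                removable.add(service)
                del self.missing_rounds[service]
        return removable


tracker = VanishedTracker({"DNS a.example", "DNS b.example"}, threshold=3)
for interval in range(1, 4):
    removable = tracker.observe({"DNS a.example"})
    print(interval, sorted(removable))
```

A service that is missed in one interval because of a fail-over gets its counter reset as soon as any node reports it again, so it would survive. The rapidly switching service from the edge case above could still reach the threshold if it happens to be missed X times in a row.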

moritz · February 3, 2023, 9:26am

Thanks for your feedback! I’ll make this configurable.
I also thought about something similar to what you mentioned, but all those approaches are error-prone, hard to implement and maintain, and still (as you said) not 100% perfect. I think just making it configurable is a reasonable trade-off.
I’ll drop you a link to the Werk when I’m done.

moritz · February 8, 2023, 8:45pm

As promised: Periodic service discovery: Vanished clustered services can now be removed automatically
