Ah, I see. Now that you've explained it, it seems obvious why it behaves this way.
Actually, now that I think about it, there may well be no best solution for this case.
I think your proposal to make this configurable would cover some (if not most) of the cases.
For us at least, this would solve our issue.
Just brainstorming here: would it be possible to make the automatic removal of vanished services on clusters depend on whether all nodes have been checked at least X times?
This would allow extending your proposal by an option like “Only remove a vanished service if no nodes reported it during X check intervals” (where X could be configurable).
Or would this make no sense anyway?
(There are still edge cases, though. For example, a service that rapidly switches between the nodes, for whatever reason.)
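
To make the idea a bit more concrete, here is a rough sketch of the "no node reported it during X check intervals" rule. Everything in it is hypothetical: `VanishTracker`, `record_round` and `threshold` are names I made up for illustration, and it assumes that one "round" means every node of the cluster has been checked once. It is not how the actual discovery works, just the counting logic I have in mind.

```python
from dataclasses import dataclass, field


@dataclass
class VanishTracker:
    """Hypothetical helper, not part of any real monitoring API."""

    threshold: int  # X: consecutive full rounds without any node reporting the service
    missed: dict = field(default_factory=dict)  # service name -> consecutive missed rounds

    def record_round(self, known, reports):
        """Process one round in which every node was checked once.

        known:   set of service names currently configured on the cluster
        reports: dict mapping node name -> set of services that node reported
        Returns the set of services that reached the threshold and may be removed.
        """
        # A service counts as "seen" if at least one node reported it this round.
        seen = set().union(*reports.values())
        removable = set()
        for service in known:
            if service in seen:
                self.missed[service] = 0  # any sighting resets the counter
            else:
                self.missed[service] = self.missed.get(service, 0) + 1
                if self.missed[service] >= self.threshold:
                    removable.add(service)
        return removable


tracker = VanishTracker(threshold=3)
known = {"db", "web"}
# "db" moves between the nodes but is always reported by one of them; "web" is gone.
tracker.record_round(known, {"node1": {"db"}, "node2": set()})
tracker.record_round(known, {"node1": set(), "node2": {"db"}})
print(tracker.record_round(known, {"node1": {"db"}, "node2": set()}))  # {'web'}
```

This does not fix the rapidly switching case: if a service happens to sit on the other node every time a node is checked, it would still be counted as missed. A larger X only makes that less likely.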