Issue with cached local checks in a cluster environment (2.0.0p12)

T.Schmitz · October 7, 2021, 9:20am

Hi all,

We upgraded our environment from 1.6.0p21 to 2.0.0p12 (CEE) and noticed strange behavior with cached local checks that run in a cluster environment (rules “Clustered services” and “Clustered services for overlapping clusters”).

The local check script is located on both cluster nodes in the directory lib/local/300, so its output should be cached for 5 minutes (300 seconds).
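For readers unfamiliar with the setup, here is a minimal sketch of what such a local check could look like. The service name Demo_disk, the thresholds, and the disk-usage measurement are all made up for illustration and are not the script from this thread. It prints one line in the local check format (state, service name, metrics, status text). Placing the script in a subdirectory named 300 is what asks the agent to cache its output for 300 seconds.

```python
#!/usr/bin/env python3
"""Hypothetical local check; name, thresholds and measurement are assumptions."""
import shutil


def local_check_line(service, used_pct, warn=80.0, crit=90.0):
    """Build one local check line: <state> <service> <metrics> <status text>."""
    if used_pct >= crit:
        state = 2  # CRIT
    elif used_pct >= warn:
        state = 1  # WARN
    else:
        state = 0  # OK
    return f"{state} {service} used_pct={used_pct:.1f} Disk usage at {used_pct:.1f}%"


if __name__ == "__main__":
    # Measure the root filesystem; any other measurement would work the same way.
    usage = shutil.disk_usage("/")
    print(local_check_line("Demo_disk", 100.0 * usage.used / usage.total))
```

The same script has to exist on every node of the cluster so that each node reports the service under the same name.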

The rule “Local checks in Checkmk clusters” is set to “Best state” for these services.
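To make the expectation explicit, the following sketch shows what "Best state" is expected to do with the per-node results. It is only an illustration, not Checkmk's implementation. The function name, the node names, and the ranking OK, WARN, UNKNOWN, CRIT (best to worst) are assumptions.

```python
from typing import Dict, Tuple

# Assumed ranking from best to worst: OK(0), WARN(1), UNKNOWN(3), CRIT(2).
BEST_TO_WORST = (0, 1, 3, 2)


def best_state(node_states: Dict[str, int]) -> Tuple[str, int]:
    """Return (node, state) of the node that reports the best state."""
    if not node_states:
        raise ValueError("no node reported the service")
    # Lower position in BEST_TO_WORST means a better state.
    return min(node_states.items(), key=lambda kv: BEST_TO_WORST.index(kv[1]))


if __name__ == "__main__":
    print(best_state({"node1": 2, "node2": 0}))
```

With this setting, a cluster where node1 reports CRIT and node2 reports OK should show the result from node2.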

From time to time, the cluster service returns the data from the wrong node (the one with the worst state):

- When the service is in the output of both nodes, the state of the wrong node is reported.
- When the service is in the output of only one node of the cluster, the service state goes to Unknown (Item not found in monitoring data).

The issue exists for only one check cycle, so we can use the “Maximum number of check attempts for service” rule as a workaround to suppress false notifications.
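The reason this workaround helps can be shown with a simplified model of check attempts. The sketch assumes that a problem only counts once the service has been non-OK for max_attempts consecutive cycles. The function name and the sample state sequences are made up, and the monitoring core's real logic is more involved.

```python
def hard_problem_cycles(states, max_attempts):
    """Return the cycle indices at which a problem is confirmed.

    states: one state per check cycle (0 = OK, anything else = problem).
    max_attempts: consecutive non-OK results needed before the problem counts.
    """
    confirmed = []
    consecutive = 0
    for index, state in enumerate(states):
        if state == 0:
            consecutive = 0  # recovery resets the attempt counter
        else:
            consecutive += 1
            if consecutive == max_attempts:
                confirmed.append(index)
    return confirmed


if __name__ == "__main__":
    glitch = [0, 3, 0, 0]  # a single cycle with a wrong UNKNOWN result
    print(hard_problem_cycles(glitch, 1))
    print(hard_problem_cycles(glitch, 2))
```

In this model a one-cycle glitch is confirmed with one attempt but never with two, while a real problem that lasts several cycles is still confirmed, just one cycle later.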

The issue does not occur when the same script runs as a cached local check in a non-cluster environment, or when we use it as a local check in a cluster environment without caching.

Does anyone have an idea what the root cause could be?

Best Regards
Thomas

ChristianM · November 3, 2021, 9:26am

Hi Thomas,
I could reproduce your problem. It is strange behaviour that the cached data is lost for one cycle and “Item not in agent output” occurs.
Can anybody else confirm it, also at triebe29?
Best Regards,
Christian

moritz · November 9, 2021, 8:33am

I’m on it.
I think this is already fixed in the daily builds.

moritz · November 11, 2021, 1:50pm

It should be fixed now: "Item not found" for cached local checks on clusters
