Hi all,
We upgraded our environment from 1.6.0p21 to 2.0.0p12 (CEE) and noticed strange behavior in cached local checks that run in a cluster environment (rules “Clustered services” and “Clustered services for overlapping clusters”).
The local check script is located on both cluster nodes in the directory lib/local/300, so its output should be cached for 5 minutes (300 seconds).
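For illustration, here is a minimal sketch of what such a local check could look like in Python. This is not our actual script: the service name `Cluster_demo` and the helper `format_local_line` are made up for the example. It only shows the usual one-line local check format (status, service name, metrics or `-`, summary text).

```python
#!/usr/bin/env python3
# Hypothetical local check, placed on both nodes under lib/local/300
# so that the agent caches its output for 300 seconds.


def format_local_line(status, service_name, text, metrics="-"):
    """Build one local check line: <status> <service_name> <metrics> <text>.

    status: 0 = OK, 1 = WARN, 2 = CRIT, 3 = UNKNOWN
    service_name: must not contain spaces in this simple (unquoted) form
    metrics: "-" when the check reports no performance data
    """
    if status not in (0, 1, 2, 3):
        raise ValueError(f"invalid status: {status}")
    if " " in service_name:
        raise ValueError("service name must not contain spaces")
    return f"{status} {service_name} {metrics} {text}"


if __name__ == "__main__":
    # Assumed example result; a real script would determine the state itself.
    print(format_local_line(0, "Cluster_demo", "all good"))
```

The same script is deployed on both nodes, so each node can report the service with its own state.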
The rule “Local checks in Checkmk clusters” is set to “Best state” for these services.
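To make clear what we expect from “Best state”, here is a rough sketch in Python. It is only our understanding of the intended result, not Checkmk's actual implementation. The function `best_state`, the node names, and the severity ranking (OK, then WARN, then UNKNOWN, then CRIT) are assumptions for the example.

```python
# Assumed ranking from best to worst: OK(0) < WARN(1) < UNKNOWN(3) < CRIT(2).
SEVERITY = {0: 0, 1: 1, 3: 2, 2: 3}


def best_state(node_results):
    """Pick the least severe result among the cluster nodes.

    node_results maps a node name to a (state, summary) tuple and contains
    only the nodes that actually reported the service.
    Returns (node, state, summary) of the best result.
    """
    if not node_results:
        raise ValueError("service not reported by any node")
    node = min(node_results, key=lambda n: SEVERITY[node_results[n][0]])
    state, summary = node_results[node]
    return node, state, summary


if __name__ == "__main__":
    # Hypothetical node names and results.
    print(best_state({"node1": (2, "down"), "node2": (0, "all good")}))
    print(best_state({"node1": (1, "degraded")}))
```

Under this reading, a service that is reported by only one node should simply take that node's state, and a service reported by both nodes should take the better of the two.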
From time to time, the cluster service returns the data from the wrong node (the one with the worst state):
- When the service is in the agent output of both nodes, the result from the wrong node is reported.
- When the service is in the agent output of only one cluster node, the service state goes to Unknown (Item not found in monitoring data).
The issue lasts for only one check interval, so we can use the “Maximum number of check attempts for service” rule as a workaround to suppress the wrong notifications.
The issue does not occur when the same script runs as a cached local check in a non-cluster environment, or when we use it as a local check in a cluster environment without caching.
Does anyone have an idea what the root cause of this issue could be?
Best Regards
Thomas