Hello,
After our latest Kubernetes update, we have been experiencing very frequent outages of the connection between Checkmk and the cluster collector:
Cluster collector version: 1.7.0, Nodes with container collectors: 3/3, Nodes with machine collectors: 3/3, Container Metrics: OK, Machine Metrics: Setup Error**CRIT**
or:
Cluster collector version: 1.7.0
Nodes with container collectors: 4/4
Nodes with machine collectors: 4/4
Container Metrics: OK
Machine Metrics: Setup Error(Failure to establish a connection to cluster collector at URL [<redacted>](https://<redacted>/machine_sections) )**CRIT**
Node: <redacted>(Container Metrics: Checkmk_kube_agent v1.7.0, cadvisor_version v0.47.2; Machine Sections: Checkmk_kube_agent v1.7.0, checkmk_agent_version 2.2.0p12)
Node: <redacted>(Container Metrics: Checkmk_kube_agent v1.7.0, cadvisor_version v0.47.2; Machine Sections: Checkmk_kube_agent v1.7.0, checkmk_agent_version 2.2.0p12)
Node: <redacted>(Container Metrics: Checkmk_kube_agent v1.7.0, cadvisor_version v0.47.2; Machine Sections: Checkmk_kube_agent v1.7.0, checkmk_agent_version 2.2.0p12)
Node: <redacted>(Container Metrics: Checkmk_kube_agent v1.7.0, cadvisor_version v0.47.2; Machine Sections: Checkmk_kube_agent v1.7.0, checkmk_agent_version 2.2.0p12)
You can see the outages in the memory resource usage metrics.
We have not managed to fix the error. We tried combinations of Kubernetes 1.27 and 1.28 with Helm chart versions 1.5 and 1.7. Since the update, the nodes use a different VM template that contains the updated Kubernetes version.
Our Checkmk version is 2.3.0p20.
Even with debug logging enabled, we found no clues. From what we can see in ArgoCD, the pods are running without problems.
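One way to timestamp the connection failures independently of Checkmk might be a small probe loop against the same `machine_sections` URL. The following is only a rough sketch, not something taken from the Checkmk tooling: the function names `probe_once`, `sample` and `outage_windows` are made up for illustration, the URL and token have to be supplied by whoever runs it, and it assumes the endpoint accepts a service account token as a bearer token. TLS verification problems would also show up as failed connections here.

```python
import time
import urllib.error
import urllib.request
from typing import List, Tuple


def probe_once(url: str, token: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Send one GET request and report whether a connection was established."""
    # Assumption: the collector accepts the token as a bearer token.
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so the connection itself worked.
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, TLS error and similar.
        return False, repr(exc)


def sample(url: str, token: str, count: int, interval: float = 10.0) -> List[Tuple[float, bool]]:
    """Probe the URL `count` times and return (timestamp, ok) samples."""
    samples = []
    for _ in range(count):
        ok, detail = probe_once(url, token)
        now = time.time()
        print(f"{time.strftime('%H:%M:%S', time.localtime(now))} ok={ok} {detail}")
        samples.append((now, ok))
        time.sleep(interval)
    return samples


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse consecutive failed samples into (first_failure, last_failure) windows."""
    windows = []
    start = None
    last_fail = None
    for timestamp, ok in samples:
        if not ok:
            if start is None:
                start = timestamp
            last_fail = timestamp
        elif start is not None:
            windows.append((start, last_fail))
            start = None
    if start is not None:
        # The series ended while the connection was still failing.
        windows.append((start, last_fail))
    return windows
```

Calling `outage_windows(sample(url, token, count=60))` from the Checkmk server would then give time windows that can be compared with the gaps in the metrics and with the collector pod logs.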
Best regards.
