Checkmk collector loses metrics on large Kubernetes clusters

schmidax · May 22, 2024, 7:33am

CMK version:
2.2.0p16
OS version:
kubernetes v1.27.12+rke2r1
Error message:
Not really an error, more a phenomenon: Checkmk randomly loses the CPU and RAM usage values of pods.

Solution:
After days of searching, I found that the cluster collector's --cache-maxsize value is set to 10000, which means that at most 10000 metrics are kept. That is not enough for large clusters like mine, so the value must be raised. This was not possible via the Helm chart, so I added an option for it. See pull request: "Add chart option for cluster-collector to set cache-maxsize if a k8s-cluster generate more than 10000 metrics" #27

robin.gierse · June 10, 2024, 7:50pm

Hey @schmidax! Do you want to post your solution as a dedicated comment and mark it as the solution? That way, it is easier for everyone to see it. Thanks!

schmidax · June 11, 2024, 5:45am

Solution:
After days of searching, I found that the cluster collector's --cache-maxsize value is set to 10000, which means that at most 10000 metrics are kept. That is not enough for large clusters like mine, so the value must be raised. This was not possible via the Helm chart, so I added an option for it. See pull request: "Add chart option for cluster-collector to set cache-maxsize if a k8s-cluster generate more than 10000 metrics" #27
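
To illustrate why a size cap makes values disappear, here is a minimal Python sketch of a bounded cache. It is not the collector's actual code: the BoundedCache class, the oldest-first eviction policy, the count_lost helper, and the metric key names are all assumptions made up for this example. It only shows that once more metrics are written than the cap allows, some of them are silently dropped.

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical size-capped cache that evicts the oldest entry when full."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def put(self, key, value):
        # Re-inserting an existing key refreshes its position.
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        # Drop the oldest entries until the cache fits the cap again.
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)

    def __contains__(self, key):
        return key in self._data

    def __len__(self):
        return len(self._data)


def count_lost(num_metrics, maxsize):
    """Write num_metrics distinct metrics and count how many were evicted."""
    cache = BoundedCache(maxsize)
    keys = [f"pod-{i}/cpu_usage" for i in range(num_metrics)]  # made-up key names
    for key in keys:
        cache.put(key, 0.0)
    return sum(1 for key in keys if key not in cache)


if __name__ == "__main__":
    print(count_lost(12000, 10000))  # cap too small for the cluster
    print(count_lost(12000, 20000))  # cap raised above the metric count
```

With a cap of 10000 and 12000 metrics, 2000 entries are missing after one pass. With the cap raised above the metric count, nothing is lost. Which pods are affected in a real cluster depends on the order in which metrics arrive, which would explain why the loss looks random.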

system · June 11, 2025, 5:45am

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.