Kubernetes Cluster Collector not getting the CPU/Memory Percentage Usage metrics

Hi,
I have followed the video tutorial on “Monitoring Kubernetes” and set everything up as shown in the video. The connection is working, however I am not seeing any CPU/Memory metrics.


When I go to Kubernetes Monitoring

And this is the full display

This is my setup inside the DHM

I would appreciate it if someone knows what could be wrong; let me know if you need any more pictures.

Does anyone know what could be wrong?

I have the same issue. I installed the Kubernetes Monitoring Agent into a GCP GKE Autopilot cluster. I am missing the CPU and memory usage metrics for every pod, deployment, and so on. I only see the request and limit values for the pods. Does anyone know a possible solution?

Hey @SvenKt, did you have any success with getting the CPU/Memory metrics?

Maybe @martin.hirschvogel has an idea of what could be wrong.

Hey,

From what I can see here, the Cluster Collector service reports that the cluster collector is properly deployed, and we can also see that Machine Metrics are being delivered. That means it is properly installed.

However, no Container Metrics are being sent, which is why we don’t see any metrics for the containers.
This typically means that cAdvisor, which we use to collect container metrics, is not able to collect anything.

Can you please share with us the output of
kubectl get nodes -o wide
This way, we can see which container runtime is running on the nodes.

I also need to know which Kubernetes distribution and version you are using.
Kubernetes != Kubernetes; the differences between the various K8s flavours and implementations are quite big.

Your problem might be very different, as there are often many different causes. Therefore, please create a thread of your own, or even better, contact Checkmk support.

If any of you needs timely support, please create a ticket. This is a community support forum; I typically only answer if it is related to a new problem which we then need to investigate ourselves.

@SvenKt is working with me. Information on our test-cluster:

kubectl get nodes -o wide
NAME                                           STATUS   ROLES    AGE    VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gk3-test-apc-priv-default-pool-0e9dd36c-zhco   Ready    <none>   6d6h   v1.25.8-gke.1000   10.156.0.8    <none>        Container-Optimized OS from Google   5.15.89+         containerd://1.6.18
gk3-test-apc-priv-default-pool-db515e1b-qwjn   Ready    <none>   6d6h   v1.25.8-gke.1000   10.156.0.9    <none>        Container-Optimized OS from Google   5.15.89+         containerd://1.6.18
gk3-test-apc-priv-pool-1-eab79ec1-ovbx         Ready    <none>   6d4h   v1.25.8-gke.1000   10.156.0.10   <none>        Container-Optimized OS from Google   5.15.89+         containerd://1.6.18

This is an Autopilot cluster in GCP.

But to be clear, @martin.hirschvogel: Checkmk should report usage metrics as well, not only requests and limits?
Also, to narrow it down: is there a differentiation between memory used, memory allocated, memory requested (k8s), and memory limit (k8s)?

How could we debug missing metrics from our side, since, as you pointed out, there are many different setups? This could also lead to wider support by Checkmk if we (the users) could provide better error reports.

Yes, Checkmk should report usage metrics as well, i.e. how much CPU and memory is actually used and how that relates to the requests and limits, e.g. to detect OOM and CPU throttling.

All of the metrics you mentioned should be included.
If not, then there is an issue we would have to look into.
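If you want to cross-check the raw numbers outside of Checkmk, something along these lines works, assuming the metrics-server is available (it is on GKE); the pod and namespace names are placeholders:

kubectl top pod <pod> -n <namespace>        # current CPU/memory usage
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'    # requests/limits per container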

One issue might be how Autopilot creates and restricts Kubernetes clusters, which is apparently different from standard Google Kubernetes Engine.

If the metrics are not there, then look at what the Cluster Collector service says (e.g. no container metrics).
If this is the case, look at the log of the cadvisor container inside the node-collector-container-metrics daemonset.

Whatever is in there helps us understand what the issue is.
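A rough way to get at both, assuming the chart was installed into a namespace called checkmk-monitoring and the resource names follow the chart defaults (adjust both to your release):

kubectl -n checkmk-monitoring get pods
kubectl -n checkmk-monitoring logs deployment/checkmk-cluster-collector
kubectl -n checkmk-monitoring logs daemonset/checkmk-node-collector-container-metrics -c cadvisor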

Hello @martin.hirschvogel, the cluster collector does not output any error messages.
When I check the logs of this pod I just get a bunch of these lines:

IP:Port - "GET /health HTTP/1.1" 200
IP:Port - "POST /update_machine_sections HTTP/1.1" 200
IP:Port - "GET /health HTTP/1.1" 200
IP:Port - "GET /health HTTP/1.1" 200

Regarding the node-collector-container-metrics: there is no such daemonset in our cluster. I installed the Helm chart according to the documentation (“Monitoring Kubernetes”). I just checked the actually deployed resources against the Helm chart’s manifest: the container-metrics daemonset has not been deployed, and "helm list" shows the status "failed" with no further information as to why.
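For reference, this is roughly how I checked (the namespace is simply the one we installed the chart into; adjust to yours):

helm list -n checkmk-monitoring
kubectl -n checkmk-monitoring get deployments,daemonsets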

I ran the deployment of the Helm chart again with --debug to catch the error messages, roughly like this (chart reference, release name and namespace are from our setup):
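helm upgrade --install checkmk-monitoring tribe29/checkmk-kube-agent \
  -n checkmk-monitoring --create-namespace --debug    # repo alias and chart name as added in our setup; may differ for you

That gave the following output: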

Error: 1 error occurred:
* admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-no-write-mode-hostpath]":["hostPath volume var-run in container cadvisor is accessed in write mode; disallowed in Autopilot.","hostPath volume sys used in container cadvisor uses path /sys which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume docker used in container cadvisor uses path /var/lib/docker which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}

helm.go:84: [debug] 1 error occurred:
* admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-no-write-mode-hostpath]":["hostPath volume var-run in container cadvisor is accessed in write mode; disallowed in Autopilot.","hostPath volume sys used in container cadvisor uses path /sys which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume docker used in container cadvisor uses path /var/lib/docker which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}
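To see which hostPath mounts in the chart trigger these violations, rendering the chart locally and grepping works as well (same placeholder chart reference as above):

helm template checkmk-monitoring tribe29/checkmk-kube-agent | grep -B 2 -A 3 hostPath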

Thanks, that explains it. No permissions, no data.

See the similar issue for Elastic: GKE AutoPilot - No Write Mode Hostpath · Issue #1126 · elastic/helm-charts · GitHub
Quick further googling shows that the same issue applies to many other monitoring tools which require certain privileges to access the underlying data.

In my view, GKE Autopilot needs to add an option to allow specific containers to have such privileges. Currently they do not, as you can see in:

Another option would be for Checkmk to become a Google Cloud partner. We can evaluate whether this is possible with feasible effort.


@martin.hirschvogel Monitor GKE Autopilot With Datadog | Datadog suggests that Datadog provides a Helm chart which also collects usage metrics, so there already seems to be a way to do it. Did you take a look at
helm-charts/charts/datadog at main · DataDog/helm-charts · GitHub to figure out how they do it?

Yes. I believe Google allows them to run privileged containers in Autopilot.
Also, not all of their functionality is available there, as eBPF is not possible, which is a pillar of their monitoring.

Interestingly, we are using Google’s own cAdvisor for monitoring containers (GitHub - google/cadvisor: Analyzes resource usage and performance characteristics of running containers), and this is not allowed in GKE Autopilot 🙂

But we will try to go the same route as Datadog and check whether we can get GKE Autopilot to allow our workloads as well.

I could be wrong, but isn’t their approach to use Prometheus? Of course this raises the question of how Prometheus collects the metrics, but Prometheus is also available without a deployment of our own. This could be a data source today and an alternative to kindly asking Google to put us on an allow-list.

I doubt it. They use the DD agent from what it looks like in the Helm chart.

Do you have a Prometheus running in Autopilot? If so, how did you deploy it?
Because if a Prometheus is set up in a certain way (basically identical to how Prometheus is set up in OpenShift by Red Hat), then we can also use it for monitoring Kubernetes clusters.

I’m not sure how they did it; I just saw the requirement of their Helm chart referring to Prometheus Community Kubernetes Helm Charts | helm-charts

As to whether we are already using Prometheus in our cluster: no. But my idea was that you could add it, in a minimally needed configuration, to your setup for monitoring GKE Autopilot as a workaround.
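Just as a sketch of what I mean, using the community chart linked above (untested on Autopilot; the bundled node-exporter also uses hostPath mounts, so it may run into the same admission webhook):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace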

Hi @martin.hirschvogel, I have now tried again, but I still have the same problem of not being able to see CPU/Memory metrics.

Any clues how I might debug my situation further? It seems that wherever I add K8s monitoring via the Node Collector, I am never able to see CPU/Memory metrics.

Hey Ronald, what kind of K8s is this? Please give more information on that.