Kubernetes Cluster Collector not getting the CPU/Memory Percentage Usage metrics

Hi @martin.hirschvogel
Could you give me an example of what kind of information regarding my K8s setup would be useful?

I think I have a similar problem to the person in this topic.

After adding this rule, I still encountered missing metrics when I checked in Monitoring > Kubernetes.

Hey Ronald,
if you want detailed support/troubleshooting, please contact a Checkmk partner or our support.
This is still a community support forum; I have limited time to help here and only do this in my free time, in breaks or after work.

The problem you have has nothing to do with the configuration on Checkmk side. That is fine.
The problem you are encountering is that the container metrics are not collected.

We guarantee that this works for vanilla Kubernetes, Google Kubernetes Engine (not Autopilot, which works differently!), Amazon Elastic Kubernetes Service, Azure Kubernetes Service and OpenShift.
For Rancher, there are some slight changes you need to make, due to a different location of the containerd socket on the nodes in that distro. On VMware Tanzu, it should work off-the-shelf as well.
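To see which containerd socket path applies on a given node, you can probe the common locations directly. This is a sketch, not part of the official setup; the three paths below are the standard one plus the k3s/RKE2 and docker-shim variants discussed in this thread:

```shell
# pick_socket prints the first argument that exists as a Unix socket,
# or "none" if no candidate is present on this node.
pick_socket() {
  for s in "$@"; do
    if [ -S "$s" ]; then
      echo "$s"
      return 0
    fi
  done
  echo "none"
}

# Candidate containerd socket paths (standard, k3s/RKE2, docker shim):
pick_socket /run/containerd/containerd.sock \
            /run/k3s/containerd/containerd.sock \
            /run/docker/containerd/containerd.sock
```

Whichever path it prints is the value you would pass to cAdvisor's `--containerd` flag.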

However, regarding Kubernetes, there is a plethora of versions and distributions and setups, which all behave differently. There is a common misunderstanding that Kubernetes = Kubernetes.

Therefore, as asked before, I need to understand which Kubernetes you are using.
Are you using a managed service, have you set it up yourself from scratch, or are you using an enterprise distro of Kubernetes?
And then, which version are you using? Which container runtime?
How do you find that out?
This screenshot is a good indicator:
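In case the screenshot is not visible: the same information is in the CONTAINER-RUNTIME column of `kubectl get nodes -o wide`, which reports strings like `containerd://1.6.15`. A small sketch of splitting such a string into runtime name and version with plain parameter expansion (the example value is illustrative, not from a live cluster):

```shell
# Example CONTAINER-RUNTIME value as reported by kubectl get nodes -o wide:
runtime="containerd://1.6.15"
name="${runtime%%://*}"      # part before "://" -> runtime name
version="${runtime##*://}"   # part after "://"  -> runtime version
echo "$name $version"
# -> containerd 1.6.15
```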


Your nodes are using RKE2. Which is good, because then we can make it work. I am a bit worried, because it could also be that you are using k3s - let’s hope not.
The solution to your problem can be found here: Kubernetes Cluster collector doesn't show CPU, memory usage or container metrics - #10 by chauhan_sudhir
We have Rancher support on our roadmap for Checkmk 2.3, which includes building in that change.

Please keep in mind that Rancher had a major bug recently, which relabeled all metrics internally, so that any monitoring system couldn’t work with it anymore. Thus, best to have it up-to-date.

Thank you for your response.
I have added "--containerd=/run/docker/containerd/containerd.sock"
This time I got a bit more detailed response.

This is something very different. Where did you add that line?

I have added it here
additionalArgs:
- "--housekeeping_interval=30s"
- "--max_housekeeping_interval=35s"
- "--event_storage_event_limit=default=0"
- "--event_storage_age_limit=default=0"
- "--store_container_labels=false"
- "--whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,io.kubernetes.pod.uid"
- "--global_housekeeping_interval=30s"
- "--event_storage_event_limit=default=0"
- "--event_storage_age_limit=default=0"
- "--disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network"
- "--allow_dynamic_housekeeping=true"
- "--storage_duration=1m0s"
- "--containerd=/run/docker/containerd/containerd.sock"

Could this error actually be related to "Dynamic Host Management"? I turned the connection off for a moment and then back on, and the error seems to have disappeared.

Could very well be. What is the status now? I assume the error is gone and everything is working now?

Hello Hirschvogel,

we also have the issue that we are not receiving any metrics.

Cluster collector version: 1.4.0, Nodes with container collectors: 3/3, Nodes with machine collectors: 3/3, Container Metrics: **No dataCRIT**, Machine Metrics: OK

Using the api on the nodeport (http://10.0.1.1:30035/docs#/), we are receiving the /machine_sections, as expected. /container_metrics returns an empty json array:
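That empty-array response is the telltale sign. A small sketch of how one might turn it into a check (the `check_metrics` helper is hypothetical, not part of Checkmk; the response bodies stand in for what the NodePort endpoints return):

```shell
# check_metrics classifies a /container_metrics response body:
# an empty JSON array (or empty body) means no container metrics were collected.
check_metrics() {
  if [ "$1" = "[]" ] || [ -z "$1" ]; then
    echo "CRIT: no container metrics"
  else
    echo "OK: container metrics present"
  fi
}

# In this thread's case the endpoint returned an empty array:
check_metrics "[]"
# -> CRIT: no container metrics
```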

Our Setup (microk8s):

microk8s version

MicroK8s v1.26.6 revision 5479

kubectl version --short

Client Version: v1.26.6
Kustomize Version: v4.5.7
Server Version: v1.26.6

kubectl get nodes -o wide

NAME           STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
__fqdn__       Ready    <none>   29h   v1.26.6   10.0.3.1      <none>        Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.6.15
__fqdn__       Ready    <none>   29h   v1.26.6   10.0.2.1      <none>        Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.6.15
__fqdn__       Ready    <none>   29h   v1.26.6   10.0.1.1      <none>        Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.6.15

cat /etc/*release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

PS:
additional problems, when using ubuntu 22.04 (cgroups v2):

checkmk-node-collector-container-metrics cannot read smaps files:

kubectl logs -f checkmk-node-collector-container-metrics-6jwms -n monitoring

W0727 13:02:10.103467       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0727 13:02:10.299686       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0727 13:02:11.058789       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
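The smaps warnings appeared on Ubuntu 22.04 with cgroups v2. A quick, hedged way to confirm which cgroup hierarchy a node is actually running (checking the filesystem type of `/sys/fs/cgroup`; `cgroup2fs` indicates the unified v2 hierarchy):

```shell
# Detect cgroups v1 vs v2 on the current host.
# stat -fc %T prints the filesystem type; falls back to "unknown" off-Linux.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo unknown)
if [ "$fstype" = "cgroup2fs" ]; then
  echo "cgroups v2 (unified hierarchy)"
else
  echo "cgroups v1 or hybrid (fstype: $fstype)"
fi
```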

MicroK8s is not supported

We are now a Google Cloud partner and one of the few Autopilot partner workloads: Autopilot partner workloads

Requirements for monitoring GKE Autopilot:

  • Cluster Collector 1.5.1+
  • GKE Autopilot 1.27+ (EDIT: Tested on 1.27)

The following line in the values.yaml needs to be changed to true:

Happy monitoring, and thanks to @morbloe for letting us use them as a reference to convince Google that this needs to be done.


I'm using Rancher, and the above fixed my metrics error. My configuration is slightly different:

kubectl edit -n checkmk-monitoring daemonsets.apps myrelease-checkmk-node-collector-container-metrics

        - mountPath: /var/run
          name: var-run
          readOnly: true # ADDED THIS

      containers:
      - args:
        - --housekeeping_interval=30s
        - --max_housekeeping_interval=35s
        - --event_storage_event_limit=default=0
        - --event_storage_age_limit=default=0
        - --store_container_labels=false
        - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,io.kubernetes.pod.uid
        - --global_housekeeping_interval=30s
        - --event_storage_event_limit=default=0
        - --event_storage_age_limit=default=0
        - --disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network
        - --allow_dynamic_housekeeping=true
        - --storage_duration=1m0s
        - --containerd=/run/k3s/containerd/containerd.sock # ADDED THIS