GCP Autopilot Monitoring - kubelet service critical

CMK version: 2.3.0p29
OS version: Alma Linux 8

Hello! We are using a GCP Kubernetes Autopilot cluster and have the following issue with the Checkmk kubelet service check on all of the cluster nodes:

The details on the service page give me a bit more info:

Not healthy (CRIT)
Verbose response:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodes \"my-node-name\" is forbidden: User \"system:serviceaccount:checkmk-monitoring:checkmk-agent\" cannot get resource \"nodes/proxy\" in API group \"\" at the cluster scope: GKE Warden authz [denied by managed-namespaces-limitation]: cluster scoped resource \"nodes/proxy\" is managed and access is denied","reason":"Forbidden","details":{"name":"my-node-name","kind":"nodes"},"code":403}

Version v1.31.8-gke.1045000
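
The message in the response names GKE Warden rather than a missing RBAC rule, which matters because adding a ClusterRole would not lift a Warden denial. A minimal Python sketch for telling the two apart, assuming the response body is available as plain JSON; the function name `is_warden_denial` and the shortened sample message are illustrative only and not part of Checkmk:

```python
import json


def is_warden_denial(status_body: str) -> bool:
    """Return True if a Kubernetes Status body is a 403 attributed to GKE Warden."""
    status = json.loads(status_body)
    return (
        status.get("kind") == "Status"
        and status.get("code") == 403
        and "GKE Warden authz" in status.get("message", "")
    )


# Shortened version of the response shown above (hypothetical sample data).
sample = json.dumps({
    "kind": "Status",
    "apiVersion": "v1",
    "status": "Failure",
    "message": (
        'nodes "my-node-name" is forbidden: GKE Warden authz '
        "[denied by managed-namespaces-limitation]: cluster scoped resource "
        '"nodes/proxy" is managed and access is denied'
    ),
    "reason": "Forbidden",
    "code": 403,
})

print(is_warden_denial(sample))
```

A 403 without the "GKE Warden authz" marker would more likely point to the service account's own RBAC bindings.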

According to the documentation we need to ensure the following:

  1. Cluster Collector and GKE Autopilot Version:
  • Confirm that you’re using Cluster Collector version 1.5.1 or higher and GKE Autopilot version 1.27 or higher.
  2. Configuration in values.yaml:
  • Ensure that you have set var_run to readOnly in the values.yaml file to satisfy the read-only permission required in GKE Autopilot:
volumeMountPermissions:
  var_run:
    readOnly: true

We have already confirmed that all of this is configured. The Cluster Collector is version 1.7 (the newest Checkmk agent Helm deployment) and Autopilot is version 1.31.8.

Nevertheless, we still experience the issue above. Can you advise how we can proceed to resolve it? Thank you very much in advance!

Best regards
Sven


Hey Sven,
you can ignore this error for now and remove the Kubelet service from monitoring.
GKE Autopilot has changed the rights which applications have for requesting information from nodes.
Given that GKE Autopilot is heavily managed and a kubelet issue will likely be handled very quickly by Google Cloud themselves, we will eventually remove that service for GKE in general. We will also be looking into other options regarding permissions.
Martin

Hi @martin.hirschvogel, we also see errors like this:

{
  "insertId": "",
  "jsonPayload": {
    "pid": "1",
    "message": "Cannot read smaps files for any PID from CONTAINER"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "pod_name": "v16-checkmk-node-collector-container-metrics-tmz2s",
      "container_name": "cadvisor",
      "cluster_name": "",
      "location": "europe-west3",
      "project_id": "",
      "namespace_name": "checkmk-monitoring"
    }
  },
  "timestamp": "2025-06-26T15:01:48.587871743Z",
  "severity": "WARNING",
  "labels": {
    "logging.gke.io/top_level_controller_name": "v16-checkmk-node-collector-container-metrics",
    "k8s-pod/app_kubernetes_io/name": "checkmk",
    "k8s-pod/app": "v16-checkmk-node-collector-container-metrics",
    "k8s-pod/pod-template-generation": "2",
    "k8s-pod/app_kubernetes_io/instance": "v16",
    "compute.googleapis.com/resource_name": "",
    "logging.gke.io/top_level_controller_type": "DaemonSet",
    "k8s-pod/controller-revision-hash": "",
    "k8s-pod/component": "v16-checkmk-node-collector"
  },
  "logName": "projects/logs/stderr",
  "sourceLocation": {
    "file": "handler.go",
    "line": "426"
  },
  "receiveTimestamp": "2025-06-26T15:01:53.572510489Z"
}

Hello, I tried a workaround by adding the following into the values.yaml of the checkmk agent helm deployment to keep the cadvisor from trying to collect smaps.

> nodeCollector:
>   cadvisor:
>     enabled: true
>     additionalArgs:
>       - --disable_metrics=memory

After trying to upgrade the Helm deployment with the new values.yaml file, I got the following error:

Error: UPGRADE FAILED: cannot patch "mydeployment-checkmk-node-collector-container-metrics" with kind DaemonSet: admission webhook "warden-validating.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more constraints.
Violations details: {"[denied by autogke-no-write-mode-hostpath]":["hostPath volume var-run in container cadvisor is accessed in write mode; disallowed in Autopilot.","hostPath volume sys used in container cadvisor uses path /sys which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/].","hostPath volume docker used in container cadvisor uses path /var/lib/docker which is not allowed in Autopilot. Allowed path prefixes for hostPath volumes are: [/var/log/]."]}
Requested by user: 'myuser@email.com', groups: 'system:authenticated'.
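
The individual violations are easier to read once the JSON after "Violations details:" is pulled out of the error text. A small Python sketch, assuming the error keeps the single-line layout shown above; the helper name `warden_violations` is made up for illustration, and the sample error is shortened at the start:

```python
import json

PREFIX = "Violations details: "


def warden_violations(error_text: str) -> dict[str, list[str]]:
    """Extract the constraint -> messages mapping from a Warden rejection."""
    for line in error_text.splitlines():
        if line.startswith(PREFIX):
            return json.loads(line[len(PREFIX):])
    return {}  # no violation details found in the text


# Shortened copy of the Helm error above (first line abbreviated).
sample_error = (
    "Error: UPGRADE FAILED: GKE Warden rejected the request because it "
    "violates one or more constraints.\n"
    + PREFIX
    + json.dumps({
        "[denied by autogke-no-write-mode-hostpath]": [
            "hostPath volume var-run in container cadvisor is accessed in "
            "write mode; disallowed in Autopilot.",
            "hostPath volume sys used in container cadvisor uses path /sys "
            "which is not allowed in Autopilot. Allowed path prefixes for "
            "hostPath volumes are: [/var/log/].",
            "hostPath volume docker used in container cadvisor uses path "
            "/var/lib/docker which is not allowed in Autopilot. Allowed path "
            "prefixes for hostPath volumes are: [/var/log/].",
        ]
    })
    + "\nRequested by user: 'myuser@email.com', groups: 'system:authenticated'."
)

for constraint, messages in warden_violations(sample_error).items():
    print(constraint)
    for message in messages:
        print("  -", message)
```

Listed this way, all three violations concern hostPath volumes of the cadvisor container (var-run, sys and docker) rather than the added argument itself.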

Is there any other way to try and get rid of the error message?

Cannot read smaps files for any PID from CONTAINER

Thank you!
Sven

Hello Martin,
We seem to run into multiple permission issues with GKE Autopilot.
Simple config updates also seem to fail here.

Is this a supported deployment?
What are the options you are looking into?

The PID issue is not an issue. cAdvisor is just too verbose here.

That shouldn’t happen. Can you share the error messages you get? Ideally via a support ticket, then our devs can work directly in there.

GKE Autopilot is a supported deployment, but they recently made some changes to the whitelisting policy. Our tests showed no issues, but it might have been that our cluster still had partner privileges; that is not always clear.


Short update:
a) No more error messages for the PID issue. We don’t need the metric anyway. See: "Switch from disable to enable" by martinhv, Pull Request #30, Checkmk/checkmk_kube_agent on GitHub
b) Support for allowlisted workloads. See: "GKE allowlist synchronizer" by martinhv, Pull Request #33, Checkmk/checkmk_kube_agent on GitHub
c) Configuration will be simplified. See: "Unify GKE Autopilot settings", martinhv/checkmk_kube_agent@8a70673 on GitHub
d) The kubelet service won’t be shown if there are no permissions to gather this info.

Due to a), a new allowlist has to be rolled out soon, which takes up to 7 business days. It is currently in the review process at Google.