Checkmk_kube_agent -> Cannot read smaps files for any PID from CONTAINER

CMK version:

  • 2.1 stable
  • checkmk/cadvisor-patched:main_2022.03.02 and checkmk/cadvisor-patched:main_2022.05.26

The first cadvisor-patched tag is the one from the official GitHub Helm chart; I tried main_2022.05.26 to see whether the bug had already been fixed.

OS version:
Debian Bullseye

Error message:

Logs checkmk-node-collector-container-metrics-XXXXX:

cadvisor W0527 07:45:14.605797       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:16.577284       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:17.412378       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:17.535796       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:18.152667       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:18.233585       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:18.240729       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:18.478526       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:20.215751       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:20.254583       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:20.349195       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
cadvisor W0527 07:45:21.366708       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         
container-metrics-collector INFO:     2022-05-27 07:45:21,821 - Shut down gracefully                                                                                                                                              
cadvisor I0527 07:45:21.822206       1 manager.go:1193] Exiting thread watching subcontainers                                                                                                                                     
cadvisor I0527 07:45:21.822243       1 manager.go:407] Exiting global housekeeping thread                                                                                                                                         
cadvisor I0527 07:45:21.823160       1 cadvisor.go:210] Exiting given signal: terminated     

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

What is your Kubernetes version?
Is it on-premise or managed?
Which container runtime are you using?

Did you set up the Kubernetes monitoring as described here? Monitoring Kubernetes

  • v1.22.7
  • on premise
  • containerd
  • yes, the monitoring works so far, except for the metrics

cadvisor W0527 07:45:20.349195 1 handler.go:426] Cannot read smaps files for any PID from CONTAINER

Can you log in to the cadvisor container and run “cat /proc//smaps”?
I am unable to reproduce the problem. Are you using PSP in your setup? In the cadvisor container we add the capability CAP_SYS_PTRACE and drop all others. With this capability, access to /proc/PID/smaps should be possible.
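
One way to check from inside the container whether that capability is actually in effect is to decode the CapEff mask from /proc/self/status. The following is only a sketch: it assumes a POSIX shell with awk available (BusyBox should do), and the helper name has_cap is made up for illustration. CAP_SYS_PTRACE is capability number 19 in linux/capability.h.

```shell
# CAP_SYS_PTRACE is capability number 19 in linux/capability.h.
CAP_SYS_PTRACE_BIT=19

# has_cap HEXMASK BIT (hypothetical helper): succeeds if BIT is set in HEXMASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# CapEff is the effective capability mask in hex. It is read here by awk,
# a child process that inherits its capabilities from this shell.
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null)

if [ -n "$cap_eff" ] && has_cap "$cap_eff" "$CAP_SYS_PTRACE_BIT"; then
    echo "CAP_SYS_PTRACE is effective"
else
    echo "CAP_SYS_PTRACE is NOT effective (CapEff=${cap_eff:-unknown})"
fi
```

A mask in which only bit 19 is set reads 0000000000080000. Note that this checks the shell you exec into, which is not necessarily identical to what the cadvisor process itself runs with.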

PSP is enabled; I used the default from values.yaml:

/ # cat /proc/smaps
cat: can't open '/proc/smaps': No such file or directory

It works without it, though :wink: Many thanks

It should be /proc/PID/smaps.
PID is a process ID, which is basically a number such as 1, 30, etc.

/ # ls /proc/PID
ls: /proc/PID: No such file or directory

Oh, it still does not work after all. The dashboard UI had changed a little, which is why I thought that was the solution.

My values.yaml:

image:
  # Overrides the image tag whose default is the chart appVersion.
  # ref: https://hub.docker.com/r/checkmk/kubernetes-collector/tags
  tag: "main_2022.03.02" # main_<YYYY.MM.DD>

rbac:
  pspEnabled: false

networkPolicy:
  enabled: false
  allowIngressFromCIDRs: []
  egressKubeApiserver:
    enableCidrLookup: true

## Configuration for cluster-collector
clusterCollector:
  image:
    repository: myregistry.tld/checkmk/kubernetes-collector
    pullPolicy: IfNotPresent

  # can be: "debug", "info", "warning" (default), "critical"
  logLevel: warning

  podAnnotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default

  podSecurityContext: {}
    # fsGroup: 2000

  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    privileged: false
    readOnlyRootFilesystem: true
    runAsGroup: 10001
    runAsNonRoot: true
    runAsUser: 10001

  service:
    # if required specify "NodePort" here to expose the cluster-collector via the "nodePort" specified below
    type: ClusterIP
    port: 8080
    # nodePort: 30035

  ingress:
    enabled: true
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/rewrite-target: /$2
    hosts:
      - host: services.internal
        paths:
          - path: /checkmk(/|$)(.*)
            pathType: Prefix
    tls:
      - secretName: services-secret
        hosts:
          - services.internal

  livenessProbe:
    initialDelaySeconds: 3
    periodSeconds: 10
    timeoutSeconds: 2
    failureThreshold: 3

  resources:
    limits:
      cpu: 300m
      memory: 200Mi
    requests:
      cpu: 150m
      memory: 200Mi

  nodeSelector: {}

  tolerations: []

  affinity: {}


## Configuration for node-collector components (cadvisor, container-metrics, machine-sections)
nodeCollector:
  # logLevel for container-metrics and machine-sections; can be: "debug", "info", "warning" (default), "critical"
  logLevel: debug

  # Pods of nodeCollectors will typically be ready for a short amount of time before detecting
  # problems. The value below ensures, that they don't become available as well.
  minReadySeconds: 15

  # Annotations to be added to node-collector pods
  podAnnotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default

  podSecurityContext: {}
    # fsGroup: 2000

  ## Assign a nodeSelector if operating a hybrid cluster
  ##
  nodeSelector: {}
  #   beta.kubernetes.io/arch: amd64
  #   beta.kubernetes.io/os: linux

  tolerations: []
  # - effect: NoSchedule
  #   operator: Exists

  ## Assign a PriorityClassName to pods if set
  # priorityClassName: ""

  cadvisor:
    image:
      repository: myregistry.tld/checkmk/cadvisor-patched
      pullPolicy: IfNotPresent

    additionalArgs:
      - "--housekeeping_interval=30s"
      - "--max_housekeeping_interval=35s"
      - "--event_storage_event_limit=default=0"
      - "--event_storage_age_limit=default=0"
      - "--store_container_labels=false"
      - "--whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,io.kubernetes.pod.uid"
      - "--global_housekeeping_interval=30s"
      - "--event_storage_event_limit=default=0"
      - "--event_storage_age_limit=default=0"
      - "--disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network"
      - "--allow_dynamic_housekeeping=true"
      - "--storage_duration=1m0s"

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
        add: ["CAP_SYS_PTRACE"]
      privileged: false
      readOnlyRootFilesystem: true

    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 150m
        memory: 200Mi

  containerMetricsCollector:
    image:
      repository: myregistry.tld/checkmk/kubernetes-collector
      pullPolicy: IfNotPresent

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 10001
      runAsNonRoot: true
      runAsUser: 10001

    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 150m
        memory: 200Mi

  machineSectionsCollector:
    image:
      repository: myregistry.tld/checkmk/kubernetes-collector
      pullPolicy: IfNotPresent

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      privileged: false
      readOnlyRootFilesystem: true
      runAsGroup: 10001
      runAsNonRoot: true
      runAsUser: 10001

    resources:
      limits:
        cpu: 300m
        memory: 200Mi
      requests:
        cpu: 150m
        memory: 200Mi

Even if I remove the securityContext: from the cAdvisor container in the DaemonSet, I do not get access to /proc/PID/ in the shell

/ # ls /proc/
1              buddyinfo      cpuinfo        driver         fs             kallsyms       kpagecgroup    meminfo        net            schedstat      swaps          timer_list     vmstat
22             bus            crypto         dynamic_debug  interrupts     kcore          kpagecount     misc           pagetypeinfo   self           sys            tty            zoneinfo
34             cgroups        devices        execdomains    iomem          key-users      kpageflags     modules        partitions     slabinfo       sysrq-trigger  uptime
acpi           cmdline        diskstats      fb             ioports        keys           loadavg        mounts         pressure       softirqs       sysvipc        version
asound         consoles       dma            filesystems    irq            kmsg           locks          mtrr           sched_debug    stat           thread-self    vmallocinfo
/ # find /proc -name "PID*"
/ # 

The process IDs are listed here. Look at the numbers 1, 22 and 34. You should now be able to run cat /proc/22/smaps?
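
Instead of trying the PIDs one at a time, a small loop can try every numeric entry under /proc. This is a rough sketch, assuming a POSIX shell with head available (BusyBox ash should be fine). It actually reads a line rather than only testing the file mode, because the permission check for smaps happens when the file is opened and read.

```shell
# Count how many numeric /proc entries expose a readable smaps file.
readable=0
unreadable=0
for dir in /proc/[0-9]*; do
    # Skip the unexpanded glob if /proc has no numeric entries.
    [ -d "$dir" ] || continue
    pid=${dir#/proc/}
    # Reading one line forces a real open and read of the file.
    if head -n 1 "$dir/smaps" >/dev/null 2>&1; then
        readable=$((readable + 1))
    else
        unreadable=$((unreadable + 1))
        echo "cannot read smaps for PID $pid"
    fi
done
echo "readable: $readable, unreadable: $unreadable"
```

Processes that exit while the loop is running will be counted as unreadable, so a small non-zero count is not necessarily a permission problem.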

The container still logs:

cadvisor W0528 08:52:06.005327       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER                                                                                                                         

However, cat /proc/30/smaps does work:

/ # cat /proc/30/smaps
556b0b437000-556b0b443000 r--p 00000000 08:01 2891404                    /bin/busybox
Size:                 48 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  48 kB
Pss:                  24 kB
Shared_Clean:         48 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:           48 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd mr mw me dw sd 
...

Switching to cadvisor-patched:101 (Docker Hub) did not help either.

CheckMK version: 2.1
OS: Debian 11
Kubernetes/K3s version: v1.23.8+k3s2
CheckMK_kube_agent: v1.1.0

I'm able to read the smaps file when I enter the pod/container, but I receive this error in the logs:

W0929 11:54:49.980403       1 manager.go:159] Cannot detect current cgroup on cgroup v2
W0929 11:54:55.056123       1 machine_libipmctl.go:64] There are no NVM devices!
W0929 11:54:55.059217       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0929 11:54:55.061511       1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
W0929 11:54:55.064681       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0929 11:54:55.066200       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0929 11:54:55.066738       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0929 11:54:55.067114       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0929 11:54:55.068076       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER
W0929 11:54:55.068325       1 handler.go:426] Cannot read smaps files for any PID from CONTAINER

This seems to be an upstream bug: Log flooded by "Cannot read smaps files for any PID from CONTAINER" · Issue #3139 · google/cadvisor · GitHub

We have the same problem with rke2. Are there plans to install the new version in the Dockerfile?

For now, I have worked around it by adding --disable_metrics=referenced_memory to nodeCollector.cadvisor.additionalArgs, as suggested in the cadvisor issue.
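
For anyone whose additionalArgs already contains a --disable_metrics entry (as in the values.yaml posted earlier in this thread), the workaround might look like the sketch below. It assumes that extending the existing comma-separated list is preferable to passing the flag a second time; the thread does not say which form was actually used.

```shell
# Existing flag value, copied from the values.yaml earlier in this thread.
existing='--disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network'

# Append referenced_memory to the same comma-separated list (assumption:
# one merged flag rather than a second --disable_metrics entry).
merged="${existing},referenced_memory"

# Print the line in the form used under nodeCollector.cadvisor.additionalArgs.
printf '      - "%s"\n' "$merged"
```

The printed line would replace the existing --disable_metrics entry in additionalArgs; the node-collector pods then need to be redeployed for the change to take effect.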