Kubernetes Cluster collector doesn't show CPU, memory usage or container metrics

Hello everyone,

I added one of our K8s clusters to Checkmk. Since then, the dashboards are only partially filled and the container metrics are missing. Maybe it has something to do with the patched checkmk-cadvisor version?
Any ideas on how to get container metrics collection working?


CMK version: 2.1.0p28.cee
OS version: Red Hat Enterprise Linux release 9.2

Error message: None. Empty values for container metrics

Cluster collector:
Cluster collector version: 1.4.0, Nodes with container collectors: 16/16, Nodes with machine collectors: 16/16, Container Metrics: No data, Machine Metrics: OK

Kubernetes API:
Live, Ready

Nodes:
Worker nodes 18/18, No control plane nodes found

Pod resources:
Running: 207, Pending: 0, Succeeded: 12, Failed: 0, Unknown: 0, Allocatable: 1980

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

Checkmk version 2.1.0p28
Try license usage history update.
Trying to acquire lock on /omd/sites/sitename/var/check_mk/license_usage/next_run
Got lock on /omd/sites/sitename/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/sitename/var/check_mk/license_usage/history.json
Got lock on /omd/sites/sitename/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/sitename/var/check_mk/license_usage/history.json
Released lock on /omd/sites/sitename/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/sitename/var/check_mk/license_usage/next_run
Released lock on /omd/sites/sitename/var/check_mk/license_usage/next_run
+ FETCHING DATA
  Source: SourceType.HOST/FetcherType.PROGRAM
[cpu_tracking] Start [7f0aa0fba070]
[ProgramFetcher] Fetch with cache settings: DefaultAgentFileCache(k8s-he-services-qa, base_path=/omd/sites/sitename/tmp/check_mk/data_source_cache/special_kube, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
Not using cache (Too old. Age is 43 sec, allowed is 0 sec)
[ProgramFetcher] Execute data source
Calling: /omd/sites/sitename/share/check_mk/agents/special/agent_kube --pwstore=4@0@checkmk-kube-agent-k8s-he-services-qa-token '--cluster' 'k8s-he-services-qa' '--token' '****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'namespaces' 'nodes' 'pods' '--cluster-aggregation-exclude-node-roles' 'control-plane' 'infra' '--api-server-endpoint' 'https://k8s-api-fqdn' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://checkmk-kube-agent-service-fqdn' '--cluster-collector-proxy' 'FROM_ENVIRONMENT'
Write data to cache file /omd/sites/sitename/tmp/check_mk/data_source_cache/special_kube/k8s-he-services-qa
Trying to acquire lock on /omd/sites/sitename/tmp/check_mk/data_source_cache/special_kube/k8s-he-services-qa
Got lock on /omd/sites/sitename/tmp/check_mk/data_source_cache/special_kube/k8s-he-services-qa
Releasing lock on /omd/sites/sitename/tmp/check_mk/data_source_cache/special_kube/k8s-he-services-qa
Released lock on /omd/sites/sitename/tmp/check_mk/data_source_cache/special_kube/k8s-he-services-qa
[cpu_tracking] Stop [7f0aa0fba070 - Snapshot(process=posix.times_result(user=0.020000000000000018, system=0.01999999999999999, children_user=2.23, children_system=0.21, elapsed=2.9500000001862645))]
  Source: SourceType.HOST/FetcherType.PIGGYBACK
[cpu_tracking] Start [7f0aa0fba370]
[PiggybackFetcher] Fetch with cache settings: NoCache(k8s-he-services-qa, base_path=/omd/sites/sitename/tmp/check_mk/data_source_cache/piggyback, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=True, use_outdated=False, simulation=False)
Not using cache (Cache usage disabled)
[PiggybackFetcher] Execute data source
No piggyback files for 'k8s-he-services-qa'. Skip processing.
Not using cache (Cache usage disabled)
[cpu_tracking] Stop [7f0aa0fba370 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
+ PARSE FETCHER RESULTS
  Source: SourceType.HOST/FetcherType.PROGRAM
<<<kube_pod_resources_v1:sep(0)>>> / Transition NOOPParser -> HostSectionParser
<<<kube_allocatable_pods_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_node_count_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_cluster_details_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_memory_resources_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_cpu_resources_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_allocatable_memory_resource_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_allocatable_cpu_resource_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_cluster_info_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
<<<kube_collector_daemons_v1:sep(0)>>> / Transition HostSectionParser -> HostSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') / Transition HostSectionParser -> PiggybackParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_node_container_count_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_node_kubelet_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_pod_resources_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_allocatable_pods_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_node_info_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_cpu_resources_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_memory_resources_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_allocatable_cpu_resource_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_allocatable_memory_resource_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_node_conditions_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
PiggybackMarker(hostname='node_k8s-he-services-qa_k8s-he-qa-121-f59ae0-master-0') SectionMarker(name=SectionName('kube_node_custom_conditions_v1'), cached=None, encoding='utf-8', nostrip=False, persist=None, separator='\x00') / Transition PiggybackSectionParser -> PiggybackSectionParser
Transition PiggybackSectionParser -> NOOPParser
[...]
Received piggyback data for 291 hosts
[cpu_tracking] Start [7f0aa0e1e940]
value store: synchronizing
Trying to acquire lock on /omd/sites/sitename/tmp/check_mk/counters/k8s-he-services-qa
Got lock on /omd/sites/sitename/tmp/check_mk/counters/k8s-he-services-qa
value store: loading from disk
Releasing lock on /omd/sites/sitename/tmp/check_mk/counters/k8s-he-services-qa
Released lock on /omd/sites/sitename/tmp/check_mk/counters/k8s-he-services-qa
CPU resources        Requests: 85.674 (242/302 containers with requests), Limits: 140.200 (219/302 containers with limits), Allocatable: 136.000
Cluster collector    Cluster collector version: 1.4.0, Nodes with container collectors: 16/16, Nodes with machine collectors: 16/16, Container Metrics: No data, Machine Metrics: OK
Info                 Name: k8s-he-services-qa
Kubernetes API       Live, Ready
Memory resources     Requests: 160 GiB (226/302 containers with requests), Limits: 240 GiB (225/302 containers with limits), Allocatable: 248 GiB
Nodes                Worker nodes 18/18, No control plane nodes found
Pod resources        Running: 207, Pending: 0, Succeeded: 12, Failed: 0, Unknown: 0, Allocatable: 1980
No piggyback files for 'k8s-he-services-qa'. Skip processing.
[cpu_tracking] Stop [7f0aa0e1e940 - Snapshot(process=posix.times_result(user=0.010000000000000009, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.009999999776482582))]
[special_kube] Success, execution time 3.0 sec | execution_time=2.960 user_time=0.030 system_time=0.020 children_user_time=2.230 children_system_time=0.210 cmk_time_ds=0.470 cmk_time_agent=0.000

Trying the API with curl:

/machine_sections works:

# curl -k -H "Authorization: Bearer TOKEN" https://checkmk-kube-agent-service-fqdn/machine_sections | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  183k  100  183k    0     0  3747k      0 --:--:-- --:--:-- --:--:-- 3825k
[
  {
    "node_name": "k8s-he-qa-121-f59ae0-worker-1",
    "sections": "<<<check_mk>>>\nVersion: 2.1.0-latest\nAgentOS: kube 1.4.0\n<<<kernel>>>
	[...]

/container_metrics is empty

#   curl -k -H "Authorization: Bearer TOKEN" https://checkmk-kube-agent-service-fqdn/container_metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0     50      0 --:--:-- --:--:-- --:--:--    51
[]
#
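
To rule out the collector API layer, you can also query one of the cAdvisor pods directly; if it returns no container data either, the problem sits on the node rather than in the cluster collector. A rough sketch (the pod name is only an example, pick any container-metrics pod from kubectl get pods -n checkmk; 18080 is an arbitrary local port, 8080 is the cAdvisor port from the daemonset spec):

kubectl -n checkmk port-forward pod/checkmk-kube-agent-node-collector-container-metrics-2v55r 18080:8080
# in a second shell: cAdvisor exposes Prometheus metrics under /metrics
curl -s http://localhost:18080/metrics | grep container_cpu_usage_seconds_total | head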

checkmk-kube-agent-node-collector-container-metrics-5m58n.log

E0602 06:05:42.452838       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-pod5ba9ec62_91c2_4528_afdb_bded7ccecd46.slice/docker-458e23e52f451decddce451635ec2ee18f95c69798964fec563b070fd8f9f455.scope: failed to identify the read-write layer ID for container "458e23e52f451decddce451635ec2ee18f95c69798964fec563b070fd8f9f455". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/458e23e52f451decddce451635ec2ee18f95c69798964fec563b070fd8f9f455/mount-id: no such file or directory
E0602 06:05:42.453407       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-pod3cf8dbf0_9880_486e_9e8d_345835721445.slice/docker-c70ea1354689fc99fc795c42635a8abf9bbf18d9d8cd6ee5f53228a00d789cdd.scope: failed to identify the read-write layer ID for container "c70ea1354689fc99fc795c42635a8abf9bbf18d9d8cd6ee5f53228a00d789cdd". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/c70ea1354689fc99fc795c42635a8abf9bbf18d9d8cd6ee5f53228a00d789cdd/mount-id: no such file or directory
E0602 06:05:42.453843       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod07aaf76f_7a20_4d33_b158_db44763507de.slice/docker-90980f0cdc979cd022aad8ff973c31727ba9572159d5ffd0981571ca2c95fd31.scope: failed to identify the read-write layer ID for container "90980f0cdc979cd022aad8ff973c31727ba9572159d5ffd0981571ca2c95fd31". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/90980f0cdc979cd022aad8ff973c31727ba9572159d5ffd0981571ca2c95fd31/mount-id: no such file or directory
E0602 06:05:42.454557       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod979b499a_3432_4065_bc40_04203ef4c43d.slice/docker-c61135ef5bf7d338e90543fc1afeb1bcaf18763e3d0c1a3b499a4465335287a0.scope: failed to identify the read-write layer ID for container "c61135ef5bf7d338e90543fc1afeb1bcaf18763e3d0c1a3b499a4465335287a0". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/c61135ef5bf7d338e90543fc1afeb1bcaf18763e3d0c1a3b499a4465335287a0/mount-id: no such file or directory
E0602 06:05:42.455196       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf058ff8e_bf9d_454c_8ecd_9ac293cdaeff.slice/docker-cbc2f8d3d762ffd5a5a4d9d3c04fff04c840aca48b49c3c27e0f8b1f940181b4.scope: failed to identify the read-write layer ID for container "cbc2f8d3d762ffd5a5a4d9d3c04fff04c840aca48b49c3c27e0f8b1f940181b4". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/cbc2f8d3d762ffd5a5a4d9d3c04fff04c840aca48b49c3c27e0f8b1f940181b4/mount-id: no such file or directory
E0602 06:05:42.455676       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod979b499a_3432_4065_bc40_04203ef4c43d.slice/docker-95b2bdd1ce3ed529734aa9c065e6d256b67b9927883d4a429bdb0e0f99dd6c10.scope: failed to identify the read-write layer ID for container "95b2bdd1ce3ed529734aa9c065e6d256b67b9927883d4a429bdb0e0f99dd6c10". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/95b2bdd1ce3ed529734aa9c065e6d256b67b9927883d4a429bdb0e0f99dd6c10/mount-id: no such file or directory
E0602 06:05:42.456147       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda6f531b1_f84f_49f3_bb55_0c6bfc285a23.slice/docker-c6c6e37689a4fcf59ef66c26246913c34ee60953da8ae2d044a3d8f5cf56b6ef.scope: failed to identify the read-write layer ID for container "c6c6e37689a4fcf59ef66c26246913c34ee60953da8ae2d044a3d8f5cf56b6ef". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/c6c6e37689a4fcf59ef66c26246913c34ee60953da8ae2d044a3d8f5cf56b6ef/mount-id: no such file or directory
E0602 06:05:42.456811       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc9643256_7b80_42e9_beb9_c22d6f63f6f9.slice/docker-9a1f906d92cdbdbc1d7a0b09318a8eb4bf51b24c541b2fb91504e65a373cb448.scope: failed to identify the read-write layer ID for container "9a1f906d92cdbdbc1d7a0b09318a8eb4bf51b24c541b2fb91504e65a373cb448". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/9a1f906d92cdbdbc1d7a0b09318a8eb4bf51b24c541b2fb91504e65a373cb448/mount-id: no such file or directory
E0602 06:05:42.457417       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod07aaf76f_7a20_4d33_b158_db44763507de.slice/docker-aedfafdd30874682f847af47b1429ad897775b6fe25d797ccd47ba19d341762e.scope: failed to identify the read-write layer ID for container "aedfafdd30874682f847af47b1429ad897775b6fe25d797ccd47ba19d341762e". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/aedfafdd30874682f847af47b1429ad897775b6fe25d797ccd47ba19d341762e/mount-id: no such file or directory
E0602 06:05:42.458164       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf058ff8e_bf9d_454c_8ecd_9ac293cdaeff.slice/docker-8ecf305f38c24595f51788abb51166a3f81a24de35eeeaee76c0259bec801eb0.scope: failed to identify the read-write layer ID for container "8ecf305f38c24595f51788abb51166a3f81a24de35eeeaee76c0259bec801eb0". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/8ecf305f38c24595f51788abb51166a3f81a24de35eeeaee76c0259bec801eb0/mount-id: no such file or directory
E0602 06:05:42.458744       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod368af66c_749d_4d90_a34b_7ea5b0d9ef9f.slice/docker-affd6d5994d109898201b6566a797f165c87b7e1b79d1e89d32a623115878a25.scope: failed to identify the read-write layer ID for container "affd6d5994d109898201b6566a797f165c87b7e1b79d1e89d32a623115878a25". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/affd6d5994d109898201b6566a797f165c87b7e1b79d1e89d32a623115878a25/mount-id: no such file or directory
E0602 06:05:42.465280       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod368af66c_749d_4d90_a34b_7ea5b0d9ef9f.slice/docker-4e1a042075120ff0906cb3d8ff687ef5593f3643c3afede5d90b58c8c575ee7b.scope: failed to identify the read-write layer ID for container "4e1a042075120ff0906cb3d8ff687ef5593f3643c3afede5d90b58c8c575ee7b". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/4e1a042075120ff0906cb3d8ff687ef5593f3643c3afede5d90b58c8c575ee7b/mount-id: no such file or directory
E0602 06:05:42.465943       1 manager.go:1123] Failed to create existing container: /kubepods.slice/kubepods-podfd41f077_c341_467d_be96_c4fd846a2fb0.slice/docker-24f35c3b9784947bfedf5c5aca6666deecaeb547701453b093d9583ecb492381.scope: failed to identify the read-write layer ID for container "24f35c3b9784947bfedf5c5aca6666deecaeb547701453b093d9583ecb492381". - open /var/nutanix/docker/image/overlay2/layerdb/mounts/24f35c3b9784947bfedf5c5aca6666deecaeb547701453b093d9583ecb492381/mount-id: no such file or directory
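
The errors above hint at Docker using the non-default data root /var/nutanix/docker: cAdvisor tries to read the mount-id files from the layerdb there but cannot open them. A quick sanity check on an affected node (assuming the Docker CLI is available there) is to confirm the data root and whether those files actually exist on the host; if they do, the hostPath mounts of the cAdvisor container are a likely suspect, since a stock cAdvisor daemonset typically only mounts the default /var/lib/docker path:

# confirm Docker's data root on the node
docker info --format '{{.DockerRootDir}}'
# check whether the layerdb entries cAdvisor is looking for exist on the host
ls /var/nutanix/docker/image/overlay2/layerdb/mounts/ | head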

Have you configured “Dynamic host management” to automatically create hosts for namespaces, daemonsets, etc.?

Yes. Currently, 291 hosts have been created and their services are discovered.

Is this a managed or an on-prem Kubernetes cluster?
Also, which version?
Please also share:

kubectl get all -n

Hello

This is an on-prem K8s cluster built with Nutanix.

kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-04-13T19:57:43Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12", GitCommit:"696a9fdd2a58340e61e0d815c5769d266fca0802", GitTreeState:"clean", BuildDate:"2022-04-13T19:01:10Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
kubectl get all --namespace checkmk
NAME                                                            READY   STATUS    RESTARTS   AGE
pod/checkmk-kube-agent-cluster-collector-66d9875b48-zsxz7       1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-container-metrics-2v55r   2/2     Running   0          29m
pod/checkmk-kube-agent-node-collector-container-metrics-4dlbw   2/2     Running   0          32m
pod/checkmk-kube-agent-node-collector-container-metrics-5dfmt   2/2     Running   0          38m
pod/checkmk-kube-agent-node-collector-container-metrics-65vmx   2/2     Running   0          28m
pod/checkmk-kube-agent-node-collector-container-metrics-7zbbc   2/2     Running   0          37m
pod/checkmk-kube-agent-node-collector-container-metrics-8mrvm   2/2     Running   0          30m
pod/checkmk-kube-agent-node-collector-container-metrics-b6pnx   2/2     Running   0          25m
pod/checkmk-kube-agent-node-collector-container-metrics-db22l   2/2     Running   0          33m
pod/checkmk-kube-agent-node-collector-container-metrics-dx6h2   2/2     Running   0          26m
pod/checkmk-kube-agent-node-collector-container-metrics-nt6bz   2/2     Running   0          38m
pod/checkmk-kube-agent-node-collector-container-metrics-pknm5   2/2     Running   0          31m
pod/checkmk-kube-agent-node-collector-container-metrics-qjq7s   2/2     Running   0          35m
pod/checkmk-kube-agent-node-collector-container-metrics-s7gwb   2/2     Running   0          31m
pod/checkmk-kube-agent-node-collector-container-metrics-thnps   2/2     Running   0          36m
pod/checkmk-kube-agent-node-collector-container-metrics-ww2zs   2/2     Running   0          34m
pod/checkmk-kube-agent-node-collector-container-metrics-zdj6s   2/2     Running   0          27m
pod/checkmk-kube-agent-node-collector-machine-sections-2l9bq    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-6742d    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-cmwjh    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-ft2x7    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-gm5rv    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-gxjfv    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-hvxgt    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-hwpqz    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-n9vl4    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-ndnth    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-pvsb6    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-qm2bb    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-rz7qm    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-sr7d7    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-wcg86    1/1     Running   0          4d7h
pod/checkmk-kube-agent-node-collector-machine-sections-zzrbj    1/1     Running   0          4d7h

NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/checkmk-kube-agent-cluster-collector   ClusterIP   172.19.96.114   <none>        8080/TCP   4d7h

NAME                                                                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/checkmk-kube-agent-node-collector-container-metrics   16        16        16      16           16          <none>          4d7h
daemonset.apps/checkmk-kube-agent-node-collector-machine-sections    16        16        16      16           16          <none>          4d7h

NAME                                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/checkmk-kube-agent-cluster-collector   1/1     1            1           4d7h

NAME                                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/checkmk-kube-agent-cluster-collector-66d9875b48   1         1         1       4d7h

What is your container runtime, and under which socket path is it available?

According to the manufacturer Nutanix, the current environment (K8s 1.21) uses Docker; as of 1.22 it is containerd.

Can you share the output of the following?

kubectl get nodes -o wide

Also, please log in to one of the nodes and share the output of the following:

find /var/run |grep containerd.sock
find /run |grep containerd.sock

PS C:\Users\username> kubectl get nodes -o wide
NAME                                       STATUS   ROLES    AGE    VERSION    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
k8s-he-auth-sec-prod-121-6f4b0d-master-0   Ready    master   279d   v1.21.12   10.50.251.174   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
k8s-he-auth-sec-prod-121-6f4b0d-master-1   Ready    master   279d   v1.21.12   10.50.251.175   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
k8s-he-auth-sec-prod-121-6f4b0d-worker-0   Ready    node     279d   v1.21.12   10.50.251.171   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
k8s-he-auth-sec-prod-121-6f4b0d-worker-1   Ready    node     279d   v1.21.12   10.50.251.220   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
k8s-he-auth-sec-prod-121-6f4b0d-worker-2   Ready    node     279d   v1.21.12   10.50.251.213   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
k8s-he-auth-sec-prod-121-6f4b0d-worker-3   Ready    node     279d   v1.21.12   10.50.251.187   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
k8s-he-auth-sec-prod-121-6f4b0d-worker-4   Ready    node     279d   v1.21.12   10.50.251.224   <none>        CentOS Linux 7 (Core)   3.10.0-1160.71.1.el7.x86_64   docker://20.10.13
[root@k8s-he-qa-121-f59ae0-worker-0 ~]# find /run |grep containerd.sock
/run/docker/containerd/containerd.sock
/run/docker/containerd/containerd.sock.ttrpc
[root@k8s-he-qa-121-f59ae0-worker-0 ~]# find /var/run |grep containerd.sock
[root@k8s-he-qa-121-f59ae0-worker-0 ~]#

Can you modify the daemonset checkmk-node-collector-container-metrics and add the following:

nodeCollector:
  cadvisor:
    additionalArgs:
      - "--containerd=/run/docker/containerd/containerd.sock"
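
If that snippet goes into your Helm values, a rough kubectl alternative for a quick test would be to patch the argument straight into the daemonset (sketch only; it assumes cadvisor is the first container in the pod template, so verify the index first with kubectl -n checkmk get ds checkmk-kube-agent-node-collector-container-metrics -o yaml):

kubectl -n checkmk patch daemonset checkmk-kube-agent-node-collector-container-metrics \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--containerd=/run/docker/containerd/containerd.sock"}]'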

You can also include the control plane nodes in the calculation by using the following in the rule:
Cluster resource aggregation → Include all nodes

I’ve added that. The pods were redeployed after the change, but it makes no difference. No data available :frowning:

[...]
      containers:
        - name: cadvisor
          image: checkmk/cadvisor-patched:1.4.0
          command:
            - /usr/bin/cadvisor
          args:
            - '--housekeeping_interval=30s'
            - '--max_housekeeping_interval=35s'
            - '--event_storage_event_limit=default=0'
            - '--event_storage_age_limit=default=0'
            - '--store_container_labels=false'
            - '--whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,io.kubernetes.pod.uid'
            - '--global_housekeeping_interval=30s'
            - '--event_storage_event_limit=default=0'
            - '--event_storage_age_limit=default=0'
            - '--disable_metrics=percpu,process,sched,tcp,udp,diskIO,disk,network'
            - '--allow_dynamic_housekeeping=true'
            - '--storage_duration=1m0s'
            - '--containerd=/run/docker/containerd/containerd.sock'
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
[...]

For further testing I added a second cluster to Checkmk. It runs v1.24.7 and shows the values in the dashboard right away, without adding the containerd.sock path to the daemonset:

What does

kubectl get nodes -o wide

show with the new cluster?

1.21 was released around April 2021, I think, and has been EOL since roughly April 2022. Kubernetes changes quite a lot with each version, so many things have to be adapted on our side all the time.
Thus, we make sure that the Kubernetes monitoring works at least for the latest three K8s versions.

Considering that 1.21 is so far beyond EOL, can’t you just update your K8s clusters? :slight_smile:

kubectl get nodes -o wide
NAME                                  STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
k8s-he-jfrog-qa-124-3db49b-master-0   Ready    master   98d   v1.24.7   10.50.251.108   <none>        CentOS Linux 7 (Core)   3.10.0-1160.81.1.el7.x86_64   containerd://1.6.15
k8s-he-jfrog-qa-124-3db49b-master-1   Ready    master   98d   v1.24.7   10.50.251.110   <none>        CentOS Linux 7 (Core)   3.10.0-1160.81.1.el7.x86_64   containerd://1.6.15
k8s-he-jfrog-qa-124-3db49b-worker-0   Ready    node     98d   v1.24.7   10.50.251.113   <none>        CentOS Linux 7 (Core)   3.10.0-1160.81.1.el7.x86_64   containerd://1.6.15
k8s-he-jfrog-qa-124-3db49b-worker-1   Ready    node     98d   v1.24.7   10.50.251.141   <none>        CentOS Linux 7 (Core)   3.10.0-1160.81.1.el7.x86_64   containerd://1.6.15
k8s-he-jfrog-qa-124-3db49b-worker-2   Ready    node     98d   v1.24.7   10.50.251.112   <none>        CentOS Linux 7 (Core)   3.10.0-1160.81.1.el7.x86_64   containerd://1.6.15
k8s-he-jfrog-qa-124-3db49b-worker-3   Ready    node     98d   v1.24.7   10.50.251.114   <none>        CentOS Linux 7 (Core)   3.10.0-1160.81.1.el7.x86_64   containerd://1.6.15

That is okay for me. A hint in the docs would be nice, e.g. stating that 1.22 (or whichever it is) is the oldest supported version.

Not immediately. It is planned for the coming months.

But even with the 1.24 cluster, not every dashlet is filled with data :frowning:

Ah, I thought we added a note to the docs already. Will forward it to the team. Thanks for the hint.
Typically we also announce deprecations and supported versions via our werks (KUBE: Supported versions of Kubernetes are 1.22, 1.23, 1.24, 1.25 and 1.26).

The only dashlet missing data is the cluster problems one. It relies on data from the hardware/software inventory. Either the inventory has not been triggered yet (it runs every 4 h), or someone deactivated it in your environment. By default it runs for hosts with the label kubernetes:yes, which is added automatically; label discovery seems to have worked in your case, because if it were broken, your entire dashboard would be empty, since a lot of it relies on those labels.
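
If the rule is in place, you can also trigger the HW/SW inventory for the cluster host manually to verify it (a sketch, run as the site user; the host name is the one from the output above, and --inventory is, as far as I know, the standard CLI switch for a HW/SW inventory run):

cmk -v --inventory k8s-he-services-qa
# afterwards the cluster problems dashlet should find inventory data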


That was the final hint. I set up a new site for K8s monitoring and did not explicitly disable the HW/SW inventory. After adding a rule to run the HW/SW inventory, the data is collected and shown in the dashboard! Maybe that should also be mentioned in the docs?!

Thank you all for the support!

The first rule which you show is the auto-created rule. That should have been sufficient. Was it not?

It wasn’t there. I added it after your post.

OK, that’s indeed weird, and a good point that we should add something about the requirements (e.g. that these rules need to exist for it to work). I mean, they should be there by default, but it’s good to know for troubleshooting.