Monitoring Kubernetes clusters - Cannot get RRD data

alexmoo · March 8, 2024, 9:25am

CMK version: 2.2.0p23
OS version: Ubuntu 22.04

**Error message:** Cannot create graph

Cannot get RRD data for PROD-EKS/CPU resources/kube_cpu_request

We started monitoring two Kubernetes clusters (AWS EKS v1.27); they are configured the same way.

On the first cluster the Kubernetes dashboard displays correctly.
On the second one the Kubernetes dashboard is missing resources (CPU resources: Cluster, MEM resources: Cluster, POD resources: Cluster).

Any idea?

Norm · March 9, 2024, 2:26pm

Hi @alexmoo,

please follow this guide to see if all the necessary metric collectors are running.

https://checkmk.atlassian.net/wiki/spaces/KB/pages/9470456/How-to+collect+troubleshooting+data+for+Kubernetes#How-tocollecttroubleshootingdataforKubernetes-DebugKubernetesClusterComponents

If that’s the case, you should look a bit deeper to find the issues. Check if the metrics arrive at the Checkmk special agent. If they are missing, you should take a look at the metric creation.

Best Regards
Norm

martin.hirschvogel · March 10, 2024, 9:31am

Hey @alexmoo

the good news: the collection of the data works (as can be seen by the dashlet CPU utilization). So nothing is wrong on the side of the collectors running in K8s or how you set up the configuration.

the bad news: the problem is inside your Checkmk site and thus requires more info for troubleshooting. It’s a bit weird, since the dashlets (CPU/Memory utilization: Cluster) and the dashlets (CPU/Memory resources: Cluster) essentially access the same data.
Please take a look at ~/var/log/web.log and also at the RRD files in ~/var/check_mk/rrd (do they exist? are they owned by the right user?).
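
The two questions above (do the RRD files exist, and who owns them) can be answered with a small helper. This is only a sketch: `check_rrd_dir` is a hypothetical name, and the paths in the usage comments are the ones mentioned in this thread, assumed to be relative to the site user's home directory.

```shell
# Hypothetical helper: count the .rrd files below a directory and
# how many of them are NOT owned by the expected user.
check_rrd_dir() {
    dir=$1
    owner=$2
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    total=$(find "$dir" -type f -name '*.rrd' | wc -l | tr -d ' ')
    foreign=$(find "$dir" -type f -name '*.rrd' ! -user "$owner" | wc -l | tr -d ' ')
    echo "total=$total wrong_owner=$foreign"
}

# Typical use as the site user (paths as mentioned in the thread):
# check_rrd_dir ~/var/check_mk/rrd "$(id -un)"
# tail -n 50 ~/var/log/web.log
```

A result of `total=0` would mean the directory exists but holds no RRDs at all, which points at RRD creation rather than at file permissions.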

andreas-doehler · March 10, 2024, 4:41pm

I can only say the same as @martin.hirschvogel: it looks like purely an RRD problem.
@alexmoo which edition of CMK do you use? Raw or Enterprise, and if Enterprise, with classic PNP4Nagios RRDs or the new CMC RRDs?
In both cases please check whether rrdcached is running correctly.
If you use Enterprise with CMC RRDs, you can look at the core log for clues about what the problem is. If you use Raw, or old RRDs with Enterprise, you need to look inside the PNP4Nagios logs.

I don't think this is weird: the dashlet fetches the current value directly from the perfdata output of the check (which exists). What it cannot do is show how the value is distributed over a time range X. That would be fetched from the RRD.

martin.hirschvogel · March 10, 2024, 5:16pm

As we say in Germany, again what learned. Thanks for the lesson

alexmoo · March 14, 2024, 3:34pm

/var/check_mk/rrd/ exists but is empty.

There are some errors in web.log, but none related to Kubernetes monitoring
(attached)
web-log.txt (1.3 KB)

alexmoo · March 14, 2024, 3:47pm

We are using Enterprise with the CMC core.
The cmc.log is full of errors.
The main ones:

2024-03-14 00:07:18 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/deployment_prod-eks_datacenter-production_datacenter-production/Memory_resources_kube_memory_usage.rrd
2024-03-14 00:07:18 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/deployment_prod-eks_datacenter-production_datacenter-production/Replicas_kube_updated_replicas.rrd
2024-03-14 00:07:18 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/deployment_prod-eks_datacenter-production_datacenter-production/Check_MK_execution_time.rrd
2024-03-14 00:07:18 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/deployment_prod-eks_datacenter-production_datacenter-production/Check_MK_children_system_time.rrd
2024-03-14 00:07:19 [4] [main] [RRD helper 1679958] [log] Error creating RRD for pnp_multiple;pod_prod-eks_metabase-production_metabase-production-5bf9d46679-gxs6m;Restarts;kube_pod_restart_rate;0: Tried to create /opt/omd/sites/site1/var/pnp4nagios/perfdata/pod_prod-eks_metabase-production_metabase-production-5bf9d46679-gxs6m/Restarts_kube_pod_restart_rate.rrd, but this RRD exists.

and

/var/pnp4nagios/perfdata/node_prod-eks_ip-172-16-40-100.eu-west-1.compute.internal/Memory_nfs_unstable.rrd
2024-03-14 00:00:14 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/node_prod-eks_ip-172-16-40-100.eu-west-1.compute.internal/Memory_pending.rrd
2024-03-14 00:00:14 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/per
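
With a log this noisy it can help to see which host directories the "No such file" messages refer to. The following is a rough sketch: `missing_rrd_hosts` is a hypothetical helper, and it assumes the log lines look like the ones quoted above, with the host directory being the path component right after `perfdata`. Lines where the path is cut off are skipped.

```shell
# Hypothetical helper: count "No such file" messages per host directory.
missing_rrd_hosts() {
    awk -F'No such file: ' '
        NF > 1 {
            n = split($2, parts, "/")
            # the host directory is the path component right after "perfdata"
            for (i = 1; i < n; i++)
                if (parts[i] == "perfdata") { count[parts[i + 1]]++; break }
        }
        END { for (h in count) print count[h], h }
    ' "$1" | sort -rn
}

# Typical use as the site user (log location assumed from this thread):
# missing_rrd_hosts ~/var/log/cmc.log | head
```

If only certain object types (for example deployment_ or namespace_ hosts) show up, that narrows down where RRD creation is failing.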

alexmoo · March 14, 2024, 4:01pm

Also, the files in /opt/omd/sites/site1/var/pnp4nagios/perfdata/deployment_prod-eks_datacenter-production_datacenter-production/ are .xml, not .rrd.

andreas-doehler · March 14, 2024, 7:14pm

On your system the RRD data has not been migrated completely to the CMC core. The error messages refer to old PNP4Nagios data.
At the following link you can find information on migrating your data.

alexmoo · March 15, 2024, 9:00am

We converted the data to RRD, with partial improvement.
The pods' .rrd data is written to
sites/site1/var/check_mk/rrd, and only those graphs display correctly.

All other data (deployments, namespaces, etc.) is missing,
and the core still insists on saving data to /pnp4nagios/perfdata/

2024-03-15 09:43:22 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/namespace_prod-eks_server1-api-production/Check_MK_children_system_time.rrd
2024-03-15 09:43:22 [4] [rrdcached] [rrdcached at "/omd/sites/site1/tmp/run/rrdcached.sock"] [log] -1 No such file: /opt/omd/sites/site1/var/pnp4nagios/perfdata/namespace_prod-eks_server1-api-production/Check_MK_cmk_time_agent.rrd
2024-03-15 09:43:22 [4] [main] [RRD helper 222423] [log] Error creating RRD for pnp_multiple;namespace_prod-eks_server1-api-production;CPU resources;kube_cpu_usage;0.002718: Tried to create /opt/omd/sites/site1/var/pnp4nagios/perfdata/namespace_prod-eks_server1-api-production/CPU_resources_kube_cpu_usage.rrd, but this RRD exists.

It's all very strange, because the first cluster just works fine.
Any idea?

Norm · March 15, 2024, 9:33am

Hi @alexmoo,

Can you try the following steps?

1. Create a backup (just in case).
2. Stop the affected site.
3. As the site user, execute the following command: cmk -v --convert-rrds | grep WARNING | awk '{ print $3 }'| xargs rm
4. Start the site again.
5. Wait and see if the RRDs get created.
6. If the graphs are not created, please check the cmc.log for RRD messages and post them here.
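
A more defensive way to run the deletion step is to look at the candidate paths before anything is removed. The sketch below keeps the same assumption as the command above, namely that the third whitespace-separated field of each WARNING line is a file path; that is not verified here. `warning_paths` is a hypothetical helper that drops every token that is not an existing regular file, so stray tokens never reach `rm`. Paths containing spaces are not handled.

```shell
# Hypothetical helper: read conversion output on stdin and print the third
# field of each WARNING line, but only if it names an existing regular file.
warning_paths() {
    awk '/WARNING/ { print $3 }' | while IFS= read -r path; do
        if [ -f "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Typical use as the site user, with the site stopped:
# cmk -v --convert-rrds | warning_paths > /tmp/stale_files.txt
# less /tmp/stale_files.txt                      # review before deleting
# xargs -r -d '\n' rm -- < /tmp/stale_files.txt  # "--" stops option parsing
```

Writing the list to a file first separates the review from the deletion, and `rm --` keeps a token that starts with a dash from being read as an option.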

Hope this helps.

Best Regards
Norm

alexmoo · March 18, 2024, 8:24am

The command itself wasn’t successful:

ERROR: you must define at least one Data Source HOST: xxx

and

rm: invalid option -- '>'
Try 'rm --help' for more information.

However, I used its output to delete the .xml files, and that worked.
We can now see the graphs for almost all Kubernetes elements.

Thanks a lot