Checkmk k8s node metrics collector

**CMK version:** 2.2.16
**OS version:** CentOS 7

I am seeing constant issues with the k8s node metrics collector on physical k8s nodes. It fails with a timeout error:


  File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
    sys.exit(main_machine_sections())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 374, in machine_sections_worker
    returncode = process.wait(5)
  File "/usr/local/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds

Hello Andrew,

Of course you should make sure that you use our latest Helm charts if possible.
What Kubernetes version and distribution are you using?

If I were you, I would try to get a shell in the container to debug further, like this:

     1	kubectl \
     2		-n cmkmon \
     3		exec \
     4		-it \
     5		myrelase1-checkmk-node-collector-container-metrics-vrrlj \
     6		-c container-metrics-collector -- /bin/ash

The same in one line:

kubectl -n cmkmon exec -it myrelase1-checkmk-node-collector-container-metrics-vrrlj -c container-metrics-collector -- /bin/ash

You have to adjust line 2 to match your namespace and line 5 to match the name of one of your collector-container-metrics pods.

Once you have a shell, try to run the check_mk_agent manually and see what happens:

/ $ /usr/local/bin/check_mk_agent
<<<check_mk>>>
Version: 2.2.0p12
.....
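To put numbers on such a manual run, here is a minimal Python sketch that times a command and counts the bytes it writes to stdout. The helper name `measure` is hypothetical and not part of checkmk_kube_agent. The traceback shows Python 3.10 inside the collector image, so it should be possible to run this there, but treat that as an assumption.

```python
import subprocess
import sys
import time


def measure(cmd, timeout=60.0):
    """Run cmd and return (runtime in seconds, stdout size in bytes, exit code).

    subprocess.run() reads the pipe while the child is running, so a large
    amount of output cannot block the child here.
    """
    start = time.monotonic()
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    elapsed = time.monotonic() - start
    return elapsed, len(result.stdout), result.returncode


# Harmless demo command so the sketch runs anywhere.
# Inside the container you would call:
#   measure(["/usr/local/bin/check_mk_agent"])
elapsed, size, returncode = measure([sys.executable, "-c", "print('hello')"])
print(f"exit={returncode} runtime={elapsed:.2f}s stdout={size} bytes")
```

A runtime close to or above 5 seconds, or an unusually large output, would be worth noting when comparing a failing node with a healthy one.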

Right now I do not know how to trace /bin/sh scripts the way you can with /bin/bash -x,
but at least the real agent that is called by the /usr/local/bin/check_mk_agent wrapper knows the -d (debug) switch.

/usr/local/bin/check_mk_agent.openwrt -d

When I read CentOS 7, I immediately think of SELinux. You may want to check or disable SELinux on the Kubernetes hosts.

Good luck
KR Jodok

@hactarr
Were you able to identify the problem?

Hi, I have the same problem here, currently on Kubernetes 1.24.6 (before upgrading it, I tried upgrading everything else). The Helm chart version is the newest (1.5.1); I came from 1.4.1 and had no issues there.
I also have two other clusters running the checkmk clustercollector in version 1.5.1 without any issue, and the systems should be identical.
Any ideas?

Stream closed EOF for checkmk-monitoring/checkmk-clustercollector-node-collector-machine-sections-fwdhk (machine-sections-collector)
checkmk-clustercollector-node-collector-machine-sections-c22c4 Traceback (most recent call last):
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
checkmk-clustercollector-node-collector-machine-sections-c22c4     sys.exit(main_machine_sections())
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
checkmk-clustercollector-node-collector-machine-sections-c22c4     worker(session, cluster_collector_base_url, headers, verify)
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 374, in machine_sections_worker
checkmk-clustercollector-node-collector-machine-sections-c22c4     returncode = process.wait(5)
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/subprocess.py", line 1209, in wait
checkmk-clustercollector-node-collector-machine-sections-c22c4     return self._wait(timeout=timeout)
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
checkmk-clustercollector-node-collector-machine-sections-c22c4     raise TimeoutExpired(self.args, timeout)
checkmk-clustercollector-node-collector-machine-sections-c22c4 subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds

Thanks a lot!

Hello @jodok.glabasna, @hactarr,

I have exactly the same error and have been able to find out what the problem is. Currently the file src/checkmk_kube_agent/send_metrics.py uses Popen from subprocess in combination with wait(5). However, both the Python documentation and a Stack Overflow article advise against this combination when the child's output goes to a pipe, because it can lead to a deadlock: the child blocks once the pipe buffer is full, and wait() never reads from the pipe. Therefore I created a pull request that switches to communicate(timeout=5), which prevents the deadlock.
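The difference between the two calls can be reproduced outside the collector. The following is a minimal sketch, not the code from send_metrics.py: the child command and both function names are made up for illustration, and it assumes the collector reads the agent output through a pipe. The child writes about 1 MiB to stdout, which is more than a typical OS pipe buffer holds.

```python
import subprocess
import sys

# Hypothetical stand-in for the agent: writes 1 MiB to stdout and exits.
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"]


def run_with_wait(timeout):
    """Popen + wait(timeout) with a piped stdout, as seen in the traceback."""
    process = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
    try:
        # Nobody reads the pipe here, so the child blocks on write once the
        # pipe buffer is full and wait() runs into the timeout.
        process.wait(timeout)
        return "finished"
    except subprocess.TimeoutExpired:
        return "timed out"
    finally:
        process.kill()
        process.communicate()  # drain the pipe and reap the child


def run_with_communicate(timeout):
    """Popen + communicate(timeout): the pipe is read while waiting."""
    process = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
    try:
        stdout, _ = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        stdout, _ = process.communicate()
    return len(stdout)


if __name__ == "__main__":
    print("wait():       ", run_with_wait(1.0))
    print("communicate():", run_with_communicate(30.0), "bytes read")
```

With wait() the child never gets to finish, no matter how long the timeout is, while communicate() returns as soon as the child has written everything. This would also explain why only some clusters are affected: the deadlock only appears once the agent output exceeds the pipe buffer size.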

Greetings
schmidax

Hello,

After 20 days without any action, I am now pinging @martin.hirschvogel.
This issue occurs on all my RKE2 clusters (v1.27.10+rke2r1) and can be fixed with this pull request.

Greetings
schmidax

Hey schmidax,

Thanks for the PR and the great contribution! I have forwarded it to the team.

Cheers, Martin

The latest release of checkmk_kube_agent, 1.6.0, contains the pull requests.
The relevant changes are: