checkMK k8s node metrics collector

hactarr · December 20, 2023, 6:11pm

**CMK version:** 2.2.16
**OS version:** CentOS 7

Constant issues with the k8s node metrics collector on physical k8s nodes, which fails with a timeout error:


  File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
    sys.exit(main_machine_sections())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 374, in machine_sections_worker
    returncode = process.wait(5)
  File "/usr/local/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds

jodok.glabasna · December 21, 2023, 11:32am

Hello Andrew,

Of course you should make sure that you use our latest Helm charts if possible.
What Kubernetes version and distribution are you using?

If I were you I would try to get a shell in the container to debug further like this:

     1	kubectl \
     2		-n cmkmon \
     3		exec \
     4		-it \
     5		myrelase1-checkmk-node-collector-container-metrics-vrrlj \
     6		-c container-metrics-collector -- /bin/ash

Same in one line

kubectl -n cmkmon exec -it myrelase1-checkmk-node-collector-container-metrics-vrrlj -c container-metrics-collector -- /bin/ash

You have to adjust line 2 to match your namespace and line 6 to match the name of one of your collector-container-metrics pods.

Once you have a shell, try to run check_mk_agent manually and see what happens:

/ $ /usr/local/bin/check_mk_agent
<<<check_mk>>>
Version: 2.2.0p12
.....

Right now I do not know how to trace /bin/sh scripts the way /bin/bash -x does,
but at least the real agent that is called by the /usr/local/bin/check_mk_agent wrapper knows the -d (debug) switch:

/usr/local/bin/check_mk_agent.openwrt -d
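Since the traceback shows that the collector image ships Python 3.10, another rough check is to time the agent and compare the result with the 5-second limit. The following is only a sketch: `time_command` is a hypothetical helper, not part of checkmk_kube_agent, and the 60-second default is an arbitrary assumption.

```python
import subprocess
import time


def time_command(cmd, timeout=60.0):
    """Run cmd and return (elapsed_seconds, returncode, stdout_size_in_bytes).

    Hypothetical helper for debugging; not part of checkmk_kube_agent.
    """
    start = time.monotonic()
    # subprocess.run() drains stdout/stderr while it waits, so a command that
    # produces a lot of output cannot stall on a full pipe during the measurement.
    result = subprocess.run(cmd, capture_output=True, timeout=timeout, check=False)
    elapsed = time.monotonic() - start
    return elapsed, result.returncode, len(result.stdout)


# Example call inside the collector container (path taken from the traceback):
# elapsed, rc, size = time_command(["/usr/local/bin/check_mk_agent"])
# print(f"{elapsed:.1f}s, exit code {rc}, {size} bytes of output")
```

If the elapsed time is well under five seconds here but the collector still times out, the agent itself is probably not the slow part.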

When I read CentOS 7 I immediately think of SELinux; you may want to check or disable SELinux on the Kubernetes hosts.

Good luck
KR Jodok

jodok.glabasna · January 8, 2024, 8:42am

@hactarr
Were you able to identify the problem?

atomique · February 7, 2024, 7:33am

Hi, I have the same problem here. At the moment I am on Kubernetes 1.24.6 (before upgrading it, I tried upgrading everything else). The Helm chart version is the newest (1.5.1); I came from 1.4.1 and had no issues there.
I also have two clusters running the Checkmk cluster collector in version 1.5.1 without any issue, and the systems should be identical.
Any ideas?

Stream closed EOF for checkmk-monitoring/checkmk-clustercollector-node-collector-machine-sections-fwdhk (machine-sections-collector)
checkmk-clustercollector-node-collector-machine-sections-c22c4 Traceback (most recent call last):
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
checkmk-clustercollector-node-collector-machine-sections-c22c4     sys.exit(main_machine_sections())
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
checkmk-clustercollector-node-collector-machine-sections-c22c4     worker(session, cluster_collector_base_url, headers, verify)
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 374, in machine_sections_worker
checkmk-clustercollector-node-collector-machine-sections-c22c4     returncode = process.wait(5)
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/subprocess.py", line 1209, in wait
checkmk-clustercollector-node-collector-machine-sections-c22c4     return self._wait(timeout=timeout)
checkmk-clustercollector-node-collector-machine-sections-c22c4   File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
checkmk-clustercollector-node-collector-machine-sections-c22c4     raise TimeoutExpired(self.args, timeout)
checkmk-clustercollector-node-collector-machine-sections-c22c4 subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds

Thanks a lot!

schmidax · March 6, 2024, 12:15pm

Hello @jodok.glabasna, @hactarr ,

I have exactly the same error and was able to find out what the problem is. Currently the file src/checkmk_kube_agent/send_metrics.py uses Popen from subprocess in combination with wait(5). However, both the Python documentation and a Stack Overflow article advise against this combination because it can deadlock: wait() does not read from the child's pipes, so the child blocks once the OS pipe buffer is full. I therefore created a pull request that uses communicate(timeout=5) instead, which prevents the deadlock.
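To illustrate the general pattern (this is a minimal sketch, not the code from the pull request), the kill-then-drain handling below follows the example in the Python `subprocess` documentation. The function name `run_with_timeout` is hypothetical.

```python
import subprocess


def run_with_timeout(cmd, timeout=5.0):
    """Run cmd and return (returncode, stdout, stderr) without a pipe deadlock.

    Hypothetical helper for illustration; not taken from send_metrics.py.
    """
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        # communicate() reads both pipes while it waits, so the child never
        # blocks on a full pipe buffer the way it can with a bare wait().
        stdout, stderr = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child is not killed automatically on timeout: kill it, then
        # drain the pipes so the process is reaped cleanly.
        process.kill()
        process.communicate()
        raise
    return process.returncode, stdout, stderr
```

A genuinely slow command still raises `TimeoutExpired` with this approach; only the case where the child is stuck writing to a full pipe goes away.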

Greetings
schmidax

schmidax · March 26, 2024, 7:04am

Hello,

After 20 days without action, I am now pinging @martin.hirschvogel.
This issue occurs on all my RKE2 clusters (v1.27.10+rke2r1) and can be fixed with this pull request.

Greetings
schmidax

martin.hirschvogel · March 26, 2024, 10:52am

Hey schmidax,

Thanks for the PR and the great contribution! I have forwarded it to the team.

Cheers, Martin

martin.hirschvogel · April 16, 2024, 11:33am

The latest release of checkmk_kube_agent, 1.6.0, contains the pull requests.
The relevant changes are:

github.com/Checkmk/checkmk_kube_agent/blob/main/.werks/16417

Title: Add Configuration Option 'checkmkAgentTimeout'
Class: fix
Compatible: compat
Component: node-collector
Date: 1712152033
Knowledge: doc
Level: 1
Version: 2.0.0-alpha.1

The machine-sections-collector executes a version of the 'check_mk_agent' to collect information
about the host. Sometimes this script takes more than five seconds, which causes the following
traceback.

C+:
 File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds
C-:

If you encounter this error, you can configure a longer timeout via the new option

This file has been truncated.

github.com/Checkmk/checkmk_kube_agent/blob/main/.werks/16418

Title: Fix Don't Deadlock if OS.PIPE Overflows
Class: fix
Compatible: compat
Component: node-collector
Date: 1712153891
Knowledge: doc
Level: 1
Version: 2.0.0-alpha.1

The machine-sections-collector executes a version of the 'check_mk_agent' to collect information
about the host. Previously, if the script produced output to the extent that it had to wait for the
OS pipe buffer to accept more data, it would cause the machine-sections-collector to deadlock and
eventually the collector would timeout. This issue has now been fixed.
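The behaviour described in this werk can be reproduced outside of Kubernetes. The sketch below is an assumption-laden stand-in: a small Python child process plays the role of the agent and writes about 1 MiB to stdout, which is larger than typical OS pipe buffers. The function names are hypothetical.

```python
import subprocess
import sys

# Stand-in for the agent: writes 1 MiB to stdout (assumed to exceed the
# OS pipe buffer, which is commonly 64 KiB on Linux).
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"]


def wait_without_reading(timeout=2.0):
    """Return True if wait() timed out because nobody drained the pipe."""
    process = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
    try:
        process.wait(timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        # Clean up: stop the blocked child and close the pipe.
        process.kill()
        process.communicate()


def wait_while_reading(timeout=30.0):
    """Return the number of bytes received when communicate() drains the pipe."""
    process = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
    stdout, _ = process.communicate(timeout=timeout)
    return len(stdout)
```

On a system where the pipe buffer is smaller than the child's output, `wait_without_reading()` should hit the timeout even though the child has almost no work to do, while `wait_while_reading()` receives the full output.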