Constant issues with the k8s node metrics collector on physical Kubernetes nodes; it keeps failing with a timeout error:
  File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
    sys.exit(main_machine_sections())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 374, in machine_sections_worker
    returncode = process.wait(5)
  File "/usr/local/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds
At the moment I do not know how to trace /bin/sh scripts the way /bin/bash -x does,
but at least the real agent that is called by the /usr/local/bin/check_mk_agent wrapper knows the -d (debug) switch:
/usr/local/bin/check_mk_agent.openwrt -d
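For the record, POSIX sh accepts the same `-x` (xtrace) flag as bash, so the wrapper itself can be traced with `sh -x /usr/local/bin/check_mk_agent`. A small self-contained demonstration (the /tmp paths and the throwaway script are purely illustrative):

```shell
# sh -x prints each command to stderr before executing it, just like bash -x.
# Stand-in for the real /usr/local/bin/check_mk_agent wrapper:
cat > /tmp/demo_wrapper.sh <<'EOF'
#!/bin/sh
echo "agent output"
EOF
chmod +x /tmp/demo_wrapper.sh

# Capture the trace separately from the script's normal output:
sh -x /tmp/demo_wrapper.sh 2>/tmp/trace.log
cat /tmp/trace.log
```

Inside a running collector pod the same idea applies via `kubectl exec <pod> -- sh -x /usr/local/bin/check_mk_agent`.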
When I read CentOS 7, I immediately think of SELinux; you may want to check or temporarily disable SELinux on the Kubernetes hosts.
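A quick way to check this on the node might look like the following (guarded with `command -v` so it is harmless on hosts without the SELinux tooling; `ausearch` comes from the audit package):

```shell
# Report the SELinux enforcement mode, if the tooling is installed.
if command -v getenforce >/dev/null 2>&1; then
    # Prints Enforcing, Permissive, or Disabled:
    echo "SELinux mode: $(getenforce)"
    # Recent AVC denials often show exactly which call was blocked:
    ausearch -m avc -ts recent 2>/dev/null | tail -n 20
else
    echo "SELinux tooling not installed"
fi
```

`setenforce 0` (as root) switches to permissive mode until the next reboot, which is enough to test whether SELinux is the culprit without permanently disabling it.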
Hi, I have the same problem here, currently with Kubernetes 1.24.6 (before upgrading Kubernetes itself, I tried upgrading everything else first). The Helm chart version is the newest (1.5.1); I came from 1.4.1 and had no issues there.
I also have two clusters running the Checkmk cluster collector in version 1.5.1 without any issue, and the systems should be identical.
Any ideas?
Stream closed EOF for checkmk-monitoring/checkmk-clustercollector-node-collector-machine-sections-fwdhk (machine-sections-collector)

Log output from pod checkmk-clustercollector-node-collector-machine-sections-c22c4:

Traceback (most recent call last):
  File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
    sys.exit(main_machine_sections())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 374, in machine_sections_worker
    returncode = process.wait(5)
  File "/usr/local/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.10/subprocess.py", line 1935, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds
I have exactly the same error and was able to find out what the problem is. Currently the file src/checkmk_kube_agent/send_metrics.py uses subprocess.Popen in combination with wait(5). However, the Python documentation (and a Stack Overflow article) advises against this pattern, because wait() can deadlock when the child process fills the OS pipe buffer. I therefore created a pull request that changes the call to communicate(timeout=5), which prevents the deadlock case.
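The difference is easy to reproduce outside Checkmk. A minimal sketch of the problem and the fix (this is not the actual Checkmk code; the oversized child command just stands in for an agent producing lots of output):

```python
import subprocess
import sys

# With stdout=PIPE, Popen(...).wait(timeout) can deadlock: once the child
# writes more than the OS pipe buffer holds (typically 64 KiB), the child
# blocks on write() while the parent blocks in wait(), so wait(5) would
# only ever end in TimeoutExpired. communicate() drains the pipe while
# waiting, so the pipe never fills up.

# Stand-in child that produces ~1 MB of output, far beyond the pipe buffer:
child = [sys.executable, "-c", "print('x' * 1_000_000)"]

process = subprocess.Popen(child, stdout=subprocess.PIPE)
try:
    # Reads stdout concurrently while waiting up to 5 seconds:
    stdout, _ = process.communicate(timeout=5)
    print(len(stdout.strip()))
except subprocess.TimeoutExpired:
    process.kill()
    process.communicate()  # reap the killed child and drain its pipes
    raise
```

Replacing `communicate(timeout=5)` with `process.wait(5)` in this snippet reproduces the TimeoutExpired from the traceback above.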
After 20 days without action, I am now pinging @martin.hirschvogel.
This issue occurs on all my RKE2 clusters (v1.27.10+rke2r1) and can be fixed with this pull request.