CMK version:
The problem I am experiencing is a crashing node-collector-container-metrics pod. I am using the Helm chart installation with AppVersion 1.5.1.
OS version:
Kubernetes 1.23.17 in combination with containerd on Ubuntu 22.04
Error message:
Error of container-metrics-collector/kubernetes-collector:
INFO: 2023-12-30 17:46:10,080 - Parsing and sending container metrics
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 497, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 409, in request
    self.send(chunk)
  File "/usr/local/lib/python3.10/http/client.py", line 998, in send
    self.sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 497, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 409, in request
    self.send(chunk)
  File "/usr/local/lib/python3.10/http/client.py", line 998, in send
    self.sock.sendall(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/checkmk-container-metrics-collector", line 8, in <module>
    sys.exit(main_container_metrics())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 336, in container_metrics_worker
    cluster_collector_response = session.post(
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
n/a
The problem I experience is that the daemonset running on my nodes is unstable. The pods sometimes run for a couple of minutes without a problem, but then suddenly crash with a Python error.
After my own investigation, I found that the problem is related to the communication with cAdvisor, which runs as a separate container in the pod. The Checkmk collector tries to collect metrics from localhost:8080 (cAdvisor) but gets no response / the connection is reset. After a couple of failed attempts the collector eventually crashes, while the cAdvisor container in the pod keeps running. Kubernetes then restarts the pod because of the crashed container.
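To check whether the resets are reproducible outside of the collector process, I am thinking of polling cAdvisor directly from inside the collector container with a small script roughly like the one below. The /metrics path and the one-second interval are my own assumptions, not taken from the collector code, so please correct me if the collector talks to a different endpoint.

import time

import requests

# Assumption: cAdvisor listens on localhost:8080 inside the pod and exposes
# the standard Prometheus /metrics endpoint. Adjust if the patched cAdvisor
# uses a different path.
CADVISOR_URL = "http://localhost:8080/metrics"


def main() -> None:
    session = requests.Session()
    failures = 0
    for i in range(600):  # roughly 10 minutes at one request per second
        try:
            resp = session.get(CADVISOR_URL, timeout=10)
            print(f"{i:04d} status={resp.status_code} bytes={len(resp.content)}")
        except requests.exceptions.ConnectionError as exc:
            failures += 1
            print(f"{i:04d} connection reset/error: {exc}")
        time.sleep(1)
    print(f"finished, {failures} failed requests")


if __name__ == "__main__":
    main()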
I do not experience this problem in my test environment, which runs the same software versions. The only difference is the number of running pods per node, which is why I suspect the issue is related to how many pods run on a node and how cAdvisor handles that.
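To back that hypothesis up, I plan to compare the pod counts per node between the two clusters. A quick sketch with the kubernetes Python client (nothing cluster-specific assumed; run it once against the test cluster and once against production):

from collections import Counter

from kubernetes import client, config

# Uses the local kubeconfig; switch contexts to compare clusters.
config.load_kube_config()

pods = client.CoreV1Api().list_pod_for_all_namespaces(watch=False)
per_node = Counter(p.spec.node_name for p in pods.items if p.spec.node_name)

for node, count in per_node.most_common():
    print(f"{node}: {count} pods")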
I do not know which version of cAdvisor is being used, because Checkmk ships a specially patched version of cAdvisor.
Has anyone else experienced this issue, or does anyone have a clue how to troubleshoot it?