K8s node-collector-container-metrics unstable when cadvisor does not respond

CMK version:
The problem I am experiencing is a crashing node-collector-container-metrics pod. I am using the Helm chart installation with AppVersion 1.5.1.

OS version:
Kubernetes 1.23.17 in combination with Containerd on Ubuntu 22.04

Error message:
Error of container-metrics-collector/kubernetes-collector:

INFO:    2023-12-30 17:46:10,080 - Parsing and sending container metrics
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 497, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 409, in request
    self.send(chunk)
  File "/usr/local/lib/python3.10/http/client.py", line 998, in send
    self.sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 497, in _make_request
    conn.request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 409, in request
    self.send(chunk)
  File "/usr/local/lib/python3.10/http/client.py", line 998, in send
    self.sock.sendall(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/checkmk-container-metrics-collector", line 8, in <module>
    sys.exit(main_container_metrics())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 336, in container_metrics_worker
    cluster_collector_response = session.post(
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
n/a

The problem I am experiencing is that the pods of the DaemonSet running on my nodes are unstable. They sometimes run for a couple of minutes without a problem but then suddenly crash with a Python error.
After my own investigation I already found out that the problem is related to the communication with cAdvisor, which runs as a separate container in the pod. The CheckMK collector tries to collect metrics on localhost:8080 (cAdvisor) but does not get a response / the connection is reset, and after a couple of failed tries the collector eventually crashes while the cAdvisor container in the pod is still running. Eventually Kubernetes restarts the pod because of the crashed container.
I do not experience this problem in my test environment, which runs the same software versions. The only difference is the number of running pods per node. That is why I think this could be related to the number of pods running on a node and how cAdvisor handles that.
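For context, this is roughly how I verified that only the collector container keeps restarting while cAdvisor stays up (my namespace is checkmk-monitoring; replace the pod name placeholder with one of your node-collector pods):

kubectl -n checkmk-monitoring get pods -o wide

kubectl -n checkmk-monitoring get pod <node-collector-pod> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.restartCount}{"\n"}{end}'

The second command prints the restart count per container; in my case only the collector container is counting up.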

I do not know which version of cAdvisor is being used, because CheckMK uses a specially patched version of cAdvisor.
Does anyone else experience this issue, or have any clue how to troubleshoot it?

… sounds like a network, firewall, network policy problem.

According to

https://docs.checkmk.com/latest/en/monitoring_kubernetes.html?lquery=kubernetes#heading__supported_distributions_and_versions


Our goal is to support each of the last 5 released (minor) versions of Kubernetes.

So k8s 1.23 is not supported anymore, since 1.29 is the latest.

You might want to try an older version of the helm chart for now.

Also, resource limits & throttling of the pods can be a problem; perhaps they need more resources than you allowed.
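A rough way to check that is to compare the configured requests/limits with the live usage and to look for OOM kills, e.g. (kubectl top needs the metrics-server, and <pod> is a placeholder):

kubectl -n cmkmon top pods

kubectl -n cmkmon describe pod <pod> | grep -E -A5 'Limits|Requests|Last State'

"Last State" in the describe output will show OOMKilled if a container was killed for exceeding its memory limit.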

I always suggest exec'ing into the container and checking with wget whether you get data from cAdvisor.

e.g.

kubectl -n cmkmon exec -it myrelease1-checkmk-node-collector-container-metrics-vrrlj -c cadvisor -- /bin/ash

# wget -O - http://localhost:8080/metrics

Thanks for your reply!

… sounds like a network, firewall, network policy problem.

I understand your thinking, but unfortunately I do not think it is network related. What I maybe did not mention clearly enough is that the connection works and stats are gathered from cAdvisor for a while. But after a period of time the error occurs, while other connectivity keeps working. And I do not experience the problem on a smaller cluster (with the same set-up).

So k8s 1.23 is not supported any more since 1.29 is the latest

I know that the Kubernetes version is not supported anymore; I plan to upgrade to a higher release this month. I hope it helps, but I have had this problem for a year now, so I do not think it will make much of a difference. I even upgraded to a newer version before, hoping it would solve my problem.

I always suggest to exec into the container and check with wget if you get data from cadvisor.

This works fine, but the moment the CheckMK collector container fails after a couple of tries, it is restarted by Kubernetes and then continues to work again (without the cAdvisor container in the pod being restarted).

I would like to know which cAdvisor version CheckMK is using and what the ‘patching’ they do consists of.

I updated the DaemonSet to produce debug logging, which gave more information: the issue seems to be related to publishing the data to the cluster-collector. That one is running fine with one replica, but of course it receives a lot of requests from all the node collectors.
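For reference, I got the debug output by editing the node-collector DaemonSet and adding a debug log level to the collector container's arguments, roughly like this (the DaemonSet name follows the Helm release, and the exact flag may differ per version; it corresponds to the log_level entry in the parsed arguments below):

kubectl -n checkmk-monitoring edit daemonset <release>-checkmk-node-collector-container-metrics

and then appending something like "--log-level" and "debug" to the args of the container-metrics-collector container.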

DEBUG:   2024-01-02 18:10:55,117 - Parsed arguments: Namespace(host='cmk-monitoring-checkmk-cluster-collector.checkmk-monitoring', port=8080, secure_protocol=False, max_retries=10, connect_timeout=10, read_timeout=12, polling_interval=60, verify_ssl=False, ca_cert='/etc/ca-certificates/checkmk-ca-cert.pem', log_level='debug')
DEBUG:   2024-01-02 18:10:55,118 - Cluster collector base url: http://cmk-monitoring-checkmk-cluster-collector.checkmk-monitoring:8080
INFO:    2024-01-02 18:10:55,118 - Querying cadvisor version
DEBUG:   2024-01-02 18:10:55,123 - Starting new HTTP connection (1): localhost:8080
DEBUG:   2024-01-02 18:10:55,124 - http://localhost:8080 "GET /api/v2.0/version HTTP/1.1" 200 9
DEBUG:   2024-01-02 18:10:55,124 - cadvisor version b'"v0.47.2"'
INFO:    2024-01-02 18:10:55,124 - Querying container metrics
DEBUG:   2024-01-02 18:10:56,916 - http://localhost:8080 "GET /metrics HTTP/1.1" 200 None
INFO:    2024-01-02 18:10:58,049 - Parsing and sending container metrics
DEBUG:   2024-01-02 18:11:04,007 - Parsed 26426 container metrics
DEBUG:   2024-01-02 18:11:05,916 - Starting new HTTP connection (1): cmk-monitoring-checkmk-cluster-collector.checkmk-monitoring:8080
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/checkmk-container-metrics-collector", line 8, in <module>
    sys.exit(main_container_metrics())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 466, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 336, in container_metrics_worker
    cluster_collector_response = session.post(
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

That got me thinking… about the resources that CheckMK configures by default for the cluster-collector deployment. And CPU throttling was definitely happening.
I configured a much higher amount of CPU and memory, and the situation is almost stable now, so it is definitely related to resources. Something I had to find out myself ;-). Thanks again!
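If someone wants to do the same: the cleanest place to raise the limits is the Helm values, because editing the Deployment directly gets reverted on the next helm upgrade. The exact key path depends on the chart version, so verify it against the chart's values.yaml, but it should look roughly like this (resource values are only examples):

helm upgrade <release> <chart> -n checkmk-monitoring --reuse-values \
  --set clusterCollector.resources.limits.cpu=1 \
  --set clusterCollector.resources.limits.memory=1Gi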

@PoudNL, thank you for your update.

I also do not think that k8s 1.23 is the problem.

In the container, there are some tools available that may help to understand the root cause.

E.g. look at top, netstat -an, the RX/TX packet counters of ifconfig -a, lsof.

I am wondering if you hit some kind of limit (like max open files, too many TCP connections in TIME_WAIT, or something like this).
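You can check those from inside one of the pod's containers (they share the network namespace; the images are minimal, so not every tool may be present), e.g.:

# netstat -an | grep -c TIME_WAIT
# cat /proc/1/limits
# ls /proc/1/fd | wc -l
# ifconfig -a

The first command counts connections stuck in TIME_WAIT, /proc/1/limits shows the effective limits (e.g. max open files) of the container's main process, the fd count shows how many file descriptors are currently open, and ifconfig -a shows the RX/TX packet and error counters.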

Hi,

running 1.5.1 with a 1.27.10 Kubernetes cluster. Same problem, same solution: increasing the CPU and memory limits for the cluster-collector deployment solved the problem.

bye
David