Issue connecting to cluster collector for Kubernetes

Hi all. I am testing this product as a possible monitoring solution for the Kubernetes infra that we are building right now.

This is the setup: we are running Tanzu Kubernetes with an F5 virtual server ingress. We used the official Helm chart to retrieve the values.yaml and configured the following:

service:
    # if required specify "NodePort" here to expose the cluster-collector via the "nodePort" specified below
    type: ClusterIP
    port: 8080
    targetPort: 30035
    annotations:
      nodeportlocal.antrea.io/enabled: "true"
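
For illustration, a plain Kubernetes Ingress in front of that ClusterIP service would look roughly like the sketch below. Our real setup uses the F5 virtual servers, so the actual objects differ, and the host and names here are only placeholders.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cluster-collector                     # placeholder name
  namespace: checkmk-monitoring-dev
spec:
  rules:
    - host: cluster-collector.example.com     # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkmk-cluster-collector   # placeholder; use the service created by the chart
                port:
                  number: 8080                    # the service port configured above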

I would upload some files, but apparently I am not allowed to. I am fairly sure the publishing part is right, because I can reach the https://serversname/docs website and authorize with the token from the secret that was created during the deployment.
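
For completeness, this is roughly how I read the token out (the secret name is just a placeholder for the one created during the deployment):

kubectl get secret checkmk-cluster-collector -n checkmk-monitoring-dev \
  -o jsonpath='{.data.token}' | base64 --decode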

We configured the special agent for the cluster collector as per the documentation, but we receive this error:


Status: Setup Error (Failure to establish a connection to cluster collector at URL https://checkmk-sf-tkc-0001.dev.***.com/metadata) CRIT, Nodes with container collectors: 1/3, Nodes with machine collectors: 1/3

Now, when I open this link I get:

{
    "detail": "Not authenticated"
}

When I use a REST client I get “ERROR - An unknown network error occured”.

When I look at the pod events I get this:


Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  30m (x150 over 20h)    kubelet  (combined from similar events): Liveness probe failed: Get "http://100.64.2.37:10050/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  25m (x2048 over 22h)   kubelet  Readiness probe failed: Get "http://100.64.2.37:10050/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  10m (x1757 over 22h)   kubelet  Liveness probe failed: Get "http://100.64.2.37:10050/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Pulled     5m28s (x226 over 23h)  kubelet  Container image "checkmk/kubernetes-collector:1.5.1" already present on machine
  Warning  BackOff    35s (x1697 over 22h)   kubelet  Back-off restarting failed container cluster-collector in pod checkmk-controller-cluster-collector-79cb646447-ghh8q_checkmk-monitoring-dev(0024e15b-4ab7-4f36-9eda-16da9f073eb4)

So the pod is consistently failing, but I don't know why.

Any idea what's going on?

Looking at this part of the events output, I wonder why it is trying to do the health check on port 10050; I did not configure this port anywhere, and we have a deny-all rule set by default. Is there some intra-pod communication going on for this to work?
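
Presumably these checks come from probes defined in the chart's Deployment rather than anything I configured myself. Based on the events above I would expect something along these lines (a sketch, not the actual chart template):

livenessProbe:
  httpGet:
    path: /health
    port: 10050          # matches the address in the probe failures above
  timeoutSeconds: 1      # assumption; "context deadline exceeded" means whatever timeout is set was hit
readinessProbe:
  httpGet:
    path: /health
    port: 10050

If that is the case, the /health requests in the pod log would be the kubelet's probes rather than traffic I generate myself.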

Furthermore, after redeploying the collector with the log level set to “info”, I can share this:


 kubectl logs -n checkmk-monitoring-dev pods/checkmk-controller-cluster-collector-584f77dd4f-xzznj
[2024-03-08 13:12:28 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2024-03-08 13:12:28 +0000] [1] [INFO] Listening at: http://0.0.0.0:10050 (1)
[2024-03-08 13:12:28 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2024-03-08 13:12:28 +0000] [7] [INFO] Booting worker with pid: 7
[2024-03-08 13:12:28 +0000] [7] [INFO] Started server process [7]
[2024-03-08 13:12:28 +0000] [7] [INFO] Waiting for application startup.
[2024-03-08 13:12:28 +0000] [7] [INFO] Application startup complete.
100.64.3.1:46584 - "GET /health HTTP/1.1" 200
100.64.3.1:46596 - "GET /health HTTP/1.1" 200
100.64.3.1:37720 - "GET /health HTTP/1.1" 200
100.64.3.1:37712 - "GET /health HTTP/1.1" 200
100.64.3.1:60918 - "GET /health HTTP/1.1" 200
100.64.3.1:60910 - "GET /health HTTP/1.1" 200
100.64.3.1:37616 - "GET /health HTTP/1.1" 200
100.64.3.1:37618 - "GET /health HTTP/1.1" 200
100.64.3.1:39170 - "GET /health HTTP/1.1" 200
100.64.3.1:39162 - "GET /health HTTP/1.1" 200
100.64.3.1:51184 - "GET /health HTTP/1.1" 200
100.64.3.1:51186 - "GET /health HTTP/1.1" 200
100.64.3.1:47746 - "GET /health HTTP/1.1" 200
100.64.3.1:47758 - "GET /health HTTP/1.1" 200
10.8.236.3:58015 - "GET /docs HTTP/1.1" 200
10.8.236.3:58015 - "GET /openapi.json HTTP/1.1" 200
100.64.3.1:34846 - "GET /health HTTP/1.1" 200
100.64.3.1:34850 - "GET /health HTTP/1.1" 200
100.64.3.1:54932 - "GET /health HTTP/1.1" 200
100.64.3.1:54928 - "GET /health HTTP/1.1" 200
[2024-03-08 13:14:14 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:7)
[2024-03-08 13:14:14 +0000] [1] [WARNING] Worker with pid 7 was terminated due to signal 6
[2024-03-08 13:14:14 +0000] [12] [INFO] Booting worker with pid: 12
[2024-03-08 13:14:14 +0000] [12] [INFO] Started server process [12]
[2024-03-08 13:14:14 +0000] [12] [INFO] Waiting for application startup.
[2024-03-08 13:14:14 +0000] [12] [INFO] Application startup complete.
100.64.3.1:34656 - "GET /health HTTP/1.1" 200
100.64.3.1:34652 - "GET /health HTTP/1.1" 200
100.64.3.1:55722 - "GET /health HTTP/1.1" 200
100.64.3.1:55724 - "GET /health HTTP/1.1" 200
10.8.236.3:58039 - "GET /metadata HTTP/1.1" 403
100.64.3.1:41942 - "GET /health HTTP/1.1" 200
100.64.3.1:41958 - "GET /health HTTP/1.1" 200
100.64.3.1:40584 - "GET /health HTTP/1.1" 200
100.64.3.1:40586 - "GET /health HTTP/1.1" 200
10.8.236.3:58082 - "GET /metadata HTTP/1.1" 403
100.64.3.1:40050 - "GET /health HTTP/1.1" 200
100.64.3.1:40048 - "GET /health HTTP/1.1" 200
10.8.236.3:58090 - "GET /metadata HTTP/1.1" 403
100.64.3.1:60144 - "GET /health HTTP/1.1" 200
100.64.3.1:60148 - "GET /health HTTP/1.1" 200
[2024-03-08 13:15:30 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
[2024-03-08 13:15:30 +0000] [1] [WARNING] Worker with pid 12 was terminated due to signal 6
[2024-03-08 13:15:30 +0000] [18] [INFO] Booting worker with pid: 18
[2024-03-08 13:15:30 +0000] [18] [INFO] Started server process [18]
[2024-03-08 13:15:30 +0000] [18] [INFO] Waiting for application startup.
[2024-03-08 13:15:30 +0000] [18] [INFO] Application startup complete.
100.64.3.1:58716 - "GET /health HTTP/1.1" 200
100.64.3.1:58718 - "GET /health HTTP/1.1" 200
100.64.3.1:50694 - "GET /health HTTP/1.1" 200
100.64.3.1:50692 - "GET /health HTTP/1.1" 200
100.64.3.1:50410 - "GET /health HTTP/1.1" 200
100.64.3.1:50408 - "GET /health HTTP/1.1" 200
100.64.3.1:49078 - "GET /health HTTP/1.1" 200
100.64.3.1:49084 - "GET /health HTTP/1.1" 200
100.64.3.1:49898 - "GET /health HTTP/1.1" 200
100.64.3.1:49904 - "GET /health HTTP/1.1" 200

Hi,

The cluster collector needs to speak to the node collectors to get the data from them. See: Monitoring Kubernetes

If you run deny-all network policies, please take a look at checkmk_kube_agent/deploy/charts/checkmk/values.yaml at main · Checkmk/checkmk_kube_agent · GitHub
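
If you want to keep a deny-all in place later, you will need something that lets the other pods in the namespace reach the cluster collector on its port. A minimal sketch follows; the labels and names are placeholders, so check the chart's values.yaml for the labels it actually sets and for the policies it can create for you.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-to-cluster-collector        # placeholder name
  namespace: checkmk-monitoring-dev
spec:
  podSelector:
    matchLabels:
      app: cluster-collector              # placeholder; use the label the chart puts on the collector pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}                 # any pod in this namespace, e.g. the node collectors
      ports:
        - protocol: TCP
          port: 10050                     # the port the collector listens on (see your gunicorn log)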

Hi Martin,

Thanks for the feedback. I took the deny-all out for testing and will focus on that later. The collector was deployed and I do get feedback, but now I am stuck with the DaemonSet, which does not create the pods due to this error:

Warning FailedCreate 13m daemonset-controller Error creating: pods "checkmk-controller-node-collector-container-metrics-22mm9" is forbidden: violates PodSecurity "baseline:latest": non-default capabilities (container "cadvisor" must not include "SYS_PTRACE" in securityContext.capabilities.add), hostPath volumes (volumes "var-run", "sys", "docker")

When I created the namespace I added the baseline policy:

# creates a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: checkmk-monitoring-dev
  labels:
    pod-security.kubernetes.io/enforce: baseline

Chapter 2.2 of the docs:

kubectl label --overwrite ns checkmk-monitoring pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/enforce-version=latest
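
If you prefer to keep it in the manifest instead of labelling afterwards, the equivalent Namespace definition looks like this (adjust the name to your namespace):

apiVersion: v1
kind: Namespace
metadata:
  name: checkmk-monitoring
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/enforce-version: latest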

Thank you. I adapted the namespace with the label and the controller is now green.
I assume the information boxes below may take some time until they are populated with data?

It looks like no namespaces etc. are showing up.

From the container metrics pods I found these entries, and I wonder if that's an issue:

 kubectl logs -n checkmk-monitoring-dev pods/checkmk-controller-node-collector-container-metrics-8qrs5
Defaulted container "cadvisor" out of: cadvisor, container-metrics-collector
W0311 12:55:20.633547       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 12:55:20.638098       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 12:55:20.644902       1 manager.go:286] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
W0311 13:00:21.478293       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:00:21.568070       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:05:21.477813       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:05:21.481852       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:10:21.478221       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:10:21.482494       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:15:21.478315       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:15:21.482965       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:20:21.477528       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:20:21.481982       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:25:21.478091       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:25:21.483224       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:30:21.478604       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:30:21.565532       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:35:21.477841       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:35:21.482476       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:40:21.477665       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:40:21.488067       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:45:21.477553       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:45:21.484129       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
W0311 13:50:21.477485       1 machine_libipmctl.go:64] There are no NVM devices!
W0311 13:50:21.482184       1 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"

In the Dynamic host management I can see an error:

14:29:22 ERROR An exception occured
Traceback (most recent call last):
  File "/omd/sites/kubemon_01/lib/python3/cmk/cee/dcd/connectors/piggyback.py", line 234, in _execute_phase2
    cmk_hosts = self._web_api.get_all_hosts()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/kubemon_01/lib/python3/cmk/cee/dcd/web_api.py", line 244, in get_all_hosts
    resp = self._session.get("/domain-types/host_config/collections/all")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/kubemon_01/lib/python3.11/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/omd/sites/kubemon_01/lib/python3/cmk/cee/dcd/web_api.py", line 151, in request
    raise MKAPIError(f"{response.text} (URL: {url})")
cmk.cee.dcd.web_api.MKAPIError: {"title": "Unauthorized", "status": 401, "detail": "Wrong credentials (Bearer header)"} (URL: http://localhost:5000/kubemon_01/check_mk/api/1.0/domain-types/host_config/collections/all)

Digging into the details via SSH, I can see data in the location /omd/sites/kubemon_01/tmp/check_mk/piggyback.

Is it possible that your automation user was changed or is not an administrator anymore?
Or is this instance only a “slave” instance inside a distributed monitoring setup?
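
One way to check would be to call the endpoint from the traceback with the automation user's credentials directly; the user name and secret below are placeholders, and the Bearer value is the user and the secret separated by a space:

curl -H "Authorization: Bearer automation YOUR_SECRET" \
  "http://localhost:5000/kubemon_01/check_mk/api/1.0/domain-types/host_config/collections/all"

If that also returns a 401, the credentials configured for the dynamic host management connector are the problem.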