Kubernetes monitoring: [special_kube] Agent exited with code 1: Failed to establish a connection to xx.xx.xx.xx:6443 at URL /version

Hi everybody,

CMK version: 2.2.0p7
OS version: CMK server : RHEL 7.9

This is the documentation page that was used to set up the Kubernetes monitoring:

Error message:
Both the Check_MK and Check_MK Discovery services show the following error:

[special_kube] Agent exited with code 1: Failed to establish a connection to xx.xx.xx.xx:6443 at URL /version

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

+ PARSE FETCHER RESULTS
No persisted sections
  HostKey(hostname='<hostname>', source_type=<SourceType.HOST: 1>)  -> Not adding sections: Agent exited with code 1: Failed to establish a connection to xx.xx.xx.xx:6443 at URL /version
  HostKey(hostname='<hostname>', source_type=<SourceType.HOST: 1>)  -> Add sections: []
Received no piggyback data
[cpu_tracking] Start [7fd4acc201d0]
value store: synchronizing
Trying to acquire lock on /omd/sites/<site name>/tmp/check_mk/counters/<hostname>
Got lock on /omd/sites/<site name>/tmp/check_mk/counters/<hostname>
value store: loading from disk
Releasing lock on /omd/sites/<site name>/tmp/check_mk/counters/<hostname>
Released lock on /omd/sites/<site name>/tmp/check_mk/counters/<hostname>
No piggyback files for '<hostname>'. Skip processing.
[cpu_tracking] Stop [7fd4acc201d0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.009999999776482582))]
[special_kube] Agent exited with code 1: Failed to establish a connection to xx.xx.xx.xx:6443 at URL /version(!!), [piggyback] Success (but no data found for this host), execution time 0.8 sec | execution_time=0.780 user_time=0.000 system_time=0.000 children_user_time=0.670 children_system_time=0.070 cmk_time_ds=0.030 cmk_time_agent=0.000

I only added the sections where the error is shown to save space.

Does anyone have an idea why this error is occurring?
Does anyone have an idea how to solve it?

Any assistance would be greatly appreciated.

Best regards,
Jacky

Yes, your Checkmk can't talk to your K8s cluster, so if you sort that out the monitoring will work.

Hi Anders,

Thanx for replying.

Yes, that part is clear, but the reason why is unclear at this point, as the k8s cluster is otherwise accessible to the Checkmk server. And since it’s not clear why it can’t reach this part of the k8s cluster, we also don’t know what needs fixing to make the current error go away.

Best regards,
Jacky

Check from the monitoring server whether the Kubernetes API server is reachable.
E.g. run a curl … against it.
If it returns a 403, then the rest should be straightforward. Anything besides this is a network issue.
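The same reachability check can be sketched in Python, which makes it easier to tell a network failure, a TLS trust failure and an HTTP answer apart. This is only a rough sketch and not how the Checkmk special agent works internally: `classify` and `probe` are hypothetical helper names, and the wording of the diagnoses is an assumption.

```python
import ssl
import urllib.error
import urllib.request


def classify(outcome):
    """Map a probe outcome (HTTP status code or exception) to a short diagnosis."""
    if isinstance(outcome, int):
        if outcome in (401, 403):
            return "reachable: API server answered, only authorization is missing"
        return f"reachable: API server answered with HTTP {outcome}"
    # SSLCertVerificationError is a subclass of SSLError, so test it first.
    if isinstance(outcome, ssl.SSLCertVerificationError):
        return "reachable, but the certificate issuer is not trusted"
    if isinstance(outcome, ssl.SSLError):
        return "reachable, but the TLS handshake failed"
    return f"network problem: {outcome}"


def probe(url, cafile=None, timeout=5.0):
    """Request `url` over HTTPS and return a diagnosis string.

    `cafile` is an optional PEM bundle to trust (hypothetical parameter name).
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:  # server answered with an error status
        return classify(err.code)
    except urllib.error.URLError as err:  # wraps DNS, TCP and TLS failures
        return classify(err.reason)
    except OSError as err:  # e.g. a timeout raised outside URLError
        return classify(err)
```

Calling `probe("https://xx.xx.xx.xx:6443/version")` from the monitoring server, with the real address filled in, should then show which of the three cases applies.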

Hi Martin,

Thanx for replying.

I’ll run a curl tomorrow and see what it shows. Just not sure whether to hope for a 403 or something else as I’m not sure at this point which will lead to a quicker fix overall! :upside_down_face:

Best regards,
Jacky

@martin.hirschvogel

Hi Martin, I ran the curl with -v from the Checkmk server and this is the output it generated:

curl -v https://xx.xx.xx.xx:6443/
* About to connect() to xx.xx.xx.xx port 6443 (#0)
*   Trying xx.xx.xx.xx...
* Connected to xx.xx.xx.xx (xx.xx.xx.xx) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* Server certificate:
*       subject: CN=kube-apiserver
*       start date: Nov 09 13:00:54 2023 GMT
*       expire date: Nov 08 13:00:54 2024 GMT
*       common name: kube-apiserver
*       issuer: CN=kubernetes
* NSS error -8179 (SEC_ERROR_UNKNOWN_ISSUER)
* Peer's Certificate issuer is not recognized.
* Closing connection 0
curl: (60) Peer's Certificate issuer is not recognized.
More details here: http://curl.haxx.se/docs/sslcerts.html

Would this be the same reason the Checkmk services are being tripped up, or is this purely a curl issue with the certificate issuer?

We use self-signed certificates issued by the company’s certificate desk. Does this have to be indicated to Checkmk somehow in order for it to accept the certificates?

Could this also be an issue for Checkmk when registering the agents for our non-Kubernetes Linux / Windows servers?

Is this also something that would hinder setting up TLS for the checkmk-cluster-collector?

Best regards,
Jacky

This is good, because it means there is no connection problem.
For test purposes, you can just tell Checkmk to ignore certificate errors.
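To illustrate the difference between ignoring certificate errors and trusting the company CA, here is a minimal Python sketch using the standard `ssl` module. It is not Checkmk's own code: `build_context` is a hypothetical helper and the CA bundle path in the comment is only a placeholder.

```python
import ssl


def build_context(ca_bundle=None, verify=True):
    """Return an SSL context for talking to the Kubernetes API server.

    verify=False skips all certificate checks (for tests only).
    ca_bundle is an optional PEM file with the issuing CA, e.g. a
    placeholder path such as "/path/to/company-ca.pem".
    """
    if not verify:
        ctx = ssl.create_default_context()
        # check_hostname must be switched off before verify_mode is relaxed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx
    # With verification on, the given CA bundle is trusted instead of the
    # system defaults; without one, the system trust store is used.
    return ssl.create_default_context(cafile=ca_bundle)
```

The unverified variant is fine for confirming that the certificate is the only obstacle, while the variant with the CA bundle is the one to aim for in production.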

You can also further try to debug it yourself using this article:
https://checkmk.atlassian.net/wiki/spaces/KB/pages/9470436/Debugging+the+Kubernetes+-+k8s+special+agent#DebuggingtheKubernetes-k8sspecialagent-DebuggingK8sspecialagent

Running the command in verbose mode with debug has often helped me trace the source of the issue.

Hi Martin,

Thanx for the additional suggestions & information. I haven’t had a chance to check it out or test any further yet, but will hopefully get some time to debug by Friday.

Best regards,
Jacky