Kubernetes / Rancher 1.21 error

We are trying to set up the new Kubernetes monitoring in 2.1 with a K8s / Rancher cluster running version 1.21.
After tweaking the values.yaml contents, Helm is able to deploy.
But then the node-collector pods cannot be started because the pod affinity / selectors do not match.

Any idea, @martin.hirschvogel?

There are still some issues with Rancher, which is why we don’t officially fully support it yet. But happy to help you get it running.

My suggestion is to connect directly to a control plane node of the cluster you want to monitor and bypass the Rancher proxy. Then you can also skip the step in the current documentation about setting up a ServiceAccount in Rancher.
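
For a quick sanity check that a control plane node is reachable without going through Rancher, something like the following sketch can help. The node address and token path are assumptions for illustration, not values from this thread:

    # Minimal sketch: talk to the Kubernetes API on a control plane node
    # directly, bypassing the Rancher proxy.
    import requests

    API = "https://10.0.0.10:6443"            # hypothetical control plane node address
    TOKEN = open("token.txt").read().strip()  # ServiceAccount token, exported beforehand

    resp = requests.get(
        f"{API}/version",
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # or point this at the cluster CA bundle instead
    )
    print(resp.status_code, resp.json())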

Which values did you change? And can you share a kubectl logs of the respective container and a kubectl describe pod of the pod that is not starting?

I gave it another try with a different cluster, running k8s 1.23 and utilizing RKE2 under the hood. The installation of the helm chart worked without problems.


The installation of the helm chart itself worked without problems; however, I still face problems with the cluster collector.

The cluster collector check reported “Failed attempting to communicate with cluster collector at URL…”. To get more details about the error, I changed the agent_kube.py file:

    except requests.HTTPError as e:
        print(e)
        raise CollectorHandlingException(
            title="Connection Error",
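            # surface the raw response body instead of the generic error_message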
            detail=e.response.text,
#            detail=error_message,
        ) from e

Now I receive a more detailed error message

Connection Error ({"detail":"Access denied for Service Account checkmk-monitoring-cluster-collector in Namespace checkmk-monitoring! See logs for TokenReview."} )
CRIT, Nodes with container collectors: 2/3, Nodes with machine collectors: 3/3

I checked the logs of the pod itself; it reports

{
	"kind": "TokenReview",
	"apiVersion": "authentication.k8s.io/v1",
	"metadata": {
		"creationTimestamp": null,
		"managedFields": [
			{
				"manager": "python-requests",
				"operation": "Update",
				"apiVersion": "authentication.k8s.io/v1",
				"time": "2022-12-05T11:05:25Z",
				"fieldsType": "FieldsV1",
				"fieldsV1": {
					"f:spec": {
						"f:token": {}
					}
				}
			}
		]
	},
	"spec": {
		"token": "***token***"
	},
	"status": {
		"authenticated": true,
		"user": {
			"username": "system:serviceaccount:checkmk-monitoring:checkmk-monitoring-cluster-collector",
			"uid": "9e03836e-a3dc-432c-9897-f0e8a40b9423",
			"groups": [
				"system:serviceaccounts",
				"system:serviceaccounts:checkmk-monitoring",
				"system:authenticated"
			]
		},
		"audiences": [
			"https://kubernetes.default.svc.cluster.local",
			"rke2"
		]
	}
}

I understand this log as follows: the token was accepted, but some access is still denied. Where can I check for more details on which access exactly does not work?

Yes, the issue is known. The cluster collector verifies the communication by validating the used token internally. However, in Rancher you have to set up the ServiceAccount on Rancher and use the token of that ServiceAccount, which is not what the cluster collector expects.
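
Conceptually, that validation looks roughly like the sketch below. This is not the actual cluster collector code, and the expected ServiceAccount name is an assumption:

    # Rough sketch of a TokenReview-based check, similar in spirit to what the
    # cluster collector does internally. Names and endpoints are assumptions.
    import requests

    API = "https://kubernetes.default.svc"  # in-cluster API endpoint
    SA_TOKEN = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read().strip()
    CA_CERT = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    EXPECTED_USER = "system:serviceaccount:checkmk-monitoring:checkmk-monitoring-checkmk"  # hypothetical

    def token_is_accepted(client_token: str) -> bool:
        review = {
            "apiVersion": "authentication.k8s.io/v1",
            "kind": "TokenReview",
            "spec": {"token": client_token},
        }
        resp = requests.post(
            f"{API}/apis/authentication.k8s.io/v1/tokenreviews",
            json=review,
            headers={"Authorization": f"Bearer {SA_TOKEN}"},
            verify=CA_CERT,
        )
        status = resp.json().get("status", {})
        # A token can authenticate fine yet belong to a different ServiceAccount
        # than the one the collector expects, in which case access is denied.
        return status.get("authenticated", False) and status.get("user", {}).get("username") == EXPECTED_USER

That would explain why the TokenReview in your log shows authenticated: true while the collector still answers with “Access denied”.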

The ideal solution would be to build compatibility into the cluster collector on our side, so that it also accepts the Rancher-owned ServiceAccount. It is on our roadmap after we have finished OpenShift support and PVC + CronJob monitoring. All three are currently in user tests.

The workaround for the time being is to bypass the Rancher API proxy and get the data directly from the control plane nodes of the specific cluster. For that, just follow the standard procedure from the docs, use the token from the ServiceAccount that is created by the Helm chart, and specify a control plane node as the API endpoint.
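
To grab that token for the Checkmk rule, something along these lines works. The secret name below is an assumption, so check kubectl get secrets -n checkmk-monitoring for the actual one your chart version creates:

    # Sketch: read the ServiceAccount token created by the Helm chart from its
    # Secret. The secret and namespace names are assumptions for illustration.
    import base64
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    secret = v1.read_namespaced_secret(
        name="checkmk-monitoring-checkmk",   # hypothetical secret name
        namespace="checkmk-monitoring",
    )
    token = base64.b64decode(secret.data["token"]).decode()
    print(token)  # paste this into the Kubernetes datasource rule in Checkmk

The API endpoint in the rule would then be the control plane node itself, e.g. https://&lt;node-address&gt;:6443 (6443 is the usual kube-apiserver default; your setup may differ).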

