CrashLoopBackOff for node-collector machine-sections on large nodes

CMK version:
2.1.0.p9
OS version:
Ubuntu 22.04.1 LTS
Error message:

hluerssen@bpsl086199:~$ kubectl get pods
NAME                                             READY   STATUS             RESTARTS        AGE
checkmk-cluster-collector-59766d5445-b654z       1/1     Running            0               52m
checkmk-node-collector-container-metrics-7qs7k   2/2     Running            0               51m
checkmk-node-collector-container-metrics-99w6d   2/2     Running            0               50m
checkmk-node-collector-container-metrics-9s4gv   2/2     Running            0               48m
checkmk-node-collector-container-metrics-c68qv   2/2     Running            0               49m
checkmk-node-collector-container-metrics-h7q4j   2/2     Running            0               50m
checkmk-node-collector-container-metrics-k9wsk   2/2     Running            0               51m
checkmk-node-collector-container-metrics-m99qm   2/2     Running            0               49m
checkmk-node-collector-container-metrics-ndf5p   2/2     Running            0               52m
checkmk-node-collector-container-metrics-zw7fw   2/2     Running            0               51m
checkmk-node-collector-machine-sections-6vgkm    1/1     Running            0               4d1h
checkmk-node-collector-machine-sections-8xh6g    1/1     Running            0               4d1h
checkmk-node-collector-machine-sections-c8lbg    0/1     CrashLoopBackOff   14 (4m4s ago)   52m
checkmk-node-collector-machine-sections-gsq46    1/1     Running            0               4d1h
checkmk-node-collector-machine-sections-ngst7    1/1     Running            0               4d1h
checkmk-node-collector-machine-sections-pc6gt    1/1     Running            0               52m
checkmk-node-collector-machine-sections-sl67h    1/1     Running            2 (3d ago)      4d1h
checkmk-node-collector-machine-sections-vk7wx    1/1     Running            1 (3d ago)      4d1h
checkmk-node-collector-machine-sections-xshxv    1/1     Running            0               3d
hluerssen@bpsl086199:~$ kubectl logs checkmk-node-collector-machine-sections-c8lbg
DEBUG:   2022-12-02 08:18:42,906 - Parsed arguments: Namespace(host='checkmk-cluster-collector.checkmk-monitoring', port=8080, secure_protocol=True, max_retries=10, connect_timeout=10, read_timeout=12, polling_interval=60, verify_ssl=True, ca_cert='/etc/ca-certificates/checkmk-ca-cert.pem', log_level='debug')
DEBUG:   2022-12-02 08:18:42,906 - Cluster collector base url: https://checkmk-cluster-collector.checkmk-monitoring:8080
INFO:    2022-12-02 08:18:42,906 - Querying Checkmk Agent for node data
Traceback (most recent call last):
  File "/usr/local/bin/checkmk-machine-sections-collector", line 8, in <module>
    sys.exit(main_machine_sections())
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 471, in _main
    worker(session, cluster_collector_base_url, headers, verify)
  File "/usr/local/lib/python3.10/site-packages/checkmk_kube_agent/send_metrics.py", line 376, in machine_sections_worker
    returncode = process.wait(5)
  File "/usr/local/lib/python3.10/subprocess.py", line 1207, in wait
    return self._wait(timeout=timeout)
  File "/usr/local/lib/python3.10/subprocess.py", line 1933, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/usr/local/bin/check_mk_agent']' timed out after 5 seconds

Hello everyone,

Similar to the issue described here, one of my machine-sections collectors remains in CrashLoopBackOff.
We already did the installation with Helm, so the video mentioned in the other post did not provide any additional insight.
I suspect the timeout mentioned in the error message is due to the size of the node (48 vCPUs, 512 GB RAM), which is also the only difference I can make out between this node and all the others, where the collector runs perfectly fine.
Is there a way to adjust the timeout, at least for debugging purposes?

Best Regards
Hendrik

I am stuck on a huge bare-metal node too :frowning:

returncode = process.wait(5)

seems to be hardcoded…

Running the command in a modified container (command: tail -f /dev/null):

/ $ time /usr/local/bin/check_mk_agent
<<<check_mk>>>
Version: 2022.09.01
AgentOS: kube 1.0.0
<<<kernel>>>
1672840482
...
<<<lnx_container_host_if:sep(09)>>>
real	0m 0.51s
user	0m 0.11s
sys	0m 0.05s

/ $ echo $?
0

So: no timeout, and exit code 0.

We figured out that the problem was actually not the timeout configured there, but a broken NetworkPolicy for ingress traffic. Unfortunately we did not have time to analyze this further; for now we just removed the NetworkPolicies for check_mk.

No idea why it only occurred on larger nodes, though…

$ kubectl get networkpolicies.networking.k8s.io -n checkmk-monitoring 
No resources found in checkmk-monitoring namespace.

I can reach the collector inside the container:

/ # wget checkmk-cluster-collector.checkmk-monitoring:8080
Connecting to checkmk-cluster-collector.checkmk-monitoring:8080 (10.43.117.51:8080)
wget: server returned error: HTTP/1.1 403 Forbidden

So I changed the code to show the output (stdout) of /usr/local/bin/check_mk_agent:

/ # /usr/local/bin/checkmk-machine-sections-collector --log-level info
INFO:	 2023-01-04 14:52:23,027 - Querying Checkmk Agent for node data
INFO:	 2023-01-04 14:52:23,028 - <_io.BufferedReader name=3>

So, it was an _io.BufferedReader object instead of a str. I think the collector can only handle a str properly. Adding a logger statement that performs a .read() seems to do the trick:

with subprocess.Popen(  # nosec
    ["/usr/local/bin/check_mk_agent"],
    stdout=subprocess.PIPE,
) as process:
    logger.debug(process.stdout.read())
    returncode = process.wait(5)
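An aside on why draining stdout makes the timeout disappear, and why only large nodes seem affected: this is not confirmed anywhere in this thread, but the Python documentation explicitly warns that Popen.wait() can deadlock with stdout=PIPE when the child fills the OS pipe buffer, because nothing is reading from the pipe. A larger node produces more agent output and is therefore more likely to exceed the buffer. A minimal, self-contained repro, where an oversized print stands in for the check_mk_agent output:

```python
import subprocess

# A child that writes more than the OS pipe buffer (64 KiB by default on
# Linux), standing in for check_mk_agent on a node with many sections.
big_cmd = ["python3", "-c", "print('x' * 2_000_000)"]

timed_out = False
with subprocess.Popen(big_cmd, stdout=subprocess.PIPE) as process:
    try:
        # Nothing drains the pipe, so the child blocks on write and never
        # exits; wait() then hits its timeout, just like in the traceback.
        process.wait(5)
    except subprocess.TimeoutExpired:
        timed_out = True
        process.kill()
print("wait() with an undrained pipe timed out:", timed_out)

# communicate() (which subprocess.run uses internally) reads stdout while
# waiting, so the very same child finishes without any timeout.
result = subprocess.run(big_cmd, stdout=subprocess.PIPE, timeout=5)
print("drained run exit code:", result.returncode)
```

This would also explain why a plain .read() before wait(5) makes the timeout go away: the read drains the pipe and unblocks the agent.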

Output of /usr/local/bin/checkmk-machine-sections-collector --log-level info in the container:

/ # /usr/local/bin/checkmk-machine-sections-collector --log-level info
INFO:	 2023-01-04 15:04:52,347 - Querying Checkmk Agent for node data
INFO:	 2023-01-04 15:04:53,035 - Parsing and sending machine sections
INFO:	 2023-01-04 15:04:53,074 - Successfully sent machine sections to cluster collector
INFO:	 2023-01-04 15:04:53,074 - Worker finished in 0.73 seconds

Created a PR, “Add debug Output” (tribe29/checkmk_kube_agent#15), and hopefully it will be integrated soon :wink:

Thanks! I only saw it now. Next time ping me or @MarcNe.


We looked at it; we might fix it in a different way, though. The PR helps us navigate this issue, thanks for that!
Depending on how complex the underlying issue is, we will fix it either this sprint or in the following one.


Hey @Brice187

we looked at your pull request. What your .read() does is empty the stdout buffer, so no sections are created and thus there is no problem anymore, but also no monitoring :frowning:
Please check our comment in the Pull Request, as we would like to understand the underlying issue better.

Cheers,
Martin
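For completeness, one deadlock-free variant that would keep the agent output is to let communicate() drain stdout while waiting; whether the maintainers chose this route is not stated in the thread. A sketch, with an invented helper name and a stand-in command, since check_mk_agent only exists inside the container:

```python
import subprocess

def query_agent(cmd=("/usr/local/bin/check_mk_agent",), timeout=5):
    # Hypothetical helper (name and signature invented for illustration):
    # communicate() drains stdout while waiting, so the sections are kept
    # instead of being thrown away by a bare .read() before wait().
    with subprocess.Popen(list(cmd), stdout=subprocess.PIPE) as process:
        try:
            stdout, _ = process.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()
            process.communicate()  # reap the killed child, drain the pipe
            raise
    return process.returncode, stdout

# Stand-in command for local testing:
rc, sections = query_agent(cmd=["echo", "<<<check_mk>>>"])
print(rc, sections)
```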

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.