Checkmk agent high cpu usage with systemd

linux-party · January 5, 2022, 11:57am

Hi,

We are experiencing the same problem in our data center (DC). I don’t have permission to install xinetd on the servers. The current Checkmk configuration is pretty basic.

Did anyone check for SELinux? Is it enabled and enforcing on affected hosts?

SELinux is disabled on every server in the DC, so it is unlikely to be the reason.

Did anyone check the actual agent output during the time of CPU stress (cmk -d $HOSTNAME) ? Maybe one can understand which section causes the load.

I used telnet to connect to port 6556 and the output looked OK, at least at first glance.
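Eyeballing a dump makes it easy to miss a section that is empty or unusually long. As a minimal sketch (section_sizes is a hypothetical helper name, not part of Checkmk), the following counts the lines under each `<<<section>>>` header in agent output read from stdin:

```shell
# Hypothetical helper: print each agent section header with its line count.
# Reads Checkmk agent output from stdin.
section_sizes() {
    awk '
        /^<<<.*>>>$/ {
            # A new header closes the previous section.
            if (name != "") print name, count
            name = $0; count = 0; next
        }
        name != "" { count++ }
        END { if (name != "") print name, count }
    '
}
```

On the monitoring server this could be used as `cmk -d "$HOSTNAME" | section_sizes`. With encryption enabled, a raw dump from port 6556 will presumably not be readable this way.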

The most common state we see when the agent “crashes”:

$ ps axjf
	   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME    COMMAND
	      1 1377150 1377150 1377150 ?             -1 Ss       0  34:05    /bin/bash /usr/bin/check_mk_agent
	1377150 1624760 1377150 1377150 ?             -1 R        0 34689:11   \_ /bin/bash /usr/bin/check_mk_agent
	1624760 1624761 1377150 1377150 ?             -1 Z        0   0:00         \_ [systemctl] <defunct>

$ top
	top - 11:33:44 up 313 days, 17:42,  1 user,  load average: 1.86, 1.46, 1.32
	Tasks: 254 total,   3 running, 250 sleeping,   0 stopped,   1 zombie
	%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	%Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	%Cpu3  :  5.9 us,  5.9 sy,  0.0 ni, 88.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	MiB Mem :   7944.5 total,    638.8 free,   3263.1 used,   4042.6 buff/cache
	MiB Swap:   4092.0 total,   4071.5 free,     20.5 used.   3945.8 avail Mem

	    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
	1624760 root      20   0   28480  10148   1272 R  88.2   0.1  34689:08 check_m+
	      1 root      20   0  255968  14080   9164 S   0.0   0.2   1329:17 systemd
	      2 root      20   0       0      0      0 S   0.0   0.0   0:36.93 kthreadd

The script “/usr/bin/check_mk_agent” hangs, and its systemctl subprocess becomes a zombie because the parent never reaps it. The 100% usage of one CPU core could be caused by some sort of infinite loop (one that executes no commands and never yields the CPU, e.g. with sleep).

The loop at the end of the agent script looks OK though:

	# if MK_LOOP_INTERVAL is set, we assume we're a 'simple' systemd service
	if [ -n "$MK_LOOP_INTERVAL" ]; then
	    while sleep "$MK_LOOP_INTERVAL"; do
	        # Note: this will not output anything if MK_RUN_SYNC_PARTS=false, which is the intended case.
	        # Anyway: rather send it to /dev/null than risk leaking unencrypted output.
	        MK_LOOP_INTERVAL="" main >/dev/null
	    done
	fi

We attached strace to the process and got no output, which is consistent with a loop spinning purely in user space without making any system calls.
Restarting check-mk-agent-async.service always releases the CPU.
The interval between crashes varies from 7 hours to 5 or more days.
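To look for other hosts in this state, the pattern from the ps output above (a defunct child under a check_mk_agent process) can be checked mechanically. This is a rough sketch; agent_zombies is a hypothetical helper name, and it assumes the parent's command name shows up as check_mk_agent in the `comm` column:

```shell
# Hypothetical helper: list zombie processes whose parent is check_mk_agent.
# Expects "pid ppid stat comm" lines on stdin.
agent_zombies() {
    awk '
        { ppid[$1] = $2; stat[$1] = $3; comm[$1] = $4 }
        END {
            for (p in stat)
                # STAT starting with Z marks a zombie (defunct) process.
                if (stat[p] ~ /^Z/ && comm[ppid[p]] ~ /^check_mk_agent/)
                    print p, comm[p], "parent=" ppid[p]
        }
    '
}
```

It could be fed with `ps -e -o pid= -o ppid= -o stat= -o comm= | agent_zombies`; any output means the host matches the pattern shown above.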

So far the problem was detected on:

Count; Release                                     ; Kernel
1    ; CentOS Linux release 8.2.2004 (Core)        ; 4.18.0-193.14.2.el8_2.x86_64
19   ; CentOS Linux release 8.2.2004 (Core)        ; 4.18.0-193.19.1.el8_2.x86_64
3    ; CentOS Linux release 8.2.2004 (Core)        ; 4.18.0-193.28.1.el8_2.x86_64
5    ; CentOS Linux release 8.3.2011               ; 4.18.0-240.10.1.el8_3.x86_64
1    ; AlmaLinux release 8.4 (Electric Cheetah)    ; 4.18.0-240.22.1.el8_3.x86_64

The problem was **NOT** encountered on RHEL 7 or 8.

Updating the agent package from p12 to p16 didn’t help:
check-mk-agent-2.0.0p12-1.eba2fbc587cbf845.noarch
check-mk-agent-2.0.0p16-1.ecc5e7dc7c2635d8.noarch

I created 12 virtual machines for testing and updated separate parts (systemd, kernel, systemd + kernel, full system update). It looked as if the full update on CentOS 8 helps, but that is not a viable solution for the data center.

The most stable solution was to install the agent package and overwrite its script with the one from GitHub under release 2.0 ( https://github.com/tribe29/checkmk/blob/2.0.0/agents/check_mk_agent.linux ):

Install the check-mk-agent package, either
check-mk-agent-2.0.0p12-1.eba2fbc587cbf845.noarch or
check-mk-agent-2.0.0p16-1.ecc5e7dc7c2635d8.noarch
Copy the Checkmk agent script from GitHub (release 2.0)

$ wget https://github.com/tribe29/checkmk/raw/2.0.0/agents/check_mk_agent.linux
$ cp -vf check_mk_agent.linux /usr/bin/check_mk_agent

Reload services

#Reload systemd
$ systemctl daemon-reload

#Checkmk agent - restart socket service
$ systemctl restart check-mk-agent.socket

#Checkmk agent - restart Asynchronous background tasks
$ systemctl restart check-mk-agent-async.service

The only other file used was /etc/check_mk/encryption.cfg, to enable encryption.

Since applying this procedure, no crashes have been observed so far.
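The copy step in the procedure overwrites the packaged script in place, so a failed download would leave a broken agent behind. As a cautious sketch (replace_agent is a hypothetical helper, and the `.orig` backup name is an assumption, not something the package provides), the overwrite could keep a copy of the packaged script and refuse an empty source file:

```shell
# Hypothetical helper: back up the packaged agent script, then overwrite it.
# Usage: replace_agent <downloaded script> <installed script>
replace_agent() {
    ra_src=$1
    ra_dest=$2
    # Refuse a missing or empty download instead of installing it.
    if [ ! -s "$ra_src" ]; then
        echo "source $ra_src is missing or empty" >&2
        return 1
    fi
    # Keep the packaged version so the change can be rolled back.
    if [ -e "$ra_dest" ]; then
        cp -p "$ra_dest" "$ra_dest.orig" || return 1
    fi
    cp -f "$ra_src" "$ra_dest" && chmod 755 "$ra_dest"
}
```

Run as root, the call matching the procedure would be `replace_agent check_mk_agent.linux /usr/bin/check_mk_agent`, followed by the same systemctl restarts.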

I tried the same with the script from the master branch (GitHub), but discovery reported that the df command has problems. It is a development branch, so I’ll try again later.

I hope some of this information helps.