Checkmk agent high cpu usage with systemd

stager999 · December 20, 2021, 4:36pm

Hi

We are experiencing a strange problem with the Checkmk agent, which runs via systemd. It can happen (very irregularly) that the agent starts to use a lot of CPU (actually a whole CPU core). Restarting systemd or the Checkmk agent does not help. Only a reboot of the system helps. Has anyone experienced a similar issue? Have you got any solution, or is the best solution to use xinetd?

CMK version: 2.0.0p15
OS version: CentOS/Almalinux 8.x

BR, Joe

andreas-doehler · December 20, 2021, 4:38pm

First question or troubleshooting step is - what plugins are used on this system?
I had no such behavior on default agents without plugins.

stager999 · December 20, 2021, 7:05pm

The strange thing is that there is no additional plugin, just agent installed via the rpm package.
On the system with the problem we can see the following “dead” process:
1 33455 33455 33455 ? -1 Rs 0 345:32 /bin/bash /usr/bin/check_mk_agent
33455 13341 33455 33455 ? -1 Z 0 0:00 _ [check_mk_agent]
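
In case it helps others look for the same pattern, here is a minimal sketch for listing defunct processes together with their parent. It assumes a procps-style ps, and the function name list_zombies is only made up for illustration:

```shell
# List zombie (defunct) processes together with their parent PID.
# Expects "PPID PID STAT COMMAND" lines on stdin, for example from:
#   ps -e -o ppid= -o pid= -o stat= -o comm=
list_zombies() {
    awk '$3 ~ /^Z/ { print "zombie pid=" $2 " parent=" $1 " cmd=" $4 }'
}
```

Usage would be something like: ps -e -o ppid= -o pid= -o stat= -o comm= | list_zombies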

According to the debug of the check_mk agent script we think that the cause is the async part of the agent:
/etc/systemd/system/check-mk-agent-async.service

…

andreas-doehler · December 20, 2021, 8:29pm

If it is only the async service, I would check how the agent behaves when it runs with xinetd.
On the other hand, if you have a dead process, it would be good to know the reason for it.

stager999 · December 20, 2021, 9:49pm

Thank you for the answer. Yes, my proposal/wish has also been to try with xinetd. I have never seen a similar problem with xinetd. However, in this case the preference is not to install xinetd. I hope we will be allowed to install it.

stager999 · December 21, 2021, 8:09pm

Troubleshooting has shown that, when the CPU problem occurs, the service “check-mk-agent-async.service” has two child processes “/usr/bin/check_mk_agent”.

After a restart of “check-mk-agent-async.service” the status becomes normal and the second process shows “sleep 60”.

You can see that the async service now has a sleep process, as it should according to the check_mk_agent script:

# if MK_LOOP_INTERVAL is set, we assume we're a 'simple' systemd service
if [ -n "$MK_LOOP_INTERVAL" ]; then
    while sleep "$MK_LOOP_INTERVAL"; do
        # Note: this will not output anything if MK_RUN_SYNC_PARTS=false, which is the intended case.
        # Anyway: rather send it to /dev/null than risk leaking unencrypted output.
        MK_LOOP_INTERVAL="" main >/dev/null
    done
fi
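
Based on that loop, a rough check could look like the sketch below. It is only a heuristic built on what we observed (a “sleep” process means the loop is waiting normally, two agent processes and no sleep is what we see when the CPU is burning), and the function name classify_async_state is made up. A single sample can also catch the agent in the middle of a normal run, so it should be repeated before drawing conclusions:

```shell
# Classify the async service from the command lines of its processes
# (one command line per line on stdin).
#   idle    - a sleep process is present, the loop is waiting as intended
#   suspect - several check_mk_agent processes and no sleep
#   unknown - anything else
classify_async_state() {
    awk '
        /sleep [0-9]/    { sleeping = 1 }
        /check_mk_agent/ { agents++ }
        END {
            if (sleeping)        print "idle"
            else if (agents > 1) print "suspect"
            else                 print "unknown"
        }'
}
```

The input can be collected with something like this (assuming a systemd version that supports --value):
pid=$(systemctl show -p MainPID --value check-mk-agent-async.service)
{ ps -o args= -p "$pid"; ps -o args= --ppid "$pid"; } | classify_async_state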

The async service uses systemd Type=simple (just starts the process and leaves it):

[Unit]
Description=Checkmk agent - Asynchronous background tasks

[Service]
ExecStart=/usr/bin/check_mk_agent
Environment='MK_RUN_SYNC_PARTS=false'
Environment='MK_LOOP_INTERVAL=60'

User=root

[Install]
WantedBy=multi-user.target

However, I see that there may have been some changes, since the agent code on GitHub is different from yesterday and differs quite a lot towards the end:
“checkmk/check_mk_agent.linux at master · tribe29/checkmk · GitHub”

andreas-doehler · December 21, 2021, 8:41pm

It is better to look at the 2.0 branch. The master branch is the dev branch for the next main version.

stager999 · December 21, 2021, 8:59pm

Oh, yes you are right. Thanks.

robin.gierse · December 22, 2021, 6:58am

Hi @stager999! I skipped through this thread, but I am unsure: Was this a one time issue, or can you reproduce the error?

stager999 · December 22, 2021, 7:30am

Hi, it is not a one-time issue. However, it appears randomly, and at the moment we do not know how to reproduce it. We only know that the problem appears on CentOS/Almalinux 8.2 and 8.4. On some servers we have tried a systemd update (the problem has not reappeared yet, but that can take some weeks), but at the moment we cannot do that on all servers since a reboot is needed.

flipsa · December 31, 2021, 1:10pm

I’m also affected by this. I run a test lab with a handful of Debian 10/11 VMs which are monitored by CMK Raw running on Docker swarm (spanning some of those VMs). Since this is a test env I use fairly recent versions; currently I’m on version 2021.12.17.

When this issue happens, about every other day, usually most or all VMs show the same problem: check_mk-async taking 100% of one core. It starts on all affected VMs at exactly the same time, which makes me think it is triggered by the server side rather than by the async agents in the VMs all independently running into an issue at the same moment…

Since I’m running CMK on a Docker swarm which does things automatically (load balancing, pulling updated images, restarting the container, etc.), maybe the problem is the server side becoming unreachable for some time? Unfortunately I have not seen anything obvious in any log files, but I’ll keep looking in that direction.

robin.gierse · January 3, 2022, 9:39am

Maybe some troubleshooting thoughts, to pinpoint the issue:

Did anyone check if the issue is persistent both with systemd and xinetd?
Did anyone check for SELinux? Is it enabled and enforcing on affected hosts?
Did anyone check the actual agent output during the time of CPU stress (cmk -d $HOSTNAME) ? Maybe one can understand which section causes the load.
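
For the last point, a small sketch that lists which sections show up in the raw agent output, so that dumps taken during and outside the CPU stress can be compared. It relies on the section headers having the form <<<name>>> or <<<name:options>>>; the function name list_agent_sections is made up:

```shell
# Print the section names found in raw agent output read from stdin.
# Only lines that consist of a <<<...>>> header are considered.
list_agent_sections() {
    sed -n 's/^<<<\([^:>]*\).*>>>$/\1/p'
}
```

On the Checkmk server this could be fed with: cmk -d "$HOSTNAME" | list_agent_sections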

linux-party · January 5, 2022, 11:57am

Hi,

We are experiencing the same problem on our data centers (DC). I don’t have the permission to install xinetd on servers. The current checkmk configuration is pretty basic.

Did anyone check for SELinux? Is it enabled and enforcing on affected hosts?

SELinux is disabled on every server in the DC, so it is unlikely to be the reason.

Did anyone check the actual agent output during the time of CPU stress (cmk -d $HOSTNAME) ? Maybe one can understand which section causes the load.

I used telnet to connect to port 6556 and the output looked Ok. At least at first glance.

Most common case when agent “crashes”:

$ ps axjf
	   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME    COMMAND
	      1 1377150 1377150 1377150 ?             -1 Ss       0  34:05    /bin/bash /usr/bin/check_mk_agent
	1377150 1624760 1377150 1377150 ?             -1 R        0 34689:11   \_ /bin/bash /usr/bin/check_mk_agent
	1624760 1624761 1377150 1377150 ?             -1 Z        0   0:00         \_ [systemctl] <defunct>

$ top
	top - 11:33:44 up 313 days, 17:42,  1 user,  load average: 1.86, 1.46, 1.32
	Tasks: 254 total,   3 running, 250 sleeping,   0 stopped,   1 zombie
	%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	%Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	%Cpu3  :  5.9 us,  5.9 sy,  0.0 ni, 88.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
	MiB Mem :   7944.5 total,    638.8 free,   3263.1 used,   4042.6 buff/cache
	MiB Swap:   4092.0 total,   4071.5 free,     20.5 used.   3945.8 avail Mem

	    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
	1624760 root      20   0   28480  10148   1272 R  88.2   0.1  34689:08 check_m+
	      1 root      20   0  255968  14080   9164 S   0.0   0.2   1329:17 systemd
	      2 root      20   0       0      0      0 S   0.0   0.0   0:36.93 kthreadd

Script “/usr/bin/check_mk_agent” hangs and its subprocess becomes a zombie as it is not collected. The 100% usage of 1 CPU core could be caused by some sort of infinite loop (which executes no commands, doesn’t temporarily release CPU with sleep, …).

The loop at the end of agent script looks ok though:

	# if MK_LOOP_INTERVAL is set, we assume we're a 'simple' systemd service
	if [ -n "$MK_LOOP_INTERVAL" ]; then
	    while sleep "$MK_LOOP_INTERVAL"; do
	        # Note: this will not output anything if MK_RUN_SYNC_PARTS=false, which is the intended case.
	        # Anyway: rather send it to /dev/null than risk leaking unencrypted output.
	        MK_LOOP_INTERVAL="" main >/dev/null
	    done
	fi

We attached strace to the process and got no output.
A restart of check-mk-agent-async.service always releases the CPU.
The interval between crashes varies from 7 hours up to 5 or more days.
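
To confirm that a given PID is really spinning (and not just showing a large accumulated TIME), the CPU time can be sampled twice. This is only a sketch that assumes a Linux /proc filesystem; the function names cpu_ticks and ticks_during are made up:

```shell
# Print the CPU time (user + system, in clock ticks) consumed so far by a PID.
# Values come from /proc/<pid>/stat, fields 14 (utime) and 15 (stime), see proc(5).
cpu_ticks() {
    stat=$(cat "/proc/$1/stat") || return 1
    rest=${stat##*') '}   # drop "pid (comm) " so that field 3 becomes $1
    set -- $rest
    echo $(( ${12} + ${13} ))
}

# Print how many ticks a PID consumed during an interval (default: 5 seconds).
ticks_during() {
    before=$(cpu_ticks "$1") || return 1
    sleep "${2:-5}"
    after=$(cpu_ticks "$1") || return 1
    echo $(( after - before ))
}
```

For example, ticks_during 1624760 5 would be run against the PID from the ps output above. A process that keeps a core fully busy should report roughly 5 times the value of getconf CLK_TCK (commonly 100), while a process waiting in sleep should report a value near 0.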

So far the problem was detected on:

Count; Release                                     ; Kernel
1    ; CentOS Linux release 8.2.2004 (Core)        ; 4.18.0-193.14.2.el8_2.x86_64
19   ; CentOS Linux release 8.2.2004 (Core)        ; 4.18.0-193.19.1.el8_2.x86_64
3    ; CentOS Linux release 8.2.2004 (Core)        ; 4.18.0-193.28.1.el8_2.x86_64
5    ; CentOS Linux release 8.3.2011               ; 4.18.0-240.10.1.el8_3.x86_64
1    ; AlmaLinux release 8.4 (Electric Cheetah)    ; 4.18.0-240.22.1.el8_3.x86_64

Problem was **NOT** encountered on RHEL 7 or 8.

Updating agent’s package from p12 to p16 didn’t help:
check-mk-agent-2.0.0p12-1.eba2fbc587cbf845.noarch
check-mk-agent-2.0.0p16-1.ecc5e7dc7c2635d8.noarch

I created 12 virtual machines for testing and updated separate parts (systemd, kernel, systemd + kernel, full system update). It looked as if the full update on CentOS 8 helps, but that is not a viable solution for a data center.

The most stable solution was to install the agent and overwrite it with the script from GitHub under release 2.0 ( https://github.com/tribe29/checkmk/blob/2.0.0/agents/check_mk_agent.linux ):

Install check-mk-agent package
check-mk-agent-2.0.0p12-1.eba2fbc587cbf845.noarch or
check-mk-agent-2.0.0p16-1.ecc5e7dc7c2635d8.noarch
Copy checkmk agent script from github (release 2.0)

$ wget https://github.com/tribe29/checkmk/raw/2.0.0/agents/check_mk_agent.linux
$ cp -vf check_mk_agent.linux /usr/bin/check_mk_agent

Reload services

#Reload systemd
$ systemctl daemon-reload

#Checkmk agent - restart socket service
$ systemctl restart check-mk-agent.socket

#Checkmk agent - restart Asynchronous background tasks
$ systemctl restart check-mk-agent-async.service
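
Since cp -vf overwrites the packaged script without keeping a copy, a slightly more careful variant of the copy step could look like the sketch below. It is not part of the procedure we actually ran; the function name replace_agent_script and the backup naming are made up:

```shell
# Replace the installed agent script, keeping a timestamped backup first.
# Usage: replace_agent_script <new-script> <installed-path>
replace_agent_script() {
    src=$1
    dest=$2
    [ -s "$src" ] || { echo "source $src is missing or empty" >&2; return 1; }
    # Reject files with shell syntax errors (a broken or partial download
    # would typically fail here).
    bash -n "$src" || return 1
    if [ -e "$dest" ]; then
        cp -p "$dest" "$dest.bak.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    install -m 0755 "$src" "$dest"
}
```

It would be called as replace_agent_script check_mk_agent.linux /usr/bin/check_mk_agent before the service restarts, and the .bak file makes it easy to go back to the packaged version.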

The only other file used was /etc/check_mk/encryption.cfg, to enable encryption.

After this procedure, no crashes have been observed so far.

I tried the same with the script from the master branch (GitHub), but discovery reported that the df command has problems. It is the dev branch, so I’ll try again later.

I hope some of this information helps.

robin.gierse · January 5, 2022, 12:56pm

Wow, this is some serious troubleshooting and reporting, thanks a lot @linux-party! I cannot give you a badge for that, but consider this token of appreciation:
Also, a warm welcome to the community!

Just pinging @moritz here real quick, maybe he has some thoughts on your post.

I get two main thoughts from your troubleshooting:

The issue occurs on CentOS/Alma/Rocky exclusively, not on RHEL, while a fully patched CentOS 8 seems to work fine.
The most recent (unreleased) linux agent seems to fix the issue too.

Are those conclusions right from your point of view?

tosch · January 5, 2022, 1:40pm

The master branch contains a new subsection for df check which isn’t handled by the 2.0 version of the check so far. See my finding at the following post and the explanation from Andreas below:

linux-party · January 6, 2022, 9:17am

I get two main thoughts from your troubleshooting:

The issue occurs on CentOS/Alma/Rocky exclusively, not on RHEL, while a fully patched CentOS 8 seems to work fine.

The most recent (unreleased) linux agent seems to fix the issue too.

Thank you and yes that’s about it. I’m only using Alma Linux, so Rocky could perform differently.

The master branch contains a new subsection for df check which isn’t handled by the 2.0 version of the check so far. See my finding at the following post…

Thanks, I’ll look into it.

tawi · February 3, 2022, 7:39am

Hi @robin.gierse

We have the same issue, but we are using RHEL8 with CEE 2.0.0p12. I will try the way @linux-party mentioned and replace the check_mk_agent 2.0.0p12 file with the check_mk_agent.linux 2.0.0.p20 and see if this will make any difference. I will give you feedback about the result.

robin.gierse · February 3, 2022, 7:56am

Hi Tanja,

why not update your whole setup to the current 2.0 patch release? That is generally advisable anyway.

tawi · February 3, 2022, 8:10am

Hey @robin.gierse

this will be the next step if the test finishes successfully

tawi · March 16, 2022, 8:49am

@robin.gierse

Just some short feedback on this issue: using 2.0.0.p20 (just the client) worked for about 2 weeks and then I ran into the old problems. So upgrading our whole distributed setup would not help.
Any other hints on what I could try to fix this problem? I cannot switch to automatic agent updates if all clients stop checking for updates after some days or weeks and the only solution is to restart the async client manually on the monitored hosts.

@linux-party

Do you also still have these problems? Or did you find a way to solve this issue?

Best regards
Tanja