Systemd Service Summary

I’ve upgraded all my sites to 1.6.0p5 and updated all of my Oracle Linux systems to the latest agent, which includes a ‘Systemd Service Summary’ service.

Great idea, but 99.9% of the warnings are from the ‘check_mk’ service on my EL7 systems.

Any hints?

I’ve already added a ‘trap “exit 0” EXIT’ as the second line, to ensure the script always returns an exit code of zero.

I also periodically run an Ansible playbook against all systems to reset failed services. (ansible hostnames -ba ‘systemctl reset-failed check_mk*’)

I suggest checking which services failed and why, instead of resetting the failures without taking a look.
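
A rough sketch of how that look could be scripted before anything is reset. It only filters the output of ‘systemctl list-units’, so it can also be tried on saved output. The function name failed_check_mk_units is made up for this example.

```shell
# Print the names of failed check_mk@ instance units, one per line.
# Expects the output of
#   systemctl list-units --state=failed --no-legend --plain
# on stdin. The unit name is normally the first field; the second field
# is checked as well in case a status bullet precedes it.
failed_check_mk_units() {
    awk '{
        for (i = 1; i <= 2; i++) {
            if ($i ~ /^check_mk@.*\.service$/) { print $i; break }
        }
    }'
}

# Possible use on a live system (run as root to see the full status):
#   systemctl list-units --state=failed --no-legend --plain \
#       | failed_check_mk_units \
#       | while read -r unit; do systemctl status -l "$unit"; done
```

Once the reason is known (for example ‘Result: timeout’), the same list can be handed to ‘systemctl reset-failed’ instead of resetting blindly.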

Regards
Racke

Hi,

Same problem here. In most cases “check_mk*” is the problem listed in systemd.
Hints wanted… we are using CentOS 7.x

Regards
Guenther

Hi,

for further debugging you could use:

systemctl status check_mk.socket
systemd-cgls -au system-check_mk.slice

I’m with Racke, don’t reset failures without having a look.

Best regards
Ronny

Hi Ronny,
the only thing I found was an entry in /var/log/messages:

Nov 23 16:34:35 systemd: Starting Check_MK (:33118)…
Nov 23 16:36:50 systemd: check_mk@914-10.2.16.144:6556-:33118.service start operation timed out. Terminating.
Nov 23 16:36:50 systemd: Failed to start Check_MK (:33118).
Nov 23 16:36:50 systemd: Unit check_mk@914-10.2.16.144:6556-:33118.service entered failed state.
Nov 23 16:36:50 systemd: check_mk@914-10.2.16.144:6556-:33118.service failed.
Nov 23 16:36:50 systemd: Starting Check_MK (:43208)…
Nov 23 16:36:55 systemd: Started Check_MK (:43208).

Check-mk is running normally, but the systemd check goes critical after that. Any suggestions
for configuring the agent under /etc/xinetd.d/check_mk?

Would it help to set wait = yes in /etc/xinetd.d/check_mk? It’s a single call per check interval,
right?

Regards
Guenther

Thanks for the hints so far.

So far I have updated the trap to call a cleanup function. Within that function, I write a status line to a log file, using the ‘date’ command to get a date and time stamp, along with the process ID, the number of seconds the script had been running, and the return value the function had received.

I then reused the ‘date’ command, injecting it after each block of code and using the function name of the preceding block as the label for that date command.
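
For anyone who wants to try the same thing, here is a minimal sketch of that instrumentation, not the exact code I used. LOGFILE, START_TS, log_mark and cleanup are hypothetical names, and the elapsed time is computed with ‘date +%s’ so it does not depend on bash.

```shell
# Where the status lines go; override LOGFILE to log somewhere else.
LOGFILE="${LOGFILE:-/tmp/check_mk_agent_timing.log}"
START_TS=$(date +%s)

# Append one status line: timestamp, PID, seconds since start, label.
log_mark() {
    printf '%s pid=%s elapsed=%ss %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$$" \
        "$(( $(date +%s) - START_TS ))" "$1" >> "$LOGFILE"
}

# Called from the EXIT trap with the exit status the script ended with.
cleanup() {
    log_mark "exit rc=$1"
}

# In a copy of the agent, near the top:
#   trap 'cleanup $?; exit 0' EXIT
# and after each block of code, labelled with that block's name:
#   log_mark "plugins"
```

Comparing the elapsed values of neighbouring lines in the log shows which block ate the time.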

That led me to notice a random large jump in ‘plugins’. A few more ‘date’ lines inside the ‘plugins’ code revealed that the mk_inventory plugin was the culprit. I moved it from /usr/lib/check_mk_agent/plugins/ to …/plugins/300/, and that resolved 99% of my problems.

Now on to the last 1%.

Note:
With mk_inventory in plugins/, it was being run within the check. Most of the time this wasn’t an issue, because it would only run to completion once every four hours, and on most systems it could run to completion within the 60 seconds allocated to running checks. On the few combinations of systems and loads where it could not, we got a failed service check, and thus a systemd failed service warning. By moving it to plugins/300/ we moved it out of the time window, and thus got fewer failures.

Since the mk_inventory script only ran to completion once every 4 hours (default), running it cached, with a 5 minute window, is an acceptable hit.
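
The move itself can be sketched as a small helper. This is only an illustration: cache_plugin is a made-up name, and the plugins directory is passed in rather than hard-coded, since it may differ between installations.

```shell
# Move an agent plugin into a cache-interval subdirectory of the
# plugins directory, creating the subdirectory if needed.
#   $1 = plugins directory
#   $2 = plugin file name
#   $3 = cache interval in seconds (used as the subdirectory name)
cache_plugin() {
    mkdir -p "$1/$3" && mv "$1/$2" "$1/$3/$2"
}

# Example with the plugin and interval from this thread, where
# PLUGINS_DIR stands for the agent's plugins directory:
#   cache_plugin "$PLUGINS_DIR" mk_inventory 300
```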

I think the problem is that you are running check_mk from xinetd and thus check_mk.socket fails.
If you don’t want to use check_mk.socket via systemd, disable it.

systemctl disable check_mk.socket

should do the trick.

Disabled systemd units will be shown in the systemd service summary, but not flagged as CRIT.

Regards
Racke

Hi,
no. xinetd is not installed on this system, but check_mk installs a default configuration for
xinetd? Strange…

It’s systemd only for now and for future use.

Regards
Guenther

Hello Guenther,

can you please show the output of the following commands as Ronny suggested:

systemctl status check_mk.socket
systemd-cgls -au system-check_mk.slice

Regards
Racke

systemctl status check_mk.service: failed

I have to do a ‘systemctl -a | grep check_mk’ and select one of the failed entries.

[mstier@cinscplp106 ~]$ sudo systemctl status check_mk@26238-167.254.216.7:6556-168.127.18.5:49990.service
● check_mk@26238-167.254.216.7:6556-168.127.18.5:49990.service - Check_MK (168.127.18.5:49990)
Loaded: loaded (/etc/systemd/system/check_mk@.service; static; vendor preset: disabled)
Active: failed (Result: timeout) since Mon 2019-11-25 10:48:01 CST; 5h 26min ago
Process: 43313 ExecStart=/usr/bin/check_mk_agent (code=killed, signal=TERM)

Nov 25 10:46:30 cinscplp106 systemd[1]: Starting Check_MK (168.127.18.5:4999…
Nov 25 10:48:01 cinscplp106 systemd[1]: check_mk@26238-167.254.216.7:6556-16…
Nov 25 10:48:01 cinscplp106 systemd[1]: Failed to start Check_MK (168.127.18…
Nov 25 10:48:01 cinscplp106 systemd[1]: Unit check_mk@26238-167.254.216.7:65…
Nov 25 10:48:01 cinscplp106 systemd[1]: check_mk@26238-167.254.216.7:6556-16…
Hint: Some lines were ellipsized, use -l to show in full.
[mstier@cinscplp106 ~]$

systemd-cgls -au system-check_mk.slice: failed

Invalid option ‘-u’

If I drop the ‘-u’ option, I get:

Failed to list cgroup tree system-check_mk.slice: Invalid argument
system-check_mk.slice:

FYI:

Most of my systems are Oracle Linux, and all of these problem systems are Oracle Linux 7. (Think Red Hat 7)

Also, I have just completed a quarterly patch cycle, so they are Update 7.

Matthew,
do you still have problems with your …/plugins/300 workaround? I’m testing it too in our
environment.

Regards
Guenther

systemctl status check_mk.service

This will not work. Use only the command mentioned before: systemctl status check_mk.socket
check_mk.service is not started or activated.

All of the ‘systemd’ failures I’m currently fighting are timeouts (agent runs longer than 60 seconds), and no matter what I try, they still occur.

I did get an e-mail stating that the systemd check does have a filter.

Go to ‘WATO - Configuration’; ‘Host & Service Parameters’; Search for ‘systemd’; Select ‘Systemd Services’; and add a filter for ‘check_mk@*’

Same here on Ubuntu. I think this is a bug in the agent. Adding a filter is only a workaround; the services should not fail in the first place.

Hi, same problem here since switching to SystemD Socket instead of xinetd on Debian 8-10.

I’ve minimized those issues by increasing the “Agent TCP connect timeout” to 35s. Nonetheless it still happens here and there:

Jan 15 14:21:05 hostname systemd[1]: Starting Check_MK ([::1]:35266)…
Jan 15 14:22:06 hostname systemd[1]: Starting Check_MK ([::1]:35280)…
Jan 15 14:22:09 hostname systemd[1]: Started Check_MK ([::1]:35280).
Jan 15 14:22:35 hostname systemd[1]: check_mk@16243-::1:6556-::1:35266.service start operation timed out. Terminating.
Jan 15 14:22:35 hostname systemd[1]: Failed to start Check_MK ([::1]:35266).
Jan 15 14:22:35 hostname systemd[1]: Unit check_mk@16243-::1:6556-::1:35266.service entered failed state.
Jan 15 14:27:06 hostname systemd[1]: Starting Check_MK ([::1]:35293)…
Jan 15 14:28:06 hostname systemd[1]: Starting Check_MK ([::1]:35303)…
Jan 15 14:28:09 hostname systemd[1]: Started Check_MK ([::1]:35303).

The one failed connection remains as a failed systemd process unless it’s “reset-failed”.

I think there is a correlation with apt-daily tasks in the early morning, or in general with a (very) high load. I’ve already tried to ignore the output of this process, but it didn’t help:

root@hostname:~# cat /etc/systemd/system/check_mk@.service.d/override.conf
[Service]
ExecStart=
ExecStart=-/usr/bin/check_mk_agent
(see https://www.freedesktop.org/software/systemd/man/systemd.service.html Prefix -)
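
To check the suspected correlation with apt-daily or high load, the timeouts could be counted per hour of day. A minimal sketch, assuming classic syslog lines like the ones above (month, day, then the time in the third field); timeouts_per_hour is a made-up name.

```shell
# Count "start operation timed out" messages per hour of day.
# Reads syslog-format lines on stdin and prints "HH COUNT" per hour.
timeouts_per_hour() {
    awk '/start operation timed out/ {
             split($3, t, ":")   # third field is HH:MM:SS
             count[t[1]]++
         }
         END { for (h in count) print h, count[h] }' | sort
}

# Possible use, with whatever file holds the systemd messages:
#   timeouts_per_hour < /var/log/messages
```

If most of the counts pile up in the hours when the daily jobs run, that would support the load theory.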

I don’t care about a single check_mk call running into a timeout and would like to get a solution here, too :)

We see the mentioned issues with multiple hosts on checkmk Enterprise 1.6 p6, e.g. timeouts with multiple spawned

check-mk-agent@*.service

The checkmk agent socket check-mk-agent.socket is active and listening.

If I wanted to dig deeper into the problem, where could I start?

Best regards
Dennis

First you need to determine where in the code the slowdown is.

I copied the agent, and sprinkled in a few

echo "A $SECONDS" 

at the end of each block of code to see which block causes the biggest increase in time. (Change the letter so each invocation has its own label.)
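
With many markers it gets tedious to spot the jump by eye. Here is a small sketch that picks it out, assuming the marker lines have the form ‘LABEL SECONDS’ as printed by the echo above; biggest_jump is a made-up name.

```shell
# Read "LABEL SECONDS" marker lines on stdin and print the label that
# follows the largest increase in elapsed time, plus that increase.
# Lines that are not exactly a label and a whole number are skipped.
biggest_jump() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ {
             if ($2 - prev > max) { max = $2 - prev; label = $1 }
             prev = $2
         }
         END { if (label != "") print label, max }'
}

# Possible use with an instrumented copy of the agent
# (the file name is only an example):
#   ./check_mk_agent.copy | biggest_jump
```

Regular agent output that happens to consist of a word and a number would be counted too, so distinctive labels help.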

Same problem here, with 1.6.0p6 and multiple Linux flavours.
Has anyone found a solution other than a filter?

This should be fixed with Werk #10710:

https://checkmk.de/check_mk-werks.php?werk_id=10710
