Systemd Service Summary

MatthewStier · November 22, 2019, 9:07am

I’ve upgraded all my sites to 1.6.0p5 and updated all of my Oracle Linux systems to the latest agent, which includes a ‘Systemd Service Summary’ service.

Great idea, but 99.9% of the warnings are from the ‘check_mk’ service on my EL7 systems.

Any hints?

I’ve already added a ‘trap “exit 0” EXIT’ as the second line, to ensure the script always returns an exit code of zero.

I also periodically run an Ansible playbook against all systems to reset failed services. (ansible hostnames -ba ‘systemctl reset-failed check_mk*’)

Racke · November 22, 2019, 9:21am

I suggest checking which services failed and why, instead of resetting the failures without taking a look.

Regards
Racke

grasch · November 23, 2019, 8:49am

Hi,

Same problem here. In most cases “check_mk*” is the problem listed in systemd.
Hints wanted… we are using CentOS 7.x

Regards
Guenther

_rb · November 23, 2019, 12:10pm

Hi,

for further debugging you could use:

systemctl status check_mk.socket
systemd-cgls -au system-check_mk.slice

I’m with Racke, don’t reset failures without having a look.

Best regards
Ronny

grasch · November 24, 2019, 6:28pm

Hi Ronny,
the only thing I’ve found was an entry at /var/log/messages:

Nov 23 16:34:35 systemd: Starting Check_MK (<ip-address of check-mk-server>:33118)...
Nov 23 16:36:50 systemd: check_mk@914-10.2.16.144:6556-<ip-address of check-mk-server>:33118.service start operation timed out. Terminating.
Nov 23 16:36:50 systemd: Failed to start Check_MK (<ip-address of check-mk-server>:33118).
Nov 23 16:36:50 systemd: Unit check_mk@914-10.2.16.144:6556-<ip-address of check-mk-server>:33118.service entered failed state.
Nov 23 16:36:50 systemd: check_mk@914-10.2.16.144:6556-<ip-address of check-mk-server>:33118.service failed.
Nov 23 16:36:50 systemd: Starting Check_MK (<ip-address of check-mk-server>:43208)...
Nov 23 16:36:55 systemd: Started Check_MK (<ip-address of check-mk-server>:43208).

Check-mk is running normally, but the systemd check goes critical after that. Any suggestions
for configuring the agent under /etc/xinetd.d/check_mk?

Would it help to set wait = yes in /etc/xinetd.d/check_mk? It’s a single call per check interval,
right?

Regards
Guenther

MatthewStier · November 24, 2019, 7:56pm

Thanks for the hints so far.

So far I have updated the trap to call a cleanup function. Within the function, I write a status line to a log file: a date and time stamp from the ‘date’ command, the process ID, the number of seconds the script had been running, and the return value the function had received.

I then reused the ‘date’ command, injecting it after each block of code and using the function of the previous block as a label for that date command.

That led me to notice a random large jump in ‘plugins’. A few more ‘date’ lines inside the ‘plugins’ code revealed that the mk_inventory plugin was the culprit. I moved it from /usr/lib/check_mk_agent/plugins/ to …/plugins/300/, and that resolved 99% of my problems.
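
A minimal sketch of the timing approach described above, for use in a copy of the agent script. The names TIMING_LOG, mark and cleanup, and the log path, are assumptions made for illustration; they are not part of the shipped agent.

```shell
#!/bin/bash
# Sketch only: TIMING_LOG, mark() and cleanup() are hypothetical names.
TIMING_LOG="${TIMING_LOG:-/tmp/check_mk_agent_timing.log}"

# Append one status line: timestamp, PID, seconds since script start, label.
mark() {
    printf '%s pid=%s elapsed=%ss %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$$" "$SECONDS" "$1" >> "$TIMING_LOG"
}

# Record the exit status the script was about to return, then force exit 0.
cleanup() {
    local rc=$?
    mark "exit rc=$rc"
    exit 0
}
```

To use it, add `trap cleanup EXIT` near the top of the agent copy and call `mark <label>` after each block, for example `mark plugins`. A large jump in the elapsed value between two consecutive labels points at the slow block.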

Now on to the last 1%.

Note:
With mk_inventory in plugins/ it was running within the check. Most of the time this wasn’t an issue, because it would only run to completion once every four hours, and on most systems it could finish within the 60 seconds allocated to running checks. On the few combinations of systems and loads where it could not, we got a failed service check, and thus a systemd failed service warning. By moving it to plugins/300/ we moved it out of the time window, and thus got fewer failures.

Since the mk_inventory script only ran to completion once every 4 hours (default), running it cached, with a 5 minute window, is an acceptable hit.
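
The move described above can be sketched as a small helper. The agent runs plugins found in a numeric subdirectory asynchronously and caches their output for that many seconds. The function name cache_plugin is hypothetical; PLUGIN_DIR defaults to the path mentioned in this thread and may differ on other installations.

```shell
#!/bin/bash
# Sketch only: cache_plugin() is a hypothetical helper.
# PLUGIN_DIR defaults to the plugin path quoted in this thread.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib/check_mk_agent/plugins}"

# Move plugin NAME into the INTERVAL subdirectory (interval in seconds).
cache_plugin() {
    local name="$1" interval="$2"
    local src="$PLUGIN_DIR/$name" dst="$PLUGIN_DIR/$interval"
    [ -f "$src" ] || { echo "no such plugin: $src" >&2; return 1; }
    mkdir -p "$dst" && mv "$src" "$dst/"
}
```

Run as root, `cache_plugin mk_inventory 300` would reproduce the move to plugins/300/.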

Racke · November 24, 2019, 8:26pm

I think the problem is that you are running check_mk from xinetd and thus check_mk.socket fails.
If you don’t want to use check_mk.socket via systemd, disable it.

systemctl disable check_mk.socket

should do the trick.

Disabled systemd units will be shown in the systemd service summary, but not flagged as CRIT.

Regards
Racke

grasch · November 24, 2019, 9:13pm

Hi,
No, xinetd is not installed on this system. But check_mk installs a default configuration for
xinetd? Strange…

It’s systemd only for now and for future use.

Regards
Guenther

Racke · November 25, 2019, 8:38am

Hello Guenther,

can you please show the output of the following commands as Ronny suggested:

systemctl status check_mk.socket
systemd-cgls -au system-check_mk.slice

Regards
Racke

MatthewStier · November 25, 2019, 10:19pm

systemctl status check_mk.service: failed

I have to do a ‘systemctl -a | grep check_mk’ and select one of the failed entries.

[mstier@cinscplp106 ~]$ sudo systemctl status check_mk@26238-167.254.216.7:6556-168.127.18.5:49990.service
● check_mk@26238-167.254.216.7:6556-168.127.18.5:49990.service - Check_MK (168.127.18.5:49990)
   Loaded: loaded (/etc/systemd/system/check_mk@.service; static; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2019-11-25 10:48:01 CST; 5h 26min ago
  Process: 43313 ExecStart=/usr/bin/check_mk_agent (code=killed, signal=TERM)

Nov 25 10:46:30 cinscplp106 systemd[1]: Starting Check_MK (168.127.18.5:4999....
Nov 25 10:48:01 cinscplp106 systemd[1]: check_mk@26238-167.254.216.7:6556-16....
Nov 25 10:48:01 cinscplp106 systemd[1]: Failed to start Check_MK (168.127.18....
Nov 25 10:48:01 cinscplp106 systemd[1]: Unit check_mk@26238-167.254.216.7:65....
Nov 25 10:48:01 cinscplp106 systemd[1]: check_mk@26238-167.254.216.7:6556-16....
Hint: Some lines were ellipsized, use -l to show in full.
[mstier@cinscplp106 ~]$ 
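
The manual ‘systemctl -a | grep check_mk’ step can be narrowed to just the failed instances. A minimal sketch, assuming the usual UNIT LOAD ACTIVE SUB DESCRIPTION column layout of the unit list; failed_check_mk_units is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: failed_check_mk_units() is a hypothetical helper.
# Reads a systemctl unit list on stdin and prints the names of
# check_mk@ instances whose ACTIVE column is "failed".
failed_check_mk_units() {
    awk '{
        i = 1
        # Some systemd versions put a bullet in front of failed units.
        if ($1 !~ /\.service$/ && $2 ~ /\.service$/) i = 2
        if ($i ~ /^check_mk@/ && $(i + 2) == "failed") print $i
    }'
}
```

Usage would be `systemctl -a | failed_check_mk_units`; each printed name can then be passed to `systemctl status -l` to see the untruncated log lines.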

systemd-cgls -au system-check_mk.slice: failed

Invalid option '-u'

If I drop the ‘-u’ option, I get:

Failed to list cgroup tree system-check_mk.slice: Invalid argument
system-check_mk.slice:

MatthewStier · November 25, 2019, 10:26pm

FYI:

Most of my systems are Oracle Linux, and all of these problem systems are Oracle Linux 7. (Think Red Hat 7)

Also, I have just completed a quarterly patch cycle, so they are Update 7.

grasch · November 26, 2019, 7:31am

Matthew,
do you still have problems with your …/plugins/300 workaround? I’m testing it too in our
environment.

Regards
Guenther

andreas-doehler · November 26, 2019, 9:28am

systemctl status check_mk.service

This will not work. As mentioned before, only systemctl status check_mk.socket does;
check_mk.service is not started or activated.

MatthewStier · November 27, 2019, 4:17pm

All of the ‘systemd’ failures I’m currently fighting are timeouts (agent runs longer than 60 seconds), and no matter what I try, they persist.

I did get an e-mail stating that the systemd check does have a filter.

Go to ‘WATO - Configuration’; ‘Host & Service Parameters’; Search for ‘systemd’; Select ‘Systemd Services’; and add a filter for ‘check_mk@*’

fmonts · January 9, 2020, 3:52pm

Same here on Ubuntu. I think this is a bug in the agent, and adding a filter is only a workaround; the services should not fail in the first place.

Hector · January 15, 2020, 1:55pm

Hi, same problem here since switching to SystemD Socket instead of xinetd on Debian 8-10.

I’ve minimized those issues by increasing the “Agent TCP connect timeout” to 35s. Nonetheless it still happens here and there:

Jan 15 14:21:05 hostname systemd[1]: Starting Check_MK ([::1]:35266)…
Jan 15 14:22:06 hostname systemd[1]: Starting Check_MK ([::1]:35280)…
Jan 15 14:22:09 hostname systemd[1]: Started Check_MK ([::1]:35280).
Jan 15 14:22:35 hostname systemd[1]: check_mk@16243-::1:6556-::1:35266.service start operation timed out. Terminating.
Jan 15 14:22:35 hostname systemd[1]: Failed to start Check_MK ([::1]:35266).
Jan 15 14:22:35 hostname systemd[1]: Unit check_mk@16243-::1:6556-::1:35266.service entered failed state.
Jan 15 14:27:06 hostname systemd[1]: Starting Check_MK ([::1]:35293)…
Jan 15 14:28:06 hostname systemd[1]: Starting Check_MK ([::1]:35303)…
Jan 15 14:28:09 hostname systemd[1]: Started Check_MK ([::1]:35303).

The one failed connection remains as a failed systemd process unless it’s “reset-failed”.
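
To see how long each failed instance sat in the starting state, the log lines can be paired up by peer address. A minimal sketch, assuming syslog-style lines with the time in the third field as in the excerpts in this thread; failed_start_durations is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: failed_start_durations() is a hypothetical helper.
# Reads syslog-style lines on stdin and prints "PEER SECONDS" for every
# instance that logged both "Starting Check_MK" and "Failed to start Check_MK".
failed_start_durations() {
    awk '
    function secs(t,  parts) {
        split(t, parts, ":")
        return parts[1] * 3600 + parts[2] * 60 + parts[3]
    }
    # Extract the text between "Check_MK (" and the following ")".
    function peer(line,  s) {
        s = line
        sub(/^.*Check_MK \(/, "", s)
        sub(/\).*$/, "", s)
        return s
    }
    / Starting Check_MK \(/ { start[peer($0)] = secs($3) }
    / Failed to start Check_MK \(/ {
        key = peer($0)
        if (key in start) {
            d = secs($3) - start[key]
            if (d < 0) d += 86400   # the run crossed midnight
            print key, d
        }
    }'
}
```

Fed the excerpt above, this reports 90 seconds for the [::1]:35266 connection. That matches systemd's default start timeout of 90 seconds, which may explain why ignoring the agent's exit code does not help: the instance appears to be failed by the timeout, not by the exit status.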

I think there is a correlation with apt-daily tasks in the early morning, or in general a (very) high load. I’ve already tried to ignore the exit status of this process, but it didn’t help:

root@hostname:~# cat /etc/systemd/system/check_mk@.service.d/override.conf
[Service]
ExecStart=
ExecStart=-/usr/bin/check_mk_agent
(see systemd.service Prefix -)

I don’t care about a single check_mk call running into a timeout and would like to get a solution here, too.

destracke · January 16, 2020, 11:54am

We see the mentioned issues with multiple hosts on checkmk Enterprise 1.6 p6, e.g. timeouts with multiple spawned

check-mk-agent@*.service

The checkmk agent socket check-mk-agent.socket is active and listening.

If I wanted to dig deeper into the problem, where could I start?

Best regards
Dennis

MatthewStier · January 16, 2020, 2:05pm

First you need to determine where in the code the slowdown is.

I copied the agent, and sprinkled in a few

echo "A $SECONDS"

at the end of each block of code to see which block causes the biggest increase in time. (Change the letter, so each invocation has its own label.)
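
With many markers in the output, the biggest jump is easier to spot when the markers are turned into per-block durations. A minimal sketch, assuming markers are lines of the form "LABEL SECONDS" with an uppercase label; block_durations is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: block_durations() is a hypothetical helper.
# Reads agent output on stdin, keeps only "LABEL SECONDS" marker lines
# (uppercase label, integer seconds) and prints "DELTA LABEL", slowest first.
# Pick labels that cannot collide with real two-field agent output.
block_durations() {
    awk 'NF == 2 && $1 ~ /^[A-Z]+$/ && $2 ~ /^[0-9]+$/ {
        print $2 - prev, $1
        prev = $2
    }' | sort -rn
}
```

Piping the instrumented agent copy through `block_durations` puts the slowest block on the first line.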

JPT · February 13, 2020, 8:15am

Same problem here, with 1.6.0p6 and multiple Linux flavors.
Has anyone found a solution other than a filter?

_rb · February 13, 2020, 3:04pm

This should be fixed with Werk #10710:

https://checkmk.de/check_mk-werks.php?werk_id=10710