I’ve upgraded all my sites to 1.6.0p5 and updated all of my Oracle Linux systems to the latest agent, which includes a ‘Systemd Service Summary’ service.
Great idea, but 99.9% of the warnings are from the ‘check_mk’ service on my EL7 systems.
Any hints?
I’ve already added a ‘trap “exit 0” EXIT’ as the second line, to ensure the script always returns an exit code of zero.
I also periodically run an Ansible playbook against all systems to reset failed services. (ansible hostnames -ba ‘systemctl reset-failed check_mk*’)
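For anyone who wants to do the same reset locally (from cron, say) instead of via Ansible, here is a minimal sketch. It assumes a systemd host; the helper names filter_check_mk_units and reset_failed_check_mk are made up for illustration and are not part of the agent. A plain systemctl reset-failed with a glob does the same job; this version just prints which instances it cleared.

```shell
#!/bin/sh
# Sketch only: reset failed per-connection check_mk@ instances and
# print the name of each unit that was reset.

# Read `systemctl list-units --plain --no-legend` style lines on stdin
# and print the unit names (first column) that start with "check_mk@".
filter_check_mk_units() {
    awk '$1 ~ /^check_mk@/ { print $1 }'
}

# List failed check_mk@ units and reset them one by one.
reset_failed_check_mk() {
    systemctl list-units --state=failed --no-legend --plain 'check_mk@*' |
        filter_check_mk_units |
        while read -r unit; do
            systemctl reset-failed "$unit" && echo "reset: $unit"
        done
}

# Usage (as root), e.g. from a cron job:
#   reset_failed_check_mk
```

Note that this only hides the symptom; the units still fail, they just no longer accumulate in the failed list.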
Hi Ronny,
the only thing I’ve found was an entry at /var/log/messages:
Nov 23 16:34:35 systemd: Starting Check_MK (<ip-address of check-mk-server>:33118)...
Nov 23 16:36:50 systemd: check_mk@914-10.2.16.144:6556-<ip-address of check-mk-server>:33118.service start operation timed out. Terminating.
Nov 23 16:36:50 systemd: Failed to start Check_MK (<ip-address of check-mk-server>:33118).
Nov 23 16:36:50 systemd: Unit check_mk@914-10.2.16.144:6556-<ip-address of check-mk-server>:33118.service entered failed state.
Nov 23 16:36:50 systemd: check_mk@914-10.2.16.144:6556-<ip-address of check-mk-server>:33118.service failed.
Nov 23 16:36:50 systemd: Starting Check_MK (<ip-address of check-mk-server>:43208)...
Nov 23 16:36:55 systemd: Started Check_MK (<ip-address of check-mk-server>:43208).
Check-mk is running normally, but the systemd check goes critical after that. Any suggestions for configuring the agent under /etc/xinetd.d/check_mk?
Would it help to set wait = yes in /etc/xinetd.d/check_mk? It’s a single call per check interval, right?
So far I have updated the trap to call a cleanup function. Within that function, I write a status line to a log file, using the ‘date’ command to get a date and time stamp, along with the process ID, the number of seconds the script had been running, and the return value the function had received.
I then reused the ‘date’ command, injecting it after each block of code and using the function name of the previous block as the label for that line.
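Roughly, the instrumentation described above looks like the sketch below. This is not the exact code I put in the agent: the log path, the names log_step and cleanup, and the line format are placeholders chosen for illustration.

```shell
# Hypothetical log location; pick whatever suits your system.
LOG_FILE="${LOG_FILE:-/tmp/check_mk_agent_timing.log}"
START_TIME=$(date +%s)

# Append one line: timestamp, PID, seconds since start, and a label.
log_step() {
    now=$(date +%s)
    printf '%s pid=%s elapsed=%ss %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$$" "$((now - START_TIME))" "$1" \
        >> "$LOG_FILE"
}

# Called from the EXIT trap: log the status the script was about to
# return, then force an exit code of zero.
cleanup() {
    log_step "exit rc=$1"
    exit 0
}
trap 'cleanup $?' EXIT

# Then, after each block of the agent script, something like:
#   log_step "plugins"
```

A large jump in the elapsed value between two consecutive labels points at the block in between.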
That led me to notice a random large jump in ‘plugins’. A few more ‘date’ lines inside the ‘plugins’ code revealed that the mk_inventory plugin was the culprit. I moved it from /usr/lib/check_mk_agent/plugins/ to …/plugins/300/, and that resolved 99% of my problems.
Now on to the last 1%.
Note:
With mk_inventory in plugins/ it was being run within the check. Most of the time this wasn’t an issue, because it would only run to completion once every four hours, and on most systems it could do so within the 60 seconds allocated to running checks. On the few combinations of systems and loads where it could not, we got a failed service check, and thus a systemd failed service warning. By moving it to plugins/300/ we moved it out of the time window, and thus got fewer failures.
Since the mk_inventory script only ran to completion once every 4 hours (default), running it cached, with a 5 minute window, is an acceptable hit.
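In case it helps, the move can be sketched like this. The function name move_plugin_to_cache is made up for illustration; the only thing that matters to the agent is that the plugin ends up in a subdirectory named after the cache interval in seconds.

```shell
# Move a plugin into a cache-interval subdirectory of the plugin dir.
# Usage: move_plugin_to_cache PLUGIN_DIR PLUGIN_NAME INTERVAL_SECONDS
move_plugin_to_cache() {
    plugin_dir=$1
    plugin=$2
    interval=$3
    # Refuse to continue if the plugin is not where we expect it.
    [ -f "$plugin_dir/$plugin" ] || {
        echo "not found: $plugin_dir/$plugin" >&2
        return 1
    }
    mkdir -p "$plugin_dir/$interval" &&
        mv "$plugin_dir/$plugin" "$plugin_dir/$interval/$plugin"
}

# What I did, in effect (as root on the monitored host):
#   move_plugin_to_cache /usr/lib/check_mk_agent/plugins mk_inventory 300
```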
I think the problem is that you are running check_mk from xinetd and thus check_mk.socket fails.
If you don’t want to use check_mk.socket via systemd, disable it.
systemctl disable check_mk.socket
should do the trick.
Disabled systemd units will be shown in the systemd service summary, but not flagged as CRIT.
Hi, same problem here since switching to SystemD Socket instead of xinetd on Debian 8-10.
I’ve minimized those issues by increasing the “Agent TCP connect timeout” to 35s. Nonetheless it still happens here and there:
Jan 15 14:21:05 hostname systemd[1]: Starting Check_MK ([::1]:35266)…
Jan 15 14:22:06 hostname systemd[1]: Starting Check_MK ([::1]:35280)…
Jan 15 14:22:09 hostname systemd[1]: Started Check_MK ([::1]:35280).
Jan 15 14:22:35 hostname systemd[1]: check_mk@16243-::1:6556-::1:35266.service start operation timed out. Terminating.
Jan 15 14:22:35 hostname systemd[1]: Failed to start Check_MK ([::1]:35266).
Jan 15 14:22:35 hostname systemd[1]: Unit check_mk@16243-::1:6556-::1:35266.service entered failed state.
Jan 15 14:27:06 hostname systemd[1]: Starting Check_MK ([::1]:35293)…
Jan 15 14:28:06 hostname systemd[1]: Starting Check_MK ([::1]:35303)…
Jan 15 14:28:09 hostname systemd[1]: Started Check_MK ([::1]:35303).
The one failed connection remains as a failed systemd process unless it’s “reset-failed”.
I think there is a correlation with apt-daily tasks in the early morning or, more generally, with (very) high load. I’ve already tried to ignore a failure of this process, but it didn’t help:
root@hostname:~# cat /etc/systemd/system/check_mk@.service.d/override.conf
[Service]
ExecStart=
ExecStart=-/usr/bin/check_mk_agent
(see systemd.service Prefix -)
I don’t care about a single check_mk call running into a timeout and would like to find a solution here, too.