Dear CheckMK community,
I don’t know exactly why, but the systemd CheckMK Agent fails regularly on many of our different servers.
Every day when I get into work and open up CheckMK Monitoring, there is at least one host that reports “no connection to CheckMK agent”.
When I log into these servers, I see something like this:
systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574.service loaded failed failed CheckMK (172.16.14.9:37574)
● check-mk-agent@1684-172.16.14.34:6556-172.16.14.9:42292.service loaded failed failed CheckMK (172.16.14.9:42292)
● check-mk-agent@1685-172.16.14.34:6556-172.16.14.9:42966.service loaded failed failed CheckMK (172.16.14.9:42966)
Journalctl
sudo journalctl -u check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574
-- Logs begin at Thu 2021-07-29 12:52:56 CEST, end at Thu 2021-08-26 09:29:15 CEST. --
Aug 25 20:05:52 pixelmap systemd[1]: Starting CheckMK (172.16.14.9:37574)...
Aug 25 20:07:17 pixelmap systemd[1]: check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574.service: start operation timed out. Terminating.
Aug 25 20:07:17 pixelmap systemd[1]: check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574.service: Failed with result 'timeout'.
Aug 25 20:07:17 pixelmap systemd[1]: Failed to start CheckMK (172.16.14.9:37574).
Systemctl
systemctl status check-mk-agent.socket
● check-mk-agent.socket - CheckMK Agent Socket
Loaded: loaded (/etc/systemd/system/check-mk-agent.socket; enabled; vendor preset: enabled)
Active: active (listening) since Mon 2021-08-23 09:25:01 CEST; 3 days ago
Listen: [::]:6556 (Stream)
Accepted: 1690; Connected: 0; Refused: 377
Tasks: 0 (limit: 120618)
Memory: 3.5M
CGroup: /system.slice/check-mk-agent.socket
Temporary solution
Running sudo systemctl reset-failed and afterwards manually triggering sudo check_mk_agent -v seems to fix the problem for now.
But this is a “temporary fix” which I have to apply every time the CheckMK Agent fails.
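For reference, here is a minimal sketch of how this workaround could be scripted so that the failed instances get listed before they are cleared. The function names failed_agent_units and reset_failed_agents are made up for illustration and are not part of CheckMK; the sketch assumes the instance units are always named check-mk-agent@….service, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CheckMK): reads `systemctl --failed`
# output on stdin and prints only the check-mk-agent instance unit names.
failed_agent_units() {
    grep -o 'check-mk-agent@[^[:space:]]*\.service'
}

# Sketch of the manual workaround: show which agent instances failed,
# then clear their failed state. Nothing runs until this is called.
reset_failed_agents() {
    systemctl --failed --no-legend --plain | failed_agent_units
    # reset-failed accepts unit name patterns, so this only touches
    # the agent instances and leaves other failed units alone.
    sudo systemctl reset-failed 'check-mk-agent@*'
}
```

After sourcing the file, calling reset_failed_agents prints the affected units and resets them. It only automates the cleanup and does nothing about the start timeout itself.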
Why is that?
Can somebody point me in the right direction so I know where to look for the reason behind this?
I found this forum post and this one.
The first one speaks of a problem visible in audit.log, but I could not find this file on our servers.
Both seem to talk about CheckMK 1.6, which we are not using anymore.
Our Setup
CheckMK 2.0.0p9 (CEE) on Ubuntu 20.04 VM
The problem seems to happen on our LAN servers as well as on our WAN servers.
Greetings,
pixelpoint