CheckMK Agent Systemd fails regularly

Dear CheckMK community,

I don’t know exactly why, but the systemd CheckMK Agent fails regularly on many of our different servers.
Every day when I get into work and open up CheckMK Monitoring, there’s at least one host that reports “no connection to CheckMK agent”.

When I log into these servers, I see something like this:

systemctl --failed
  UNIT                                                            LOAD   ACTIVE SUB    DESCRIPTION                
● check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574.service loaded failed failed CheckMK (172.16.14.9:37574)
● check-mk-agent@1684-172.16.14.34:6556-172.16.14.9:42292.service loaded failed failed CheckMK (172.16.14.9:42292)
● check-mk-agent@1685-172.16.14.34:6556-172.16.14.9:42966.service loaded failed failed CheckMK (172.16.14.9:42966)
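
To keep an eye on how often this happens, the listing above can be tallied with a few lines of shell. This is only a sketch: count_failed_agent_units is a made-up helper name, and it simply counts lines on stdin that look like check-mk-agent@ instance units.

```shell
#!/bin/bash
# Hypothetical helper (not part of CheckMK or systemd):
# reads `systemctl --failed` output on stdin and counts the
# check-mk-agent@... instance units it contains.
count_failed_agent_units() {
    awk '/check-mk-agent@.*\.service/ { n++ } END { print n + 0 }'
}

# Typical use on a monitored host:
#   systemctl --failed --no-legend | count_failed_agent_units
```

Each failed connection gets its own instance unit, so the count is the number of failed agent runs since the last reset-failed.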

Journalctl

sudo journalctl -u check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574
-- Logs begin at Thu 2021-07-29 12:52:56 CEST, end at Thu 2021-08-26 09:29:15 CEST. --
Aug 25 20:05:52 pixelmap systemd[1]: Starting CheckMK (172.16.14.9:37574)...
Aug 25 20:07:17 pixelmap systemd[1]: check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574.service: start operation timed out. Terminating.
Aug 25 20:07:17 pixelmap systemd[1]: check-mk-agent@1674-172.16.14.34:6556-172.16.14.9:37574.service: Failed with result 'timeout'.
Aug 25 20:07:17 pixelmap systemd[1]: Failed to start CheckMK (172.16.14.9:37574).
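
A rough way to see how long systemd actually waited before giving up is to subtract the two timestamps in the journal. The sketch below assumes the short journal format shown above (time of day in the third field) and that both lines fall on the same day; start_to_timeout_seconds is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: reads journal lines for one unit on stdin and
# prints the seconds between "Starting ..." and "start operation timed out".
# Assumes both events happen on the same day (no midnight rollover).
start_to_timeout_seconds() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function to_secs(hms,   t) {
            split(hms, t, ":")
            return t[1] * 3600 + t[2] * 60 + t[3]
        }
        /Starting /                 { start = to_secs($3) }
        /start operation timed out/ { print to_secs($3) - start; exit }
    '
}

# Typical use:
#   journalctl -u 'check-mk-agent@1674-...' | start_to_timeout_seconds
```

For the excerpt above (20:05:52 to 20:07:17) this comes to 85 seconds.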

Systemctl

systemctl status check-mk-agent.socket
● check-mk-agent.socket - CheckMK Agent Socket
     Loaded: loaded (/etc/systemd/system/check-mk-agent.socket; enabled; vendor preset: enabled)
     Active: active (listening) since Mon 2021-08-23 09:25:01 CEST; 3 days ago
     Listen: [::]:6556 (Stream)
   Accepted: 1690; Connected: 0;   Refused: 377
      Tasks: 0 (limit: 120618)
     Memory: 3.5M
     CGroup: /system.slice/check-mk-agent.socket

Temporary solution
Running sudo systemctl reset-failed and afterwards manually triggering sudo check_mk_agent -v seems to fix the problem for now.
But this is a “temporary fix” which I have to apply every time the CheckMK Agent fails.

Why is that?
Can somebody point me in the right direction so I know where to look for the reason behind this?

I found this forum post and this one.
The first one speaks of a problem visible in audit.log, but I could not find this file on our servers.
Both seem to talk about CheckMK 1.6, which we are not using anymore.

Our Setup
CheckMK 2.0.0p9 (CEE) on Ubuntu 20.04 VM

The problem seems to happen in our LAN servers as well as on our WAN servers.

Greetings,
pixelpoint

This is a ‘systemd’ issue, not a CMK issue.

The active check that does the magic will only stay connected for 60 seconds.

These error messages come from active checks that lasted longer than 60 seconds and were therefore terminated abnormally, hence the ‘failed’ state.

The only solution is to switch to xinetd, which produces no such warnings.

If you want to remain with systemd managing the socket, spend some time figuring out why check-mk-agent is taking as long as it does, and shorten the action. You can reduce the failings, but not eliminate them.

Thank you for taking the time to help me with my problem.

When I execute sudo check_mk_agent -v on the Host, the execution of the CheckMK Agent takes nowhere near 60s and many of my hosts with failed CheckMK Agents are inside our own LAN.
Better even, they are hosted on the same Hypervisor as the CheckMK server itself.

Given these facts, I do not quite understand how the CheckMK Active Check could have to wait for more than 60s.

Am I missing something here?

Is it because xinetd does not impose a 60s max wait_for_answer time or because the xinetd CheckMK Agent also fails but xinetd does not throw a warning?

Thanks a bunch for helping me understand :slight_smile:

Best regards,
pixelpoint

You do not get a warning for the instances that pass, only for the ones that fail because they take too long.

Add the following line at the end of the /usr/bin/check-mk-agent script to record how long each run took, and monitor the output file for large values.

  echo "$$ Z ${SECONDS}" >> /tmp/cmk
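
To watch the output file for large values, something like the following could be used. It is a sketch that assumes each line has the form "PID LETTER SECONDS", with the third field coming from bash's SECONDS variable; slow_runs and the 30-second default are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "PID SECONDS" for every finished run ('Z' line)
# that took at least the given number of seconds.
#   $1: timing log written by the echo line (e.g. /tmp/cmk)
#   $2: threshold in seconds (assumed default: 30)
slow_runs() {
    awk -v limit="${2:-30}" \
        '$2 == "Z" && $3 + 0 >= limit + 0 { print $1, $3 }' "$1"
}

# Typical use:
#   slow_runs /tmp/cmk 45
```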

Once you determine there is a problem, you can add lines at different points in the script (changing the letter) to identify which segment of the script is causing the problem for you.
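
Instead of repeating the echo line by hand, the marker could be wrapped in a small function. This is a sketch, not something the shipped agent script contains: cmk_mark and CMK_TIMING_LOG are made-up names, and the letters are arbitrary.

```shell
#!/bin/bash
# Hypothetical helper for the agent script. /tmp/cmk is the file used in
# this thread; CMK_TIMING_LOG is an assumed override for testing.
CMK_TIMING_LOG="${CMK_TIMING_LOG:-/tmp/cmk}"

cmk_mark() {
    # $1: a letter identifying the position in the script.
    # Writes "PID LETTER SECONDS"; SECONDS is maintained by bash and
    # counts the seconds since the shell (here: the script) started.
    echo "$$ $1 ${SECONDS}" >> "${CMK_TIMING_LOG}"
}

cmk_mark A   # near the top of the script
# ... first block of checks ...
cmk_mark M   # somewhere in the middle
# ... remaining checks ...
cmk_mark Z   # last line of the script
```

Comparing the SECONDS value of two neighbouring letters for the same PID shows how long the section between them took.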

Thank you very much, MatthewStier!

I’ll be sure to append this to the /usr/bin/check-mk-agent to find the reason behind this.

Best regards,
pixelpoint

I added echo "$$ Z ${SECOND}" >> /tmp/cmk to /usr/bin/check_mk_agent on a monitored server.
It does echo the PID and Z into /tmp/cmk but not the time.

After looking at the file (at least with grep) I cannot find any variable ${SECONDS}.

There is no file/script check-mk-agent (only /usr/bin/check_mk_agent) to be found on any of my monitored servers or even the Monitoringserver itself.

Have I added this to the wrong script or did I miss something?

What version of ‘bash’ is this system running?

Have you tried running “echo ${SECONDS}” on the command line?

SECONDS is a dynamic variable that reports the number of seconds since the shell (here, the script) started.
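
A quick way to see SECONDS in action is a tiny test script like the one below. It is only a demonstration; the variable name elapsed is arbitrary.

```shell
#!/bin/bash
# SECONDS is maintained by bash itself; assigning to it resets the counter.
SECONDS=0
sleep 2                      # stand-in for the work being timed
elapsed=${SECONDS}
echo "elapsed: ${elapsed}s"
```

Note the trailing S: an unset name such as SECOND just expands to an empty string, without any error.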

Did you use ‘SECOND’ or ‘SECONDS’?

Oops, thank you :sweat_smile:

Will this command not work with older versions of bash?
I just tested it on our oldest system, and it seems to work there too.

I hope you don’t mind but I want to clarify on that.
Why are there no such warnings under xinetd?
Does xinetd not throw an error upon taking too long or is there no “too long” with xinetd?
What would happen if a plugin or local check suddenly takes 2 minutes to execute when using xinetd?

Thank you for your time.

Best regards,
pixelpoint

I’m not aware of when it was introduced to bash. I know it works in Bash 3.00.15 installed on some of my RHEL4 systems.

As I mentioned, this is a ‘systemd’ warning. Xinetd does not make such checks. It may log them, but does not set system faults based upon them.

The same thing that happens under systemd: the server disconnects, and the client should exit. It may be logged, if you set up your logging to catch it.

Thank you very much.
I will mark your answer as solution :slight_smile:

Best regards,
pixelpoint

It’s more a testing method, than a solution.

A further suggestion: add an ‘A’ line at line 2, and an ‘M’ line between checks about halfway through the script (between blocks of check code).

If you only have a ‘Z’ line at the end of the script, and the script runs too long and gets terminated, you will have no evidence that it was run at all.
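
With both an ‘A’ and a ‘Z’ line in place, the terminated runs are the ones that logged an ‘A’ but never a ‘Z’. A sketch for pulling those out of the log, assuming lines of the form "PID LETTER SECONDS"; unfinished_runs is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: list the PIDs that wrote an 'A' marker but no 'Z'
# marker, i.e. runs that started but never reached the end of the script.
#   $1: timing log (e.g. /tmp/cmk)
unfinished_runs() {
    awk '
        $2 == "A" { started[$1] = 1 }
        $2 == "Z" { finished[$1] = 1 }
        END {
            for (pid in started)
                if (!(pid in finished))
                    print pid
        }
    ' "$1" | sort -n
}

# Typical use:
#   unfinished_runs /tmp/cmk
```

PIDs get reused over time, so the log should be rotated or cleared now and then. A run that is still in progress when the log is read will also show up.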

You are right, but as the topic’s “problem” seems to be nothing more than a symptom of different underlying problems, I guess a way to diagnose what exactly went wrong is the best “solution” there is.

Example:
I already identified the plugin mk_docker.py to be the problem on a few machines.
The reason for this was that our devs didn’t clear the docker cache on some dev VMs so there were countless containers and volumes with about 150 GB lying around.
The docker plugin tried to identify how much space could be reclaimed, and that took more than 60 seconds, which in turn resulted in the crash.
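
Once a single plugin is suspected, it can also be timed on its own. A minimal sketch, assuming bash; time_command is a made-up name, and the plugin path in the usage comment is an assumption about the default install location.

```shell
#!/bin/bash
# Hypothetical helper: run a command, discard its output, and print how
# many whole seconds it took (using bash's SECONDS counter).
time_command() {
    local begin=${SECONDS}
    "$@" > /dev/null 2>&1
    echo $(( SECONDS - begin ))
}

# Typical use (path is an assumption, adjust to your installation):
#   time_command /usr/lib/check_mk_agent/plugins/mk_docker.py
```

Anything that gets close to 60 seconds here is a candidate for the timeouts described above.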

As for your suggestion:
Thanks, that is exactly what I had in mind :slight_smile:

Best regards,
pixelpoint
