CheckMK loses connection to 1 particular server

Hi

I have a recurring issue with CheckMK losing the connection to one particular server (the rest work fine).
When I reinstall the agent on the target server, it works for a while (not always the same duration) and then it stops responding.

Any idea what could create such behaviour?
The target server runs Ubuntu Linux 18.04.6 (fully updated).

  • No firewall is active
  • ping works fine in both directions
  • The purpose of the Linux server is a VPN endpoint (if that is relevant)

Hi,

How are you connecting to the server? Is that via xinetd, the systemd socket or ssh? If it’s one of the first two, you may want to try restarting that service. Reinstalling the agent is a rather drastic solution.

Also, see if you can get a connection from the Check_MK server to the client on port 6556:

nc -v <your server IP> 6556

If that fails, try it from the client, using localhost for the server IP.
Louis.
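
A minimal sketch of scripting that check, in case it helps when the problem recurs. It assumes the agent sends its output unencrypted, so the first line is the <<<check_mk>>> section header. looks_like_agent_output is a hypothetical helper name, not part of Checkmk.

```shell
# Read agent output on stdin and succeed only if the first line is the
# "<<<check_mk>>>" section header (assumption: unencrypted agent output).
looks_like_agent_output() {
    first_line=""
    IFS= read -r first_line || true
    [ "$first_line" = "<<<check_mk>>>" ]
}

# Usage from the Checkmk server (not run here); -w sets a timeout in seconds:
#   nc -w 5 <your server IP> 6556 | looks_like_agent_output && echo "agent answered"
```

If nc connects but this check fails, the port is reachable while the agent itself is not answering, which would match the symptom described above.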

I am connecting to the target server via systemd.

How do I restart the service?

The nc command works.

Hi Steven,

You can restart the service by running:

systemctl restart check_mk.socket

Also, you may want to check if there are hanging agent processes:

ps auxwww | grep check_mk

If the last command returns several check_mk_agent processes, check whether some of them took a very long time to complete and are using up the maximum number of available sockets for the check_mk agent (I believe the default is 3). In addition,

systemctl --failed

should have no output.

Kind regards,
Louis.
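
A rough sketch for spotting agent processes that have been running for a long time. It filters ps output by elapsed time. The helper name long_running_agents and the 60-second default are assumptions made for this example, not Checkmk settings.

```shell
# Print "PID ELAPSED_SECONDS" for agent processes older than a threshold.
# Input: lines of "PID ELAPSED_SECONDS COMMAND..." on stdin.
# $1: threshold in seconds (default 60, an arbitrary value for this sketch).
long_running_agents() {
    awk -v limit="${1:-60}" \
        '(($2 + 0) > (limit + 0)) && ($0 ~ /check_mk_agent/) { print $1, $2 }'
}

# Usage on the monitored host (not run here):
#   ps -e -o pid= -o etimes= -o args= | long_running_agents 60
```

The threshold also keeps the awk process of the pipeline itself out of the result, since it has only just started.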

I restarted the service, which worked fine.
I don’t see any hanging processes with the command you provided
and for the last command, I get this output:

Ah, that probably explains it. Those failed units are taking up all the sockets. See if you can restart them:

systemctl restart check_mk@178........service

If all else fails:

systemctl reset-failed
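
If there are many failed instances, the unit names can be pulled out of the systemctl output instead of typing each one. This is only a sketch: failed_agent_unit_names is a hypothetical helper, and it assumes the failed units follow the check_mk@<instance>.service naming seen above.

```shell
# Print the names of failed check_mk@ instance units found in
# `systemctl --failed` style output read from stdin.
# Every field is scanned, so a leading status bullet does not matter.
failed_agent_unit_names() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^check_mk@.*\.service$/) print $i
    }'
}

# Usage on the monitored host, as root (not run here):
#   systemctl --failed --no-legend | failed_agent_unit_names |
#       while IFS= read -r unit; do systemctl reset-failed "$unit"; done
#   systemctl restart check_mk.socket
```

Resetting only the check_mk@ units leaves the failed state of unrelated services untouched, unlike a bare systemctl reset-failed.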

When I try to restart the service:

When I use the command “systemctl reset-failed”
image

What causes the problem in the first place? Could it be low resources on the VM itself?

Good question, to which I unfortunately don’t have the answer. But does the reset-failed solve the issue?

Well, it hasn’t started failing again.
As I stated in the first post, sometimes it takes 8 days before the communication fails and other times it takes 4 hours.

I will test this again when it fails again and I will report back then.

Thanks again for the help!

@Steven1 what Checkmk version are you using? I suggest updating to the latest patch release. There was a bug in some version where this could happen.

@robin.gierse 2.0.0p1, is there an update available?

The current patch release is 2.0.0p16 as of today. :slight_smile:
You might want to keep an eye on Announcements - Checkmk Community for this kind of information.

edit:
@fayepal pointed out two more options:

  1. Go to Topics tagged checkmk-release, on the upper right click the bell icon and choose ‘Watching’. That will send an e-mail for every new topic.
  2. If you prefer e-mail, go to the Checkmk-announce Info Page.
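
To compare an installed version against the current patch release, a small sketch follows. It assumes the agent output contains a "Version:" line and that both versions share the same base (for example 2.0.0). All three helper names are hypothetical.

```shell
# Extract the version from agent output on stdin (the "Version:" line).
agent_version() {
    awk '$1 == "Version:" { print $2; exit }'
}

# Print the patch number of a version such as "2.0.0p16".
# Prints 0 when there is no "pN" suffix.
patch_level() {
    case "$1" in
        *p[0-9]*) printf '%s\n' "${1##*p}" ;;
        *)        printf '0\n' ;;
    esac
}

# Succeed if $1 is an older patch release than $2.
# Assumption: both strings have the same base version.
is_older_patch() {
    [ "$(patch_level "$1")" -lt "$(patch_level "$2")" ]
}

# Usage on the monitored host (not run here):
#   installed=$(check_mk_agent | agent_version)
#   is_older_patch "$installed" 2.0.0p16 && echo "update available"
```

The patch number is compared numerically because a plain string comparison would sort p16 before p2.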

@robin.gierse Is there a straight cut upgrade path / guide?
Could you point me in the right direction please?

thanks @robin.gierse

Like and subscribe, if you appreciate my stuff. :wink:

I thought that was a youtube thing :smiley:


I see you got the reference, noice! :slight_smile:

Thanks @robin.gierse
