CheckMK loses connection to 1 particular server

Steven1 · November 22, 2021, 2:07pm

Hi

I have a recurring issue with CheckMK losing its connection to one particular server (the rest work fine).
When I reinstall the agent on the target server, it works for a while (the duration varies) and then stops responding.

Any idea what could cause this behaviour?
The target server is a “Ubuntu Linux 18.04.6” (fully updated)

No firewall is active
ping works fine in both directions
The Linux server is a VPN endpoint (in case that is relevant)

louis · November 22, 2021, 2:52pm

Hi,

How are you connecting to the server: via xinetd, the systemd socket, or SSH? If it’s one of the first two, try restarting that service. Reinstalling the agent is a rather drastic solution.

Also, see if you can get a connection from the Check_MK server to the client on port 6556:

nc -v <your server IP> 6556

If that fails, try it from the client, using localhost for the server IP.
Louis.
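
In case it helps to repeat that check later, here is a minimal sketch that wraps it in a function. The name check_agent_port is made up for this example, and the -z and -w options are assumed to be supported by the installed netcat variant (they are in the common OpenBSD and traditional versions).

```shell
# Hypothetical helper: report whether a TCP port accepts connections.
# Usage: check_agent_port HOST [PORT]
check_agent_port() {
    host=$1
    port=${2:-6556}   # 6556 is the agent port used in the nc command above
    # -z: only probe the port without sending data
    # -w 5: give up after 5 seconds
    if nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
        echo "OK: $host:$port is reachable"
    else
        echo "FAIL: $host:$port is not reachable"
        return 1
    fi
}
```

Run it from the Check_MK server with the client's IP as the argument, and from the client itself as check_agent_port localhost.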

Steven1 · November 22, 2021, 3:06pm

I am connecting to the target server via systemd.

How do I restart the service?

the nc command works:

louis · November 22, 2021, 3:12pm

Hi Steven,

You can restart the service by running:

systemctl restart check_mk.socket

Also, you may want to check if there are hanging agent processes:

ps auxwww | grep check_mk

If that command returns several check_mk_agent processes, check whether some of them are taking a very long time to complete and using up the maximum number of simultaneous connections the socket allows (I believe the default is 3).

Also check for failed units. This command should list none:

systemctl --failed

Kind regards,
Louis.
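
A rough sketch of the process check as a reusable function, in case you want to run it from cron or by hand. The function names are made up, and the limit of 3 is only the value I guessed above, not a confirmed default.

```shell
# Assumption: limit taken from the "I believe the default is 3" remark above.
MAX_AGENT_PROCS=3

# Reads `ps` output on stdin and prints the number of check_mk_agent lines.
# The [c] bracket keeps grep's own command line from matching the pattern.
count_agent_procs() {
    grep -c '[c]heck_mk_agent' || true   # grep exits 1 on zero matches
}

# Reads `ps` output on stdin and prints a one-line verdict.
report_agent_procs() {
    count=$(count_agent_procs)
    if [ "$count" -ge "$MAX_AGENT_PROCS" ]; then
        echo "WARN: $count agent processes (limit assumed to be $MAX_AGENT_PROCS)"
    else
        echo "OK: $count agent processes"
    fi
}
```

Usage: ps auxwww | report_agent_procs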

Steven1 · November 22, 2021, 3:17pm

I restarted the service, which worked fine.
I don’t see any hanging processes with the command you provided
and for the last command, I get this output:

louis · November 22, 2021, 3:19pm

Ah, that probably explains it: the failed units are taking up all the sockets. Try restarting them:

systemctl restart check_mk@178........service

If all else fails:

systemctl reset-failed
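
If you would rather clear only the agent's failed units than everything, something like the following might work. It is a sketch, not tested on your system: the function names are made up, and it assumes the failed units all follow the check_mk@....service naming seen above.

```shell
# Reads `systemctl --failed --no-legend --plain` output on stdin and prints
# the names of failed per-connection agent units (check_mk@....service).
failed_agent_units() {
    awk '$1 ~ /^check_mk@.*\.service$/ { print $1 }'
}

# Clears the failed state of each unit found, one at a time (needs root).
reset_failed_agent_units() {
    systemctl --failed --no-legend --plain | failed_agent_units |
        while read -r unit; do
            systemctl reset-failed "$unit"
        done
}
```

To preview which units would be affected without changing anything: systemctl --failed --no-legend --plain | failed_agent_units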

Steven1 · November 22, 2021, 3:28pm

When I try to restart the service:

When I use the command “systemctl reset-failed”

Where does the problem begin? Is it low resources of the VM itself?

louis · November 22, 2021, 3:31pm

Good question, to which I unfortunately don’t have the answer. But does the reset-failed solve the issue?

Steven1 · November 22, 2021, 3:34pm

Well, it hasn’t failed again so far.
As I mentioned in the first post, the time to failure varies: sometimes it takes 8 days before communication fails, other times only 4 hours.

I will test this again when it fails again and I will report back then.

Thanks again for the help!

robin.gierse · November 24, 2021, 12:32pm

@Steven1 what Checkmk version are you using? I suggest updating to the latest patch release. There was a bug in some version where this could happen.

Steven1 · November 24, 2021, 1:13pm

@robin.gierse 2.0.0p1, is there an update available?

robin.gierse · November 24, 2021, 1:16pm

The current patch release is 2.0.0p16 as of today.
You might want to keep an eye on Announcements - Checkmk Community for this kind of information.

edit:
@fayepal pointed out two more options:

Go to Topics tagged checkmk-release, on the upper right click the bell icon and choose ‘Watching’. That will send an e-mail for every new topic.
If you prefer a mailing list, go to the Checkmk-announce Info Page.

Steven1 · November 24, 2021, 1:20pm

@robin.gierse Is there a straight cut upgrade path / guide?
Could you point me in the right direction please?

robin.gierse · November 24, 2021, 1:40pm

Steven1 · November 24, 2021, 1:52pm

thanks @robin.gierse

robin.gierse · November 24, 2021, 1:53pm

Like and subscribe, if you appreciate my stuff.

Steven1 · November 24, 2021, 1:53pm

I thought that was a youtube thing

robin.gierse · November 24, 2021, 1:55pm

I see you got the reference, noice!

Steven1 · November 25, 2021, 9:33am

Thanks @robin.gierse
