Two instances with the same Python issue - Error handling client : [Errno 104] Connection reset by peer

Elena.Dzhordzhilova · December 20, 2021, 10:21am

Dear Community,

I have two CheckMK instances, and today the same Python-related issue occurred on both. The web interface showed the following error:

[CheckMK Instance name] Livestatus Error Unhandled exception: 400: Site connection not initiated (Heartbeat timeout after 2.0 sec).

Inside the mkeventd.log file:
"
… StatusServer] Error handling client : [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/omd/sites/[Instance name]/lib/python/cmk/ec/main.py", line 3031, in serve
    "")
  File "/omd/sites/[Instance name]/lib/python/cmk/ec/main.py", line 3051, in handle_client
    for query in Queries(self, client_socket, self._logger):
  File "/omd/sites/[Instance name]/lib/python/cmk/ec/main.py", line 2511, in next
    data = self._socket.recv(4096)
error: [Errno 104] Connection reset by peer
"

The two instances run on different servers. I checked the firewall, but it was not the cause. In WATO > Distributed Monitoring everything looked fine and no errors appeared there. I also checked the logs for site backups and for activating changes, but no actions were performed there before the issue occurred.

In liveproxyd.log, events were logged when I restarted the instance, but they contain no specific information:

2021-12-20 08:35:22,842 [20] [cmk.liveproxyd.(3445011).Manager] Got signal 15. Initiating shutdown…
2021-12-20 08:35:22,863 [20] [cmk.liveproxyd.(3445011).Manager] Good bye.
2021-12-20 08:35:22,865 [20] [cmk.liveproxyd] Successfully shut down.
2021-12-20 08:35:47,909 [20] [cmk.liveproxyd] ----------------------------------------------------------
2021-12-20 08:35:47,910 [20] [cmk.liveproxyd] Livestatus Proxy-Daemon (1.6.0p19) starting…
2021-12-20 08:35:47,911 [20] [cmk.liveproxyd] Configured 0 sites

Could you please give me a hint as to what caused the problem? A restart fixes it, but that is not a long-term solution.

Thank you in advance!

Best Regards,
Elena

tosch · December 20, 2021, 12:31pm

Hi @Elena.Dzhordzhilova

This is usually a sign of something network related. Can you check whether the configured ports for liveproxy are reachable from each instance to the other, and vice versa (if it is running on both instances)?

Can you also please provide us the exact version and edition you are using?
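
For the reachability check, here is a minimal sketch that simply tries to open a TCP connection. The function name `port_reachable` is made up for illustration, and the host and port you pass in are whatever is configured for your sites (6557 is commonly used for Livestatus over TCP, but that is an assumption about your setup).

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests that the port
    accepts connections, not that Livestatus answers queries.
    """
    try:
        # create_connection resolves the host and connects with a timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors
        return False
```

Run it on each server against the other one, for example `port_reachable("other-site-host", 6557)`, where both values are placeholders for your own configuration.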

robin.gierse · December 20, 2021, 1:23pm

This could be an OOM victim. Can you check the hardware resources on your checkmk server? They might have been exhausted, causing parts of checkmk to be killed by the kernel.
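
One way to check for OOM kills is to scan the kernel log for the messages the kernel writes when it kills a process. A minimal sketch, assuming the usual "Out of memory: Killed process ..." wording (older kernels write "Kill process"); the function name `find_oom_kills` is made up for illustration, and the log file location depends on your distribution.

```python
import re
from typing import Iterable, List, Tuple

# Matches both "Out of memory: Killed process 123 (name)" and the
# older "Out of memory: Kill process 123 (name)" kernel messages.
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM kill found in the lines."""
    kills = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

You can pass it an open log file, for example `find_oom_kills(open("/var/log/messages", errors="replace"))`, assuming that is where your system writes kernel messages.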

Elena.Dzhordzhilova · December 20, 2021, 2:38pm

Hello, thank you both for your help!

I checked the logs of the machines, and in messages I see that a Python process was killed due to a lack of SWAP memory. I will investigate what caused this load.

Thank you once again!

robin.gierse · December 20, 2021, 3:59pm

If you are using SWAP, you have too little RAM.

Also, are both instances running on the same server?

You also might want to tweak the Apache settings in the global settings, so they do not eat up all your memory when there are too many requests from clients.
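
To see how much SWAP is actually in use, you can read /proc/meminfo on the server. A rough sketch, assuming a Linux host; the helper names `parse_meminfo` and `swap_used_kb` are made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field name: integer value}.

    Most values are in kB; a few fields (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Name:   12345 [kB]" lines
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo: dict) -> int:
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
```

For example, `swap_used_kb(parse_meminfo(open("/proc/meminfo").read()))` gives the current usage; a value that keeps growing under normal load suggests the machine needs more RAM.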

Elena.Dzhordzhilova · December 20, 2021, 5:01pm

Thank you for the advice! I decreased the number of Apache processes and will monitor it over the next few days!

The instances are on different servers. I looked a bit more closely at the service availability of both instances and saw that one of them had an alarm for Memory, while the other had only a critical event for OMD performance. Both servers have the same entries in their logs.
