NetCologne-ITS
(Servicedesk - NetCologne IT Services GmbH)
1
Hello all,
As a service provider, we run CheckMK for several customers.
Today, at 2 of these customers, the agents stopped responding at the same time.
Both Linux and Windows agents on different OS versions are in use there.
If the agent is queried directly via “check_mk_agent.exe”, the agent responds immediately and correctly.
If you try to retrieve data via telnet localhost 6556, the connection is terminated before the output is complete.
Windows Firewall is disabled.
No plugins are configured. The vanilla Agent is installed.
On the CheckMK instance we have the following output on the agent check:
[agent] Fetcher for host “HDC” timed out after 120 seconds CRIT
[agent] Communication failed: [Errno 104] Connection reset by peer CRIT
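The telnet test can also be scripted, which makes it easier to see how many bytes actually arrive before the connection drops. The following is only a minimal sketch, assuming Python 3 is available on the affected host. The function name `probe_agent` and its defaults are made up for illustration; only port 6556 comes from this thread.

```python
import socket
import time


def probe_agent(host: str = "127.0.0.1", port: int = 6556, timeout: float = 10.0) -> dict:
    """Connect to an agent port and read until the peer closes the connection.

    Hypothetical helper: it mirrors `telnet localhost 6556` and reports how
    much data arrived and whether the connection ended with an error.
    """
    result = {"bytes": 0, "seconds": 0.0, "error": None, "head": b""}
    chunks = []
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            while True:
                chunk = sock.recv(65536)
                if not chunk:
                    # Empty read: the agent closed the connection normally.
                    break
                chunks.append(chunk)
    except OSError as exc:
        # Covers connection reset, connection refused and socket timeouts.
        result["error"] = f"{type(exc).__name__}: {exc}"
    data = b"".join(chunks)  # data received before an error is kept
    result["bytes"] = len(data)
    result["seconds"] = round(time.monotonic() - start, 3)
    result["head"] = data[:80]
    return result
```

Calling `probe_agent()` on the affected machine and `probe_agent("<agent host>")` from the monitoring server, then comparing `bytes` and `error`, should show whether the reset already happens locally or only across the network.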
Hi,
I am guessing you use the Enterprise or Managed edition, not Raw, right?
Did you try the “Service check timeout (Microcore)” setting and allow more minutes for the check? I fixed the same problem on my site, though on hosts using SNMP rather than agents; maybe it will help you too.
For the Windows agent, I would first take a look at the agent log file at “C:\ProgramData\checkmk\agent\log\check_mk.log”. It is quite possible that you will see there why the agent does not finish correctly.
NetCologne-ITS
(Servicedesk - NetCologne IT Services GmbH)
4
Thanks for this tip.
Yes we use the Managed Edition. You are correct.
But the problem is not the CheckMK server.
If I run telnet localhost 6556 on the affected system, I get incomplete output, if any at all.
Something is wrong with the agent/client system.
Best regards,
NCITS
NetCologne-ITS
(Servicedesk - NetCologne IT Services GmbH)
5
Hi Andreas,
Sadly, I cannot upload the log here right now (newly created account),
but I did not see anything suspicious when I looked into this file.
Is there something to look out for?
The log is almost the same as the output of the command “C:\Program Files (x86)\checkmk\service>check_mk_agent.exe check -self” and it ends with the following:
perf: Answer is ready in [2443] milliseconds
Send [30752] bytes of data
Received 30752 bytes
Leaving testing thread
These are the first 6 and the last 3 lines of the log file when I connect via telnet:
2021-11-05 09:02:31.832 [srv 6812] Connected from '127.0.0.1' ipv6 :false -> queue
2021-11-05 09:02:31.834 [srv 6812] Connected from '127.0.0.1' ipv6:false <- queue
2021-11-05 09:02:31.835 [srv 6812] [Warn ] OHM file 'C:\ProgramData\checkmk\agent\bin\OpenHardwareMonitorCLI.exe' is not found
2021-11-05 09:02:31.836 [srv 6812] Allowed Extensions: [checkmk.py,py,exe,bat,vbs,cmd,ps1]
2021-11-05 09:02:31.837 [srv 6812] [Trace] Left [36] files to execute
2021-11-05 09:02:31.845 [srv 6812] [Trace] Left [0] files to execute in 'plugins'
.....
2021-11-05 09:02:34.331 [srv 6812] Received [128] bytes from 'skype'
2021-11-05 09:02:34.333 [srv 6812] perf: Answer is ready in [2485] milliseconds
2021-11-05 09:02:34.334 [srv 6812] Send [33275] bytes of data
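To compare response time and payload size across many requests, the two relevant log lines can be pulled out with a short script. This is only a sketch, assuming the log keeps the wording shown in the excerpts above; the function name `summarize_agent_log` is made up for illustration.

```python
import re

# Patterns taken from the log excerpts in this thread.
ANSWER_RE = re.compile(r"Answer is ready in \[(\d+)\] milliseconds")
SEND_RE = re.compile(r"Send \[(\d+)\] bytes of data")


def summarize_agent_log(lines):
    """Return (milliseconds, bytes_sent) pairs, one per answered request."""
    results = []
    pending_ms = None
    for line in lines:
        match = ANSWER_RE.search(line)
        if match:
            pending_ms = int(match.group(1))
            continue
        match = SEND_RE.search(line)
        if match and pending_ms is not None:
            # Pair each "Send" line with the "Answer is ready" line before it.
            results.append((pending_ms, int(match.group(1))))
            pending_ms = None
    return results


sample = [
    "2021-11-05 09:02:34.333 [srv 6812] perf: Answer is ready in [2485] milliseconds",
    "2021-11-05 09:02:34.334 [srv 6812] Send [33275] bytes of data",
]
print(summarize_agent_log(sample))  # [(2485, 33275)]
```

If the agent logs a full "Send" line for every request but the client still receives less, the data is being lost after the agent has handed it to the network stack.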
If you see the same lines when you query the agent directly from your monitoring server, then something between the client and the monitoring server is interfering with the communication.
2.4 seconds and 30 kB of data look normal.
I see only one problem: your agent is not configured correctly. The log shows that it tries to use the OHM monitoring, which does not exist on this server, and that it tries to query the “skype” performance counters without finding anything. I would advise that you activate only the sections needed on this machine, or explicitly disable the ones you do not need.
On my Windows machines the global section with enabled sections looks like this.