Connectivity Issues with the Agent

Hello all,

We are a service provider and run CheckMK for several customers.
Today, at 2 of these customers, the agents stopped responding at the same time.
Both Linux and Windows agents on different OS versions are affected.
If the agent is queried directly via “check_mk_agent.exe”, it responds immediately and correctly.
If you try to retrieve the data via telnet localhost 6556, the connection is terminated before the output is complete.

CheckMK
<<check_mk>>
Version: 2.0.0p4
BuildDate: May 11 2021
AgentOS: windows
Hostname: HOSTNAME
Architecture: 64bit
WorkingDirectory: C:\Program Files (x86)\checkmk\service
ConfigFile: C:\Program Files (x86)\checkmk\service\check_mk.yml
LocalConfigFile: C:\ProgramData\checkmk\agent\check_mk.user.yml
AgentDirectory: C:\Program Files (x86)\checkmk\service
PluginsDirectory: C:\ProgramData\checkmk\agent\plugins
StateDirectory: C:\ProgramData\checkmk\agent\state
ConfigDirectory: C:\ProgramData\checkmk\agent\config
TempDirectory: C:\ProgramData\checkmk\agent\tmp
LogDirectory: C:\ProgramData\checkmk\agent\log
SpoolDirectory: C:\ProgramData\checkmk\agent\spool
LocalDirectory: C:\ProgramData\checkmk\agent\local
OnlyFrom:

Windows Firewall is disabled.
No plugins are configured. The vanilla Agent is installed.

On the CheckMK instance we have the following output on the agent check:
[agent] Fetcher for host “HDC” timed out after 120 seconds CRIT
[agent] Communication failed: [Errno 104] Connection reset by peer CRIT

Greetings from Germany
NCITS

Hi,
I am guessing you use the Enterprise or Managed edition, not Raw, right?
Did you try the setting "Service check timeout (Microcore)" and allow more minutes for the check? I fixed similar problems on my site, although on hosts using SNMP rather than agents. Maybe it will help you too.

Best regards,
JF

For the Windows agent I would first take a look at the agent log file “C:\ProgramData\checkmk\agent\log\check_mk.log”. It may well show why the agent does not finish correctly.

Thanks for this tip.
Yes, we use the Managed Edition. You are correct.
But the problem is not the CheckMK server.
If I run telnet localhost 6556 on the affected system, I get incomplete output, if any at all.
Something is wrong with the agent/client system.

Best regards,
NCITS

Hi Andreas,

Sadly I cannot upload the log here right now (newly created account),
but I did not see anything suspicious when I looked into this file.

Is there something to look out for?
The log is almost the same as the output of the command “C:\Program Files (x86)\checkmk\service>check_mk_agent.exe check -self”, and it ends with the following:

perf: Answer is ready in [2443] milliseconds
Send [30752] bytes of data
Received 30752 bytes
Leaving testing thread

These are the first 6 and the last 3 lines written to the log file when I connect via telnet:

2021-11-05 09:02:31.832 [srv 6812] Connected from '127.0.0.1' ipv6 :false -> queue
2021-11-05 09:02:31.834 [srv 6812] Connected from '127.0.0.1' ipv6:false <- queue
2021-11-05 09:02:31.835 [srv 6812] [Warn ] OHM file 'C:\ProgramData\checkmk\agent\bin\OpenHardwareMonitorCLI.exe' is not found
2021-11-05 09:02:31.836 [srv 6812] Allowed Extensions: [checkmk.py,py,exe,bat,vbs,cmd,ps1]
2021-11-05 09:02:31.837 [srv 6812] [Trace] Left [36] files to execute
2021-11-05 09:02:31.845 [srv 6812] [Trace] Left [0] files to execute in 'plugins'

.....

2021-11-05 09:02:34.331 [srv 6812] Received [128] bytes from 'skype'
2021-11-05 09:02:34.333 [srv 6812] perf: Answer is ready in [2485] milliseconds
2021-11-05 09:02:34.334 [srv 6812] Send [33275] bytes of data

Thanks for the Help
NCITS

If these lines are the same when you query the agent directly from your monitoring server, then something between the client and the monitoring server is affecting the communication.
2.4 seconds and about 30 kB of data look normal.
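
To make that comparison concrete, here is a minimal Python sketch that could replace the manual telnet test. It connects to the agent port, reads until the connection ends, and reports how many bytes arrived and how the connection ended. The function names probe_agent and last_sent_bytes are made up for this example, and the idea that the received byte count should match the "Send [N] bytes of data" log line is an assumption that only holds if the agent output is sent unencrypted and uncompressed.

```python
import re
import socket

AGENT_PORT = 6556  # port used in the telnet test in this thread


def probe_agent(host, port=AGENT_PORT, timeout=10.0):
    """Connect, read until the peer closes, and report what arrived.

    Returns (bytes_received, status), where status is one of
    "closed" (clean end of data), "reset", "refused" or "timeout".
    Other network errors (e.g. unknown host) are not handled here.
    """
    received = 0
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            while True:
                chunk = sock.recv(65536)
                if not chunk:  # empty read: the agent closed the connection
                    return received, "closed"
                received += len(chunk)
    except ConnectionResetError:
        return received, "reset"
    except ConnectionRefusedError:
        return received, "refused"
    except socket.timeout:
        return received, "timeout"


# Matches log lines such as "Send [33275] bytes of data"
SEND_RE = re.compile(r"Send \[(\d+)\] bytes of data")


def last_sent_bytes(log_text):
    """Return N from the last 'Send [N] bytes of data' line, or None."""
    matches = SEND_RE.findall(log_text)
    return int(matches[-1]) if matches else None
```

Running probe_agent("localhost") on the affected machine and probe_agent with the host's address from the monitoring server, then comparing both results with last_sent_bytes applied to the contents of check_mk.log, should show whether data is lost on the host itself or on the way to the server. A "reset" status with fewer bytes than the agent logged would correspond to the "Connection reset by peer" error seen on the CheckMK side.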

I see only one problem. Your agent is not configured correctly. The log shows that it tries to use OHM monitoring, which does not exist on this server, and that it queries the “skype” performance counters without finding anything. I would advise you to activate only the sections needed on this machine, or to explicitly disable the unneeded ones.

On my Windows machines the global section with enabled sections looks like this.

global:
  enabled: true
  async_script_execution: parallel
  sections:
    - check_mk
    - spool
    - plugins
    - local
    - winperf
    - uptime
    - systemtime
    - df
    - mem
    - services
    - dotnet_clrmemory
    - wmi_webservices
    - wmi_cpuload
    - ps
    - fileinfo
    - logwatch

All other sections are disabled if not needed.

Have you configured the IP whitelist for the agent? That would explain why you get a connection but no information.
