Can't communicate with the agent

Hi,

I’ve been trying to solve this since yesterday. I am out of ideas…
This is a fairly new install of Checkmk server 2.2.0p6 RAW on Rocky Linux 9.2, monitoring several SNMP-based devices with no issues. However, this is the first installation of the agent, also on Rocky Linux 9.2. The output of tcpdump on both servers shows that packets are flowing through, but I can’t get any result when running connection tests on the server. Details are provided below:

CMK version:
2.2.0p6
OS version:
Rocky Linux release 9.2 (Blue Onyx)

Error message when trying to run connection tests:
API Error: Error running automation call diag-host: Your request timed out after 110 seconds. This issue may be related to a local configuration problem or a request which works with a too large number of objects. But if you think this issue is a bug, please send a crash report.

**Output of “cmk-agent-ctl -vv daemon”:**

INFO [cmk_agent_ctl] starting
INFO [cmk_agent_ctl] Loaded config from '"/var/lib/cmk-agent/cmk-agent-ctl.toml"', connection registry from '"/var/lib/cmk-agent/registered_connections.json"'
INFO [cmk_agent_ctl::modes::daemon] Could not load pre-configured connections from "/var/lib/cmk-agent/pre_configured_connections.json": No such file or directory (os error 2)
DEBUG [cmk_agent_ctl::misc] Sleeping 35s to avoid DDOSing of sites
DEBUG [cmk_agent_ctl::misc] Sleeping 4s to avoid DDOSing of sites
INFO [cmk_agent_ctl::modes::pull] Start listening for incoming pull requests
INFO [cmk_agent_ctl::modes::pull] Listening on [::]:6556 for incoming pull connections (IPv6 & IPv4 if activated)
DEBUG [cmk_agent_ctl::modes::renew_certificate] Checking registered connections for certificate expiry.
INFO [cmk_agent_ctl::modes::pull] [::ffff:1]:17256: Handling pull request.
DEBUG [cmk_agent_ctl::modes::pull] [::ffff:1]:17256: Handling pull request DONE (Task detached).
DEBUG [cmk_agent_ctl::modes::pull] handle_request starts
DEBUG [cmk_agent_ctl::modes::pull] processed task!
WARN [cmk_agent_ctl::modes::pull] [::ffff:1]:17256: Request failed. (deadline has elapsed)
INFO [cmk_agent_ctl::modes::pull] [::ffff:11]:18951: Handling pull request.
DEBUG [cmk_agent_ctl::modes::pull] [::ffff:11.238]:18951: Handling pull request DONE (Task detached).
DEBUG [cmk_agent_ctl::modes::pull] handle_request starts
DEBUG [cmk_agent_ctl::modes::pull] processed task!
WARN [cmk_agent_ctl::modes::pull] [::ffff:1.238]:18951: Request failed. (deadline has elapsed)

**Output of “cmk-agent-ctl status”:**

Version: 2.2.0p6
Agent socket: operational
IP allowlist: any


Connection: 192.0.2.238/central
        UUID: e23d119b-d06d-492a-9449-04e6a9bb7db0
        Local:
                Connection mode: pull-agent
                Connecting to receiver port: 8000
                Certificate issuer: Site 'central' agent signing CA
                Certificate validity: Mon, 17 Jul 2023 07:41:12 +0000 - Mon, 17 Jul 2028 07:41:12 +0000
        Remote:
                Connection mode: pull-agent
                Hostname: cns1.domain.local

Thank you,
Eddi

What happens if you execute “check_mk_agent” manually on the command line on the problem machine?
Is it working without stopping in between?
Runtime should not be more than one or two seconds.
Your agent controller log looks like a problem with executing the agent.

<<<check_mk>>>
Version: 2.2.0p6
AgentOS: linux
Hostname: cns1.domain.local
AgentDirectory: /etc/check_mk
DataDirectory: /var/lib/check_mk_agent
SpoolDirectory: /var/lib/check_mk_agent/spool
PluginsDirectory: /usr/lib/check_mk_agent/plugins
LocalDirectory: /usr/lib/check_mk_agent/local
FailedPythonReason: 
SSHClient: 192.0.2.250 39076 22
<<<cmk_agent_ctl_status:sep(0)>>>
{"version":"2.2.0p6","agent_socket_operational":true,"ip_allowlist":[],"allow_legacy_pull":false,"connections":[{"site_id":"192.0.2.238/central","receiver_port":8000,"uuid":"e23d119b-d06d-492a-9449-04e6a9bb7db0","local":{"connection_mode":"pull-agent","cert_info":{"issuer":"Site 'central' agent signing CA","from":"Mon, 17 Jul 2023 07:41:12 +0000","to":"Mon, 17 Jul 2028 07:41:12 +0000"}},"remote":"remote_query_disabled"}]}

I’ve just noticed the "remote":"remote_query_disabled" entry.
Could this be the reason why I can’t query the agent?

No, same output here. Please try:

time cmk-agent-ctl dump

Is it running longer than 60 seconds?
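If you want to repeat this timing check from a script, here is a minimal Python sketch. The helper name `time_command` and its `limit_seconds` parameter are made up for illustration and are not part of Checkmk. The 60-second default is just the threshold asked about above.

```python
import subprocess
import time


def time_command(argv, limit_seconds=60.0):
    """Run argv, discard its output, and return (elapsed_seconds, exceeded_limit).

    Hypothetical helper for illustration; limit_seconds defaults to the
    60 seconds asked about in this thread.
    """
    start = time.monotonic()
    # check=False: a non-zero exit code should not hide the timing result.
    subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    elapsed = time.monotonic() - start
    return elapsed, elapsed > limit_seconds
```

On the monitored host this could be called as `time_command(["cmk-agent-ctl", "dump"])`. It raises `FileNotFoundError` if the command is not on the PATH.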

For the sake of transparency: I edited the IPs to show the TEST-NET-1 range. I hope this is OK.

real    0m2.787s
user    0m0.010s
sys     0m0.015s

It’s ok, those are edited IPs and domain name anyway :grinning:

So, when running

telnet 192.0.2.123 6556

from the Checkmk server, do you get any output (you might have to press Enter twice)? The output should be either “16” or agent output.
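If telnet is not installed, roughly the same check can be sketched in Python. The function names below are hypothetical. The sketch assumes the port answers without any input being sent, so a "timeout" result is not conclusive on its own (with telnet you may need to press Enter first). It only distinguishes the two replies mentioned above: “16” or raw agent output, which starts with a section header such as `<<<check_mk>>>`.

```python
import socket


def classify_first_bytes(data: bytes) -> str:
    """Classify the first bytes received from the agent port."""
    if not data:
        return "no data"
    if data.startswith(b"16"):
        return "16"  # the "16" reply described above
    if data.startswith(b"<<<"):
        return "agent output"  # e.g. the <<<check_mk>>> section header
    return "unknown"


def probe_agent(host: str, port: int = 6556, timeout: float = 5.0) -> str:
    """Connect to host:port, read a few bytes, and classify them.

    Assumption: nothing is sent to the agent; we only wait for a reply.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(16)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"connection failed: {exc}"
    return classify_first_bytes(data)
```

Run from the Checkmk server, `probe_agent("192.0.2.123")` would then return one of the labels above.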

Escape character is '^]'.
16

That’s perfect. Now, as the site user on the Checkmk server (after omd su mysite), run:

time cmk -d example.com

Guys, I am really sorry for wasting your time. I found the problem, and it was kind of tricky.
The IP address of the CMK server had apparently been used by a test box previously, and the router still had a policy-based routing entry for it that changed the packets’ next hop to the Palo Alto firewall, which in turn intercepted and blocked the agent’s traffic…
Thank you for taking the time to try to help me!

Haha. I am also learning here. We doc people usually just use funny test nets where everything works as intended. So: Everything is fine.