How to set up check_mk_caching_agent with cmk-agent-ctl?

Hi, I have a setup with two monitoring hosts running in parallel for HA. Is it possible to configure cmk-agent-ctl to use the caching variant (via the check_mk_caching_agent wrapper) on Linux, or somehow on Windows?
I have problems overnight, when backups with heavy network transfer loads cause Check_MK service timeouts on some systems. Maybe caching can help with this…
Thanks

I discovered problems with clustered host setups with more than three clustered hosts. The agent controller cmk-agent-ctl reports:

WARN [cmk_agent_ctl::modes::pull] [::ffff:<cmk-server-ip>]:<port>: Request failed. (Too many active connections)

The service fails with

[agent] Empty output from host <host-ip>:6556(!!), Got no information from host(!!), execution time 0.0 sec

This can be reproduced when the Check_MK services on all of these clustered hosts are rescheduled at the same time. The requests then arrive together and a limit is reached.
I assumed that something like a max_connections parameter in cmk-agent-ctl.toml could solve this, but I had no success with it. Is there any documentation of the possible parameters of the cmk-agent-ctl.toml file?
Without this I will be forced to go back to xinetd, which works without problems. 🙁
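To get a feel for how often the controller actually rejects connections, the warning can be counted in its log output. This is a minimal sketch, assuming the daemon logs to the systemd journal under the unit name cmk-agent-ctl-daemon.service; `count_rejected` is a hypothetical helper name, not part of Checkmk.

```shell
# Count lines on stdin that contain the controller's rejection warning.
count_rejected() {
    grep -c 'Too many active connections'
}

# Example (run on the monitored host):
#   journalctl -u cmk-agent-ctl-daemon.service --since today | count_rejected
```

A count that rises only around manual reschedules points to simultaneous requests rather than a permanently overloaded agent.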

I will try to dig into the cmk-agent-ctl source code, but I have no Rust skills…

A weird workaround for now, for hosts that belong to several clustered hosts:

# /etc/systemd/system/cmk-agent-ctl-daemon.service.d/override.conf
[Service]
Environment="DEBUG_MAX_CONNECTIONS=16"
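
A drop-in like the one above only takes effect after systemd reloads its unit files and the daemon is restarted. The following is a minimal sketch of those steps; `write_override` and `apply_override` are hypothetical helper names, and DEBUG_MAX_CONNECTIONS is, judging by its name, a debugging knob rather than a documented setting, so treat it as a stopgap.

```shell
# Write the drop-in from the post into the given directory
# (defaults to the path used in the post).
write_override() {
    dropin_dir="${1:-/etc/systemd/system/cmk-agent-ctl-daemon.service.d}"
    max_conn="${2:-16}"
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nEnvironment="DEBUG_MAX_CONNECTIONS=%s"\n' "$max_conn" \
        > "$dropin_dir/override.conf"
}

# Make systemd pick up the drop-in and restart the agent controller.
apply_override() {
    systemctl daemon-reload &&
    systemctl restart cmk-agent-ctl-daemon.service
}
```

Run as root, `write_override && apply_override` applies the change; `systemctl show cmk-agent-ctl-daemon.service -p Environment` should then list the variable.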

I have started to investigate the "Too many active connections" problem.

@zito I am trying to reproduce the problem. Which of the rules are you using when you get this problem:

  • Clustered services for overlapping clusters
  • Clustered services

Okay, we now understand the problem.

You have to make sure that "Maximum cache file age for clusters" is set to a larger value than the largest check interval of the Check_MK service of all involved cluster nodes.

If this is the case, all Check_MK checks of the cluster hosts can always use the agent cache of the cluster nodes.
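
The rule above boils down to a single comparison: the cache age must exceed every node's check interval. The following is a small sketch for sanity-checking planned values before changing the rule. `cache_age_ok` is a hypothetical helper, and the numbers are illustrative values in seconds, not read from any Checkmk configuration.

```shell
# Succeeds if the cluster cache age is larger than every given
# check interval (all values in seconds).
cache_age_ok() {
    cache_age="$1"; shift
    for interval in "$@"; do
        [ "$cache_age" -gt "$interval" ] || return 1
    done
    return 0
}

# Cache age 180 s against node check intervals of 60 s and 120 s.
if cache_age_ok 180 60 120; then
    echo "ok"
else
    echo "cache age too small"
fi
```

Note that the comparison is strict: a cache age equal to the largest interval does not satisfy "larger than" and is rejected.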

If you use "Reschedule active checks" for all your Check_MK services and do not spread it over N minutes (where N <> 0), this can increase the number of issues.

Please also be aware that if you use port 6556 in the host check command or in an additional check_tcp, this will also “consume” one active connection.
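
As a rough way to reason about the points above, the worst-case number of simultaneous connections to one node can be estimated. This is a simplified model and an assumption, not taken from the cmk-agent-ctl source: one connection for the node's own Check_MK service, one per clustered host that cannot use the cache, and one per TCP probe against port 6556. `peak_connections` is a hypothetical helper.

```shell
# Worst-case simultaneous connections to one node's agent controller,
# under the simplified model described above.
peak_connections() {
    clusters="$1"      # clustered hosts this node belongs to
    tcp_checks="$2"    # host check / check_tcp probes against port 6556
    echo $(( 1 + clusters + tcp_checks ))
}

limit=16   # value from the DEBUG_MAX_CONNECTIONS override earlier in the thread
peak="$(peak_connections 4 1)"
if [ "$peak" -gt "$limit" ]; then
    echo "peak $peak exceeds limit $limit"
else
    echo "peak $peak fits within limit $limit"
fi
```

With a sufficiently large cluster cache age, the per-cluster term drops out in normal operation, which is why the cache setting matters more than raising the limit.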


Dear Jodok, thank you very much for your reply! I missed this fundamental point. I have my check_interval / retry_interval set to 120 seconds, so I have now set "Maximum cache file age for clusters" to 180 seconds accordingly. Now things make sense. 🙂
Thank you very much!