Agent execution time slowness

CMK version: 2.4.0p24.cre
OS version: Debian 13.4

Error message: N/A

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

value store: loading from disk
Checkmk version 2.4.0p24

  • FETCHING DATA
    Source: SourceInfo(hostname='mytarget', ipaddress='target_private_ip', ident='agent', fetcher_type=<FetcherType.TCP: 8>, source_type=<SourceType.HOST: 1>)
    [cpu_tracking] Start [7f24d94e0230]
    Read from cache: AgentFileCache(path_template=/omd/sites/testsite/tmp/check_mk/cache/mytarget, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=6)
    Not using cache (Too old. Age is 44 sec, allowed is 0 sec)
    Connecting via TCP to target_private_ip:6556 (5.0s timeout)
    Detected transport protocol: TransportProtocol.PLAIN
    Reading data from agent
    Closing TCP connection to target_private_ip:6556
    Write data to cache file /omd/sites/testsite/tmp/check_mk/cache/mytarget
    Trying to acquire lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
    Got lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
    Releasing lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
    Released lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
    [cpu_tracking] Stop [7f24d94e0230 - Snapshot(process=posix.times_result(user=0.010000000000000231, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.44999999925494194))]
    [cpu_tracking] Start [7f24d939afc0]
  • PARSE FETCHER RESULTS
    <<<check_mk>>> / Transition NOOPParser → HostSectionParser
    <<<checkmk_agent_plugins_lnx:sep(0)>>> / Transition HostSectionParser → HostSectionParser
    <<<labels:sep(0)>>> / Transition HostSectionParser → HostSectionParser
    <<<df_v2>>> / Transition HostSectionParser → HostSectionParser
    <<<df_v2>>> / Transition HostSectionParser → HostSectionParser
    <<<systemd_units>>> / Transition HostSectionParser → HostSectionParser
    <<<nfsmounts_v2:sep(0)>>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<<ps_lnx>>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<<lnx_if>>> / Transition HostSectionParser → HostSectionParser
    <<<lnx_if:sep(58)>>> / Transition HostSectionParser → HostSectionParser
    <<<tcp_conn_stats>>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<>> / Transition HostSectionParser → HostSectionParser
    <<<vbox_guest>>> / Transition HostSectionParser → HostSectionParser
    <<<postfix_mailq>>> / Transition HostSectionParser → HostSectionParser
    <<<postfix_mailq_status:sep(58)>>> / Transition HostSectionParser → HostSectionParser
    <<<local:sep(0)>>> / Transition HostSectionParser → HostSectionParser
    HostKey(hostname='mytarget', source_type=<SourceType.HOST: 1>) → Add sections: ['check_mk', 'checkmk_agent_plugins_lnx', 'cifsmounts', 'cpu', 'df_v2', 'diskstat', 'kernel', 'labels', 'lnx_if', 'local', 'md', 'mem', 'mounts', 'nfsmounts_v2', 'postfix_mailq', 'postfix_mailq_status', 'ps_lnx', 'systemd_units', 'tcp_conn_stats', 'uptime', 'vbox_guest']
    Received no piggyback data
    CPU load 15 min load: 0.61, 15 min load per core: 0.10 (6 cores)
    CPU utilization Total CPU: 11.45%
    Check_MK Agent Version: 2.4.0p24, OS: linux, Agent plug-ins: 0, Local checks: 0
    Disk IO SUMMARY PEND Initializing counters
    Filesystem / PEND Counter '/.delta' has been initialized. Result available on second check execution., Used: 90.66% - 537 GiB of 592 GiB (warn/crit at 80.00%/90.00% used)
    Interface 2 [eth0], (up), MAC: 00:50:56:BE:38:E0, Speed: 10 GBit/s
    Kernel Performance PEND Counter 'processes' has been initialized. Result available on second check execution.
    Memory Total virtual memory: 17.07% - 3.37 GiB of 19.8 GiB, 10 additional details available
    Mount options of / Mount options exactly as expected
    Number of threads 183, Usage: 0.19%
    Postfix Queue default Deferred queue length: 36 (warn/crit at 10/20)(!!), Active queue length: 0
    Postfix status default Status: the Postfix mail system is running, PID: 1344
    Systemd Service Summary Total: 79, Disabled: 3, Failed: 0
    Systemd Socket Summary Total: 9, Disabled: 1, Failed: 0
    TCP Connections Established: 14
    Uptime Up since 2026-04-03 11:19:34, Uptime: 18 days 23 hours
    [cpu_tracking] Stop [7f24d939afc0 - Snapshot(process=posix.times_result(user=0.18999999999999995, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.18000000342726707))]
    [agent] Success, execution time 0.6 sec | execution_time=0.630 user_time=0.200 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.440

Hello,

I’m facing extreme slowness in the agent checks. My setup is as follows:

  • a single site on a single monitoring server (a VM with 4 CPUs and 8 GB of RAM)
  • about 120 hosts monitored, exclusively via the Checkmk agent. No custom plugins, just the standard agent and its auto-discovered services.
  • all monitored targets are set up with a private IPv4 address (no hostname resolution)

This is a rather long post, so I’ll first describe the problem, then what I tried to solve it.

The problem

Checking an agent is very slow when performed by Checkmk, although the agent itself responds quickly. Here is an example:

OMD[testsite]:~$ time echo | nc target_private_ip 6556
<<<check_mk>>>
Version: 2.4.0p24
AgentOS: linux
Hostname: targethostname
AgentDirectory: /etc/check_mk
DataDirectory: /var/lib/check_mk_agent
SpoolDirectory: /var/lib/check_mk_agent/spool
PluginsDirectory: /usr/lib/check_mk_agent/plugins
LocalDirectory: /usr/lib/check_mk_agent/local
OSType: linux
OSName: Debian GNU/Linux
[....]


real    0m0.413s
user    0m0.011s
sys     0m0.020s

So the execution time with netcat is 0.413 s. The same host checked by Checkmk:

OMD[testsite]:~$ time cmk --check targethostname
[agent] Success, execution time 0.6 sec | execution_time=0.630 user_time=0.210 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.420

real    0m5.984s
user    0m5.100s
sys     0m0.472s

cmk takes 5.984 s.

I understand that cmk does a bit more than just fetch the values, but this is still almost 15 times the data fetching time. These slow checks add up, and the services become stale or simply time out, sending alerts when the monitored hosts are actually up and running.
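To put the two timings side by side, here is a minimal sketch that parses the performance data printed by cmk and compares it with the wall-clock time reported by `time`. The helper name `parse_perfdata` is hypothetical; all numbers are copied from the output above.

```python
def parse_perfdata(line: str) -> dict[str, float]:
    """Parse the 'key=value' pairs that follow the '|' in a cmk summary line."""
    _, _, perf = line.partition("|")
    return {key: float(value)
            for key, value in (item.split("=", 1) for item in perf.split())}


summary = (
    "[agent] Success, execution time 0.6 sec | execution_time=0.630 "
    "user_time=0.210 system_time=0.000 children_user_time=0.000 "
    "children_system_time=0.000 cmk_time_agent=0.420"
)
perf = parse_perfdata(summary)

wall_clock = 5.984  # 'real' from `time cmk --check targethostname`
netcat = 0.413      # 'real' from the netcat run

# Time spent outside the check's own measured execution time
overhead = wall_clock - perf["execution_time"]
ratio = wall_clock / netcat

print(f"time outside the measured check: {overhead:.3f} s")
print(f"cmk wall clock vs. netcat: {ratio:.1f}x")
```

Read this way, roughly 5.35 s of the 5.98 s are spent outside the 0.63 s that the check itself reports, which suggests that startup cost rather than fetching or checking dominates.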

The diagnosis steps

Initially, the server had 2 cores and 4 GB of RAM. This VM is freshly installed and dedicated to this single task. Checkmk was installed through the provided Docker image. Nothing else runs on it (except for another Docker container acting as a reverse proxy for HTTPS). The 1-minute load on the host server was reaching 200 at every check cycle. The 15-minute load was in the two-digit range.

htop shows tens of python -P /omd/sites/mysite/var/check_mk/core/helper_config/latest/host_checks/mytarget processes piling up (one for each monitored node), each using 75 to 100% CPU. docker stats reported a CPU usage of 500% for the cmk container.

Whether the HW/SW inventory is enabled or disabled makes no difference.

I changed the service check interval from the default 1 minute to 5 minutes. Not much has changed: the server now crumbles every 5 minutes instead of every minute.

The server was bumped to 4 cores and 8 GB of RAM. It still shows unusually high loads (the 15-minute load is still in the two-digit range).

I’ve removed the TLS registration to query the hosts in plain text and avoid the whole cryptography overhead. No change.

I’ve reinstalled check_mk through apt instead of the docker image. No luck.

I’ve stopped the production site with its 120-odd hosts and created a test site with only 2 hosts. The execution times posted in “The problem” section are from this bare install: a single site with 2 hosts, communicating in plain text.

/proc/pressure/memory, io and cpu on the host no longer report any issues with the 2-host setup.
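For anyone wanting to repeat this check, the following is a small sketch that reads the Linux pressure stall information files and turns each line into a dictionary. The function name `parse_psi` and the sample values are made up for illustration; the line format (`some`/`full` followed by `avg10`, `avg60`, `avg300` and `total`) is the one the kernel uses.

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse the content of a /proc/pressure/* file into nested dicts."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()  # kind is 'some' or 'full'
        result[kind] = {key: float(value)
                        for key, value in (f.split("=", 1) for f in fields)}
    return result


# Made-up sample in the kernel's format, so the parser can be tried anywhere
sample = (
    "some avg10=0.00 avg60=0.12 avg300=0.05 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
parsed = parse_psi(sample)
print(parsed["some"]["avg60"])

# On a Linux host with PSI enabled, read the real files
for resource in ("cpu", "memory", "io"):
    path = Path("/proc/pressure") / resource
    try:
        print(resource, parse_psi(path.read_text()))
    except OSError:
        print(resource, "not available on this system")
```

Values close to zero in the `avg` fields mean that tasks were rarely stalled waiting for that resource over the given window.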

I’ve used cProfile to try to understand what’s taking Checkmk so long. The report is too long to post in its entirety, and as a new forum user I cannot attach files to this post. I’m no developer by any means, so I’m not sure how to interpret this, or even if it’s relevant. Here is an excerpt of the output of
su - testsite -c "python3 -m cProfile -s time /omd/sites/testsite/bin/cmk --check mytarget"

[agent] Success, execution time 0.5 sec | execution_time=0.520 user_time=0.100 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.420
3203148 function calls (2982729 primitive calls) in 6.381 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
18    0.413    0.023    0.413    0.023 {method 'recv' of '_socket.socket' objects}
5414/1    0.364    0.000    6.489    6.489 {built-in method builtins.exec}
300    0.306    0.001    0.372    0.001 _core_utils.py:581(validate_core_schema)
1    0.292    0.292    0.292    0.292 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
919    0.232    0.000    0.317    0.000 _python_plugins.py:228(_collect_module_plugins)
17226    0.228    0.000    0.248    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
2823    0.224    0.000    0.224    0.000 {built-in method marshal.loads}
9167    0.217    0.000    0.268    0.000 validatedstr.py:36(__new__)
8220    0.103    0.000    0.104    0.000 inspect.py:3076()
374568/370097    0.098    0.000    0.109    0.000 {built-in method builtins.isinstance}
4089    0.097    0.000    0.097    0.000 {built-in method _io.open_code}
5654    0.089    0.000    0.089    0.000 {method 'read' of '_io.BufferedReader' objects}
9533    0.088    0.000    0.088    0.000 {built-in method posix.stat}
150067    0.068    0.000    0.077    0.000 {built-in method builtins.getattr}
300    0.063    0.000    0.064    0.000 {built-in method pydantic_core._pydantic_core.validate_core_schema}
3730/3621    0.062    0.000    1.662    0.000 {built-in method builtins.__build_class__}
1    0.059    0.059    0.059    0.059 {built-in method _hashlib.scrypt}
32814/31689    0.058    0.000    0.090    0.000 {built-in method __new__ of type object at 0x7fb50d819ea0}
147247    0.050    0.000    0.050    0.000 {method 'startswith' of 'str' objects}

And here it is ordered by cumulative time, if that helps:

[agent] Success, execution time 0.5 sec | execution_time=0.520 user_time=0.090 system_time=0.010 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.420
3203889 function calls (2983468 primitive calls) in 6.333 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
5414/1    0.588    0.000    6.436    6.436 {built-in method builtins.exec}
1    0.000    0.000    6.435    6.435 cmk:1(<module>)
2240/1064    0.020    0.000    5.105    0.005 <frozen importlib._bootstrap>:1349(_find_and_load)
2212/1046    0.015    0.000    5.046    0.005 <frozen importlib._bootstrap>:1304(_find_and_load_unlocked)
2175/1144    0.011    0.000    4.842    0.004 <frozen importlib._bootstrap>:911(_load_unlocked)
2044/1076    0.007    0.000    4.795    0.004 <frozen importlib._bootstrap_external>:993(exec_module)
4712/2131    0.005    0.000    4.603    0.002 <frozen importlib._bootstrap>:480(_call_with_frames_removed)
192    0.006    0.000    4.367    0.023 __init__.py:1(<module>)
1    0.000    0.000    3.971    3.971 config.py:1414(load_all_plugins)
2    0.000    0.000    3.970    1.985 contextlib.py:78(inner)
941/933    0.002    0.000    3.162    0.003 __init__.py:73(import_module)
1029/933    0.002    0.000    3.160    0.003 <frozen importlib._bootstrap>:1375(_gcd_import)
1    0.001    0.001    3.155    3.155 _discover.py:59(load_all_plugins)
1    0.001    0.001    2.915    2.915 _python_plugins.py:49(discover_all_plugins)
1    0.001    0.001    2.909    2.909 _python_plugins.py:62(discover_plugins_from_modules)
919    0.003    0.000    2.905    0.003 _python_plugins.py:215(add_from_module)
921    0.001    0.000    2.791    0.003 _python_plugins.py:176(_import_optionally)
284/81    0.001    0.000    2.514    0.031 {built-in method builtins.__import__}
3412/2599    0.009    0.000    2.426    0.001 <frozen importlib._bootstrap>:1390(_handle_fromlist)
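
Since the full report is too long to post, one option is to let Python's `pstats` module trim it down. The sketch below profiles the import of a standard library module (chosen only as a stand-in, because importing modules is what dominates the listing above) and prints the top entries by cumulative time. The function name `profile_import` is hypothetical.

```python
import cProfile
import importlib
import io
import pstats


def profile_import(module_name: str, top: int = 10) -> str:
    """Profile importing a module and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    importlib.import_module(module_name)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # strip_dirs() shortens file paths, print_stats(top) limits the row count
    stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return out.getvalue()


report = profile_import("decimal")
print(report)
```

The same approach should work on a saved profile: running cProfile with `-o` and a file name (for example a hypothetical `cmk.prof`) writes binary statistics, which `pstats.Stats("cmk.prof")` can load, and `print_stats("load_all_plugins")` would restrict the listing to matching function names.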

Is a 5 to 6 second execution time normal for a single agent check, using 100% CPU, or is there something wrong with my setup? If it is normal, what would I need hardware-wise? A 64-core CPU? That doesn’t sound right. We are currently moving away from our old Nagios, which does all that on a single core with 2 GB of RAM without breaking a sweat, with the average load staying at 0.2.
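
As a rough back-of-the-envelope estimate of what these numbers imply, the sketch below computes the average CPU demand for the production setup. It assumes that every check costs the same CPU time as the one measured above (user plus sys) and that checks are spread evenly over the interval; both are simplifications, and the function name `cpu_utilization` is made up for illustration.

```python
def cpu_utilization(hosts: int, cpu_seconds_per_check: float,
                    cores: int, interval_seconds: float) -> float:
    """Average fraction of the available CPU time consumed by host checks."""
    return hosts * cpu_seconds_per_check / (cores * interval_seconds)


# user + sys from `time cmk --check targethostname`
per_check = 5.100 + 0.472

for interval in (60, 300):
    utilization = cpu_utilization(hosts=120, cpu_seconds_per_check=per_check,
                                  cores=4, interval_seconds=interval)
    print(f"interval {interval:>3} s: {utilization:.0%} of 4 cores")
```

Under these assumptions a 1-minute interval would need about 2.8 times the CPU available on 4 cores, while a 5-minute interval would use a bit more than half of it on average, provided the checks are not all started at the same moment.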

Thanks for your input.

A quick solution for your problem is to configure the number of concurrent checks your Nagios core can run. In the default config it tries to run all the checks at once, which leads to high load values and long execution times.
Some information on this problem.

Ah yes, thank you Andreas. I’ve matched the number of concurrent checks to the number of processor cores on the machine, and the load now seems steady. I’ll keep monitoring and tinkering with this value to get the best results.

Kind regards,

Normally you can configure twice the number of available CPU cores as the maximum number of concurrent checks, I think.
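
The rule of thumb from this thread (between one and two concurrent checks per core) can be sketched as below, together with a crude estimate of how long one full pass over all hosts would take. The estimate assumes each check occupies a slot for the full measured wall-clock time and that checks run in complete batches; the function names are hypothetical. In a plain Nagios configuration the corresponding directive is `max_concurrent_checks` in nagios.cfg; where a given Checkmk site exposes it is not stated here.

```python
import math
import os


def concurrent_check_range(cores: int) -> tuple[int, int]:
    """Lower and upper bound suggested in this thread: cores to 2 x cores."""
    return cores, 2 * cores


def sweep_seconds(hosts: int, concurrency: int, seconds_per_check: float) -> float:
    """Rough duration of one pass over all hosts, run in full batches."""
    return math.ceil(hosts / concurrency) * seconds_per_check


cores = os.cpu_count() or 1  # fall back to 1 if the count is unknown
low, high = concurrent_check_range(cores)
print(f"{cores} cores: try between {low} and {high} concurrent checks")

# Numbers from this thread: 120 hosts, 4 cores, 5.984 s per check
print(f"one pass at 4 concurrent checks: {sweep_seconds(120, 4, 5.984):.1f} s")
```

With the figures above, one pass takes about three minutes at 4 concurrent checks, which fits inside a 5-minute check interval but not a 1-minute one.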