CMK version: 2.4.0p24.cre
OS version: Debian 13.4
Error message: N/A
Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
value store: loading from disk
Checkmk version 2.4.0p24
- FETCHING DATA
Source: SourceInfo(hostname=‘mytarget’, ipaddress=‘target_private_ip’, ident=‘agent’, fetcher_type=<FetcherType.TCP: 8>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f24d94e0230]
Read from cache: AgentFileCache(path_template=/omd/sites/testsite/tmp/check_mk/cache/mytarget, max_age=MaxAge(checking=0, discovery=90.0, inventory=90.0), simulation=False, use_only_cache=False, file_cache_mode=6)
Not using cache (Too old. Age is 44 sec, allowed is 0 sec)
Connecting via TCP to target_private_ip:6556 (5.0s timeout)
Detected transport protocol: TransportProtocol.PLAIN
Reading data from agent
Closing TCP connection to target_private_ip:6556
Write data to cache file /omd/sites/testsite/tmp/check_mk/cache/mytarget
Trying to acquire lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
Got lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
Releasing lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
Released lock on /omd/sites/testsite/tmp/check_mk/cache/mytarget
[cpu_tracking] Stop [7f24d94e0230 - Snapshot(process=posix.times_result(user=0.010000000000000231, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.44999999925494194))]
[cpu_tracking] Start [7f24d939afc0] - PARSE FETCHER RESULTS
<<<check_mk>>> / Transition NOOPParser → HostSectionParser
<<<checkmk_agent_plugins_lnx:sep(0)>>> / Transition HostSectionParser → HostSectionParser
<<<labels:sep(0)>>> / Transition HostSectionParser → HostSectionParser
<<<df_v2>>> / Transition HostSectionParser → HostSectionParser
<<<df_v2>>> / Transition HostSectionParser → HostSectionParser
<<<systemd_units>>> / Transition HostSectionParser → HostSectionParser
<<<nfsmounts_v2:sep(0)>>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<<ps_lnx>>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<<lnx_if>>> / Transition HostSectionParser → HostSectionParser
<<<lnx_if:sep(58)>>> / Transition HostSectionParser → HostSectionParser
<<<tcp_conn_stats>>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<>> / Transition HostSectionParser → HostSectionParser
<<<vbox_guest>>> / Transition HostSectionParser → HostSectionParser
<<<postfix_mailq>>> / Transition HostSectionParser → HostSectionParser
<<<postfix_mailq_status:sep(58)>>> / Transition HostSectionParser → HostSectionParser
<<<local:sep(0)>>> / Transition HostSectionParser → HostSectionParser
HostKey(hostname=‘mytarget’, source_type=<SourceType.HOST: 1>) → Add sections: [‘check_mk’, ‘checkmk_agent_plugins_lnx’, ‘cifsmounts’, ‘cpu’, ‘df_v2’, ‘diskstat’, ‘kernel’, ‘labels’, ‘lnx_if’, ‘local’, ‘md’, ‘mem’, ‘mounts’, ‘nfsmounts_v2’, ‘postfix_mailq’, ‘postfix_mailq_status’, ‘ps_lnx’, ‘systemd_units’, ‘tcp_conn_stats’, ‘uptime’, ‘vbox_guest’]
Received no piggyback data
CPU load 15 min load: 0.61, 15 min load per core: 0.10 (6 cores)
CPU utilization Total CPU: 11.45%
Check_MK Agent Version: 2.4.0p24, OS: linux, Agent plug-ins: 0, Local checks: 0
Disk IO SUMMARY PEND Initializing counters
Filesystem / PEND Counter ‘/.delta’ has been initialized. Result available on second check execution., Used: 90.66% - 537 GiB of 592 GiB (warn/crit at 80.00%/90.00% used)
Interface 2 [eth0], (up), MAC: 00:50:56:BE:38:E0, Speed: 10 GBit/s
Kernel Performance PEND Counter ‘processes’ has been initialized. Result available on second check execution.
Memory Total virtual memory: 17.07% - 3.37 GiB of 19.8 GiB, 10 additional details available
Mount options of / Mount options exactly as expected
Number of threads 183, Usage: 0.19%
Postfix Queue default Deferred queue length: 36 (warn/crit at 10/20)(!!), Active queue length: 0
Postfix status default Status: the Postfix mail system is running, PID: 1344
Systemd Service Summary Total: 79, Disabled: 3, Failed: 0
Systemd Socket Summary Total: 9, Disabled: 1, Failed: 0
TCP Connections Established: 14
Uptime Up since 2026-04-03 11:19:34, Uptime: 18 days 23 hours
[cpu_tracking] Stop [7f24d939afc0 - Snapshot(process=posix.times_result(user=0.18999999999999995, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.18000000342726707))]
[agent] Success, execution time 0.6 sec | execution_time=0.630 user_time=0.200 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.440
Hello,
I’m facing extreme slowness in agent checks. My setup is as follows:
- a single site on a single monitoring server (a VM with 4 CPUs and 8 GB of RAM)
- about 120 hosts monitored, exclusively via the Checkmk agent. No custom plugins, just the standard agent and its auto-discovered services.
- all monitored targets are set up with a private IPv4 address (no hostname resolution)
This is a somewhat long post, so I’ll first describe the problem and then what I tried to solve it.
The problem
Checking an agent is very slow when performed by Checkmk. The agent’s own response time is quick. Here is an example:
OMD[testsite]:~$ time echo | nc target_private_ip 6556
<<<check_mk>>>
Version: 2.4.0p24
AgentOS: linux
Hostname: targethostname
AgentDirectory: /etc/check_mk
DataDirectory: /var/lib/check_mk_agent
SpoolDirectory: /var/lib/check_mk_agent/spool
PluginsDirectory: /usr/lib/check_mk_agent/plugins
LocalDirectory: /usr/lib/check_mk_agent/local
OSType: linux
OSName: Debian GNU/Linux
[....]
real 0m0.413s
user 0m0.011s
sys 0m0.020s
So the execution time with netcat is 0.413 s. The same host checked by Checkmk:
OMD[testsite]:~$ time cmk --check targethostname
[agent] Success, execution time 0.6 sec | execution_time=0.630 user_time=0.210 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.420
real 0m5.984s
user 0m5.100s
sys 0m0.472s
cmk takes 5.984 s.
I understand cmk does a bit more than just fetch the values, but it still takes almost 15 times as long as the data fetch itself. These slow checks add up, and services become stale or simply time out, sending alerts while the monitored hosts are actually up and running.
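As a quick sanity check on that gap, the sketch below redoes the arithmetic with the figures from the two runs above. The variable names are only illustrative; the values are copied from the netcat and `cmk --check` output.

```python
# Timings copied from the two runs above, in seconds.
nc_real = 0.413        # wall time of the raw netcat fetch
cmk_real = 5.984       # wall time of `cmk --check`
cmk_user = 5.100       # user CPU time of `cmk --check`
cmk_reported = 0.630   # execution_time reported by the check itself

# How many times slower cmk is than the raw fetch.
slowdown = cmk_real / nc_real
# Wall time spent outside the execution time that the check reports.
unaccounted = cmk_real - cmk_reported
# Fraction of the wall time spent burning user CPU.
cpu_share = cmk_user / cmk_real

print(f"slowdown vs netcat: {slowdown:.1f}x")
print(f"time outside reported execution_time: {unaccounted:.2f} s")
print(f"user CPU share of wall time: {cpu_share:.0%}")
```

Roughly 5.35 s of each run falls outside the 0.63 s that the check itself reports, and most of the wall time is user CPU rather than waiting on the network.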
The diagnosis steps
Initially, the server had 2 cores and 4 GB of RAM. This VM is freshly installed and dedicated to this single task. Checkmk was installed from the provided Docker image. Nothing else runs on it (except for another Docker container acting as a reverse proxy for HTTPS). The 1-minute load on the host server reached 200 at every check cycle. The 15-minute load was in the double digits.
htop shows tens of python -P /omd/sites/mysite/var/check_mk/core/helper_config/latest/host_checks/mytarget processes piling up (one for each monitored node), each using 75 to 100% CPU. docker stats reported CPU usage of 500% for the cmk container.
Enabling or disabling the HW/SW inventory makes no difference.
I set the service check interval to 5 minutes instead of the default 1 minute. Not much changed: the server now buckles every 5 minutes instead of every minute.
I bumped the server to 4 cores and 8 GB of RAM. It still shows unusually high loads (the 15-minute load is still in the double digits).
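For a rough idea of why adding cores did not help much, here is a back-of-the-envelope sketch. It assumes that every one of the 120 hosts costs about the same 5.1 s of user CPU as the single `cmk --check` run timed above, and that checks parallelise perfectly across cores. Both are assumptions, not measurements, and the function names are made up for this example.

```python
import math

# Figures taken from this post; applying the single measured timing to every
# production host is an assumption, not a measurement.
HOSTS = 120            # monitored hosts in the production site
CPU_PER_CHECK = 5.1    # user CPU seconds of one `cmk --check` run


def cpu_seconds_per_cycle(hosts: int = HOSTS, cpu_per_check: float = CPU_PER_CHECK) -> float:
    """Total CPU seconds needed to check every host once."""
    return hosts * cpu_per_check


def min_cycle_wall_seconds(cores: int) -> float:
    """Best-case wall time for one full round of checks with perfect parallelism."""
    return cpu_seconds_per_cycle() / cores


def cores_to_keep_up(interval_seconds: float) -> int:
    """Smallest whole number of cores that finishes a round within the interval."""
    return math.ceil(cpu_seconds_per_cycle() / interval_seconds)


print(f"CPU seconds per round: {cpu_seconds_per_cycle():.0f}")
for cores in (2, 4):
    print(f"{cores} cores: at least {min_cycle_wall_seconds(cores):.0f} s per round")
for interval in (60, 300):
    print(f"{interval} s interval: at least {cores_to_keep_up(interval)} cores")
```

Under these assumptions a full round needs about 612 CPU seconds, so 4 cores cannot finish it within a 1-minute interval, and even with a 5-minute interval all 4 cores would be busy for roughly half of it.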
I’ve removed the TLS registration so that the hosts are queried in plain text, to avoid the cryptography overhead. No change.
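With the agent answering in plain text on port 6556, the raw fetch can also be timed from Python without involving cmk at all. The following is a minimal sketch of that idea, equivalent in spirit to the netcat test; `time_agent_fetch` is a hypothetical helper name, and the 5-second timeout simply mirrors the one shown in the debug output.

```python
import socket
import time


def time_agent_fetch(host: str, port: int = 6556, timeout: float = 5.0) -> tuple[float, bytes]:
    """Connect to a plain-text agent, read until it closes the connection,
    and return (elapsed seconds, raw agent output)."""
    start = time.perf_counter()
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read: the agent closed the connection
                break
            chunks.append(chunk)
    return time.perf_counter() - start, b"".join(chunks)


# Example call (not executed here, the address is a placeholder):
#   elapsed, data = time_agent_fetch("target_private_ip")
#   print(f"{len(data)} bytes in {elapsed:.3f} s")
```

If this reports a time close to the netcat result, the network and the agent can be ruled out, and the remaining seconds are spent inside the cmk process itself.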
I’ve reinstalled Checkmk through apt instead of the Docker image. No luck.
I’ve stopped the production site with the 120-odd hosts and created a test site with only 2 hosts. The execution times posted in the “The problem” section are from this bare install: a single site with 2 hosts, using plain-text communication.
/proc/pressure/memory, io and cpu on the host no longer report any pressure with the 2-host setup.
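For anyone who wants to reproduce the pressure check, here is a small sketch that reads the Linux pressure stall files and turns them into numbers. It assumes the usual `some`/`full` line layout of `/proc/pressure/*`; `parse_pressure` and `read_pressure` are hypothetical helper names.

```python
from pathlib import Path


def parse_pressure(text: str) -> dict[str, dict[str, float]]:
    """Parse the content of a /proc/pressure file into
    {"some": {"avg10": ..., "avg60": ..., "avg300": ..., "total": ...}, ...}."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()  # kind is "some" or "full"
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_pressure(resource: str) -> dict[str, dict[str, float]]:
    """Read /proc/pressure/<resource>, where resource is cpu, memory or io."""
    return parse_pressure(Path(f"/proc/pressure/{resource}").read_text())


# Example call on a Linux host with pressure stall information enabled:
#   for name in ("cpu", "memory", "io"):
#       print(name, read_pressure(name)["some"]["avg60"])
```

The `avg10`, `avg60` and `avg300` values are percentages of time in which tasks were stalled on the resource, so values near zero mean no measurable pressure.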
I’ve used cProfile to try to understand what is taking Checkmk so long. The report is too long to post in its entirety, and as a new forum user I cannot attach files to this post. I’m no developer by any means, so I’m not sure how to interpret this, or even whether it’s relevant. Here is an excerpt of the output of:
su - testsite -c "python3 -m cProfile -s time /omd/sites/testsite/bin/cmk --check mytarget"
[agent] Success, execution time 0.5 sec | execution_time=0.520 user_time=0.100 system_time=0.000 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.420
3203148 function calls (2982729 primitive calls) in 6.381 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
18 0.413 0.023 0.413 0.023 {method ‘recv’ of ‘_socket.socket’ objects}
5414/1 0.364 0.000 6.489 6.489 {built-in method builtins.exec}
300 0.306 0.001 0.372 0.001 _core_utils.py:581(validate_core_schema)
1 0.292 0.292 0.292 0.292 {method ‘load_verify_locations’ of ‘_ssl._SSLContext’ objects}
919 0.232 0.000 0.317 0.000 _python_plugins.py:228(_collect_module_plugins)
17226 0.228 0.000 0.248 0.000 <frozen importlib._bootstrap_external>:126(_path_join)
2823 0.224 0.000 0.224 0.000 {built-in method marshal.loads}
9167 0.217 0.000 0.268 0.000 validatedstr.py:36(__new__)
8220 0.103 0.000 0.104 0.000 inspect.py:3076()
374568/370097 0.098 0.000 0.109 0.000 {built-in method builtins.isinstance}
4089 0.097 0.000 0.097 0.000 {built-in method _io.open_code}
5654 0.089 0.000 0.089 0.000 {method ‘read’ of ‘_io.BufferedReader’ objects}
9533 0.088 0.000 0.088 0.000 {built-in method posix.stat}
150067 0.068 0.000 0.077 0.000 {built-in method builtins.getattr}
300 0.063 0.000 0.064 0.000 {built-in method pydantic_core._pydantic_core.validate_core_schema}
3730/3621 0.062 0.000 1.662 0.000 {built-in method builtins.__build_class__}
1 0.059 0.059 0.059 0.059 {built-in method _hashlib.scrypt}
32814/31689 0.058 0.000 0.090 0.000 {built-in method __new__ of type object at 0x7fb50d819ea0}
147247 0.050 0.000 0.050 0.000 {method ‘startswith’ of ‘str’ objects}
And here is the same command with the output ordered by cumulative time, in case that helps:
[agent] Success, execution time 0.5 sec | execution_time=0.520 user_time=0.090 system_time=0.010 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.420
3203889 function calls (2983468 primitive calls) in 6.333 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
5414/1 0.588 0.000 6.436 6.436 {built-in method builtins.exec}
1 0.000 0.000 6.435 6.435 cmk:1(<module>)
2240/1064 0.020 0.000 5.105 0.005 <frozen importlib._bootstrap>:1349(_find_and_load)
2212/1046 0.015 0.000 5.046 0.005 <frozen importlib._bootstrap>:1304(_find_and_load_unlocked)
2175/1144 0.011 0.000 4.842 0.004 <frozen importlib._bootstrap>:911(_load_unlocked)
2044/1076 0.007 0.000 4.795 0.004 <frozen importlib._bootstrap_external>:993(exec_module)
4712/2131 0.005 0.000 4.603 0.002 <frozen importlib._bootstrap>:480(_call_with_frames_removed)
192 0.006 0.000 4.367 0.023 __init__.py:1(<module>)
1 0.000 0.000 3.971 3.971 config.py:1414(load_all_plugins)
2 0.000 0.000 3.970 1.985 contextlib.py:78(inner)
941/933 0.002 0.000 3.162 0.003 __init__.py:73(import_module)
1029/933 0.002 0.000 3.160 0.003 <frozen importlib._bootstrap>:1375(_gcd_import)
1 0.001 0.001 3.155 3.155 _discover.py:59(load_all_plugins)
1 0.001 0.001 2.915 2.915 _python_plugins.py:49(discover_all_plugins)
1 0.001 0.001 2.909 2.909 _python_plugins.py:62(discover_plugins_from_modules)
919 0.003 0.000 2.905 0.003 _python_plugins.py:215(add_from_module)
921 0.001 0.000 2.791 0.003 _python_plugins.py:176(_import_optionally)
284/81 0.001 0.000 2.514 0.031 {built-in method builtins.__import__}
3412/2599 0.009 0.000 2.426 0.001 <frozen importlib._bootstrap>:1390(_handle_fromlist)
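Since the full report is too long to paste, one option is to trim it with the standard `pstats` module instead of copying from the terminal. The sketch below shows the idea on a stand-in workload; `top_entries` and `workload` are hypothetical names, and the `/tmp/cmk.prof` path in the comment is only an example.

```python
import cProfile
import io
import pstats


def top_entries(profile, sort_key: str = "cumulative", limit: int = 20) -> str:
    """Return the first `limit` rows of a profile as text.

    `profile` may be a cProfile.Profile object or the path of a file written
    with `python3 -m cProfile -o <file> ...`."""
    buffer = io.StringIO()
    stats = pstats.Stats(profile, stream=buffer)
    stats.strip_dirs().sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()


def workload() -> int:
    """Stand-in for the real command; just burns a little CPU."""
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.runcall(workload)
report = top_entries(profiler, limit=5)
print(report)

# For a profile saved to disk (example path, not from the original run):
#   python3 -m cProfile -o /tmp/cmk.prof /omd/sites/testsite/bin/cmk --check mytarget
#   print(top_entries("/tmp/cmk.prof", limit=30))
```

In the cumulative listing above, `_find_and_load` accounts for about 5.1 s of the 6.3 s total and `load_all_plugins` for about 4.0 s, which suggests that most of the run is spent importing modules and loading plugins rather than talking to the agent.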
Is an execution time of 5 to 6 seconds at 100% CPU normal for a single agent check, or is there something wrong with my setup? If it is normal, what would I need hardware-wise? A 64-core CPU? That doesn’t sound right. We are currently moving away from our old Nagios, which does all of this on a single core with 2 GB of RAM without breaking a sweat, with the average load staying at 0.2.
Thanks for your input.