CMK 2.1 - "Activating changes" significantly slower than in all previous releases

CMK version: CEE 2.1.0p9
OS version: Ubuntu 20.04.4 LTS

One of the enhancements advertised for 2.1 has been “faster change activation, incremental activation”.

However, for us the experience has been the polar opposite:

With 2.0.x, no matter which patch release, clicking “Activate” in the GUI would take about 11-13 seconds to finish (~30,000 services, 1 site, no satellite sites).

After updating to 2.1.x, the first activation immediately following the major update took about 2 minutes. “Okay, maybe it rearranged the entire core layout for the first time.”

Any subsequent activation now takes between 40 and 60 seconds, so at least 4x as long as before.

Every update to a new patch release results in a one-time two-minute activation after the site is started and the “Activate” button is highlighted. (This wasn’t the case at all in 2.0 or earlier - why would I need to re-activate changes, and why isn’t this handled by the release update script, which prints countless steps to stdout anyway?)

Does anybody else experience this? How could we begin to troubleshoot what would cause this? Do we need to change our custom plugins?


A first check to see where it takes so long would be running “cmk --debug -vvR” on the command line.
This is essentially the same work the activation does in the background.

That’s how it looks with 2.1 - 49 seconds spent compiling the protobuf configuration. Should this not happen in parallel instead of serially? Is it supposed to take this long? And is it supposed to happen at all if the new protobuf config is identical to the previous one (as no config changes have been applied)?

OMD[MYSITE]:~$ cmk --debug -vvR | ts '[%Y-%m-%d %H:%M:%S]'
WARNING: 1 deprecated plugins will be ignored in Checkmk version 2.2 (see werk #14084): mikrotik_os_info
[2022-08-08 13:20:06] Trying to acquire lock on /omd/sites/MYSITE/etc/check_mk/main.mk
[2022-08-08 13:20:06] Got lock on /omd/sites/MYSITE/etc/check_mk/main.mk
[2022-08-08 13:20:06] Generating configuration for core (type cmc)...
[2022-08-08 13:20:06] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/serial.mk
[2022-08-08 13:20:06] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/serial.mk
[2022-08-08 13:20:06] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/serial.mk
[2022-08-08 13:20:06] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/serial.mk
[2022-08-08 13:20:06] 
[2022-08-08 13:20:06] time period 'Abrechnungsdaten-Vortag': 2922 time points
[2022-08-08 13:20:06] time period 'backup': 420 time points (optimized from 2086)
[2022-08-08 13:20:06] time period 'gz': 2086 time points
[2022-08-08 13:20:06] time period 'never': 0 time points
[2022-08-08 13:20:06] time period 'notifications-pipeline': 2922 time points
[2022-08-08 13:20:06] time period 'pegasusraid-exception': 420 time points (optimized from 2922)
[2022-08-08 13:20:06] time period 'trd-ok': 2086 time points
[2022-08-08 13:20:06] time period 'san': 2924 time points (optimized from 5844)
[2022-08-08 13:20:06] time period '24X7': 2 time points
[2022-08-08 13:20:06] Starting full compilation for all hosts Creating global helper config...OK
[2022-08-08 13:20:06]  Creating cmc protobuf configuration...Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/FTTx-CPE
[2022-08-08 13:20:06] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/FTTx-CPE
[2022-08-08 13:20:06] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/FTTx-CPE
[2022-08-08 13:20:06] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/FTTx-CPE
[2022-08-08 13:20:06] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/eBGP-Peer
[2022-08-08 13:20:06] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/eBGP-Peer
[2022-08-08 13:20:06] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/eBGP-Peer
[2022-08-08 13:20:06] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/eBGP-Peer
[2022-08-08 13:20:06] WARNING: Cannot lookup IP address of 'HOST-asp' (Failed to lookup IPv4 address of HOST-asp via DNS: [Errno -3] Temporary failure in name resolution). The host will not be monitored correctly.
[2022-08-08 13:20:06] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/HOST-asp
[2022-08-08 13:20:06] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/HOST-asp
...
[2022-08-08 13:20:55] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/server30
[2022-08-08 13:20:55] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/server30
[2022-08-08 13:20:55] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/server30
[2022-08-08 13:20:55] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/core_host_config/server30
[2022-08-08 13:20:56] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/helper_cache_info
[2022-08-08 13:20:56] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/helper_cache_info
[2022-08-08 13:20:56] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/helper_cache_info
[2022-08-08 13:20:56] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2419/helper_cache_info
[2022-08-08 13:20:56] OK
[2022-08-08 13:20:56] /omd/sites/MYSITE/var/check_mk/core/config.pb written.
[2022-08-08 13:20:56] Baking agents...Trying to acquire lock on /omd/sites/MYSITE/etc/check_mk/multisite.mk
[2022-08-08 13:20:56] Got lock on /omd/sites/MYSITE/etc/check_mk/multisite.mk
[2022-08-08 13:20:57] Releasing lock on /omd/sites/MYSITE/etc/check_mk/multisite.mk
[2022-08-08 13:20:57] Released lock on /omd/sites/MYSITE/etc/check_mk/multisite.mk
[2022-08-08 13:20:57] VANILLA...linux_deb:uptodate...linux_rpm:uptodate...solaris_pkg:uptodate...windows_msi:uptodate...linux_tgz:uptodate...solaris_tgz:uptodate...aix_tgz:uptodate...OK
[2022-08-08 13:20:57] GENERIC...linux_deb:uptodate...linux_rpm:uptodate...solaris_pkg:uptodate...windows_msi:uptodate...linux_tgz:uptodate...solaris_tgz:uptodate...aix_tgz:uptodate...OK
[...]
[2022-08-08 13:21:09] server063...linux_deb:uptodate...linux_rpm:uptodate...solaris_pkg:uptodate...windows_msi:uptodate...linux_tgz:uptodate...solaris_tgz:uptodate...aix_tgz:uptodate...OK
[2022-08-08 13:21:09] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/wato/bakery_file_status.mk
[2022-08-08 13:21:09] Got lock on /omd/sites/MYSITE/var/check_mk/wato/bakery_file_status.mk
[2022-08-08 13:21:09] Releasing lock on /omd/sites/MYSITE/var/check_mk/wato/bakery_file_status.mk
[2022-08-08 13:21:09] Released lock on /omd/sites/MYSITE/var/check_mk/wato/bakery_file_status.mk
[2022-08-08 13:21:09] OK
[2022-08-08 13:21:21] Restarting monitoring core...OK
[2022-08-08 13:21:21] Releasing lock on /omd/sites/MYSITE/etc/check_mk/main.mk
[2022-08-08 13:21:21] Released lock on /omd/sites/MYSITE/etc/check_mk/main.mk

How many of these warnings do you have?
I had a test system with many lookup messages and it was really slow - after changing all settings so that these lookups no longer happened, the time was fine again.

[2022-08-08 22:12:58] Starting full compilation for all hosts Creating global helper config...OK
[...]
[2022-08-08 22:13:28] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2420/helper_cache_info
[2022-08-08 22:13:28] OK
[2022-08-08 22:13:28] /omd/sites/MYSITE/var/check_mk/core/config.pb written.
[2022-08-08 22:13:28] Baking agents...Trying to acquire lock on /omd/sites/MYSITE/etc/check_mk/multisite.mk
[...]
[2022-08-08 22:13:36] OK
[2022-08-08 22:13:51] Restarting monitoring core...OK

Moving the non-resolvable hosts out of the CMK config brings the protobuf delta down from 49 seconds to 30 seconds, so it makes an impact - but we’re still far from the original 14 seconds, and even further from any “improvements” touted as significant during the Checkmk conference.

This message “Cannot lookup IP address of” began to be logged once we’d updated to 2.0 and only concerns hosts that we inject into the site via cmcdump of another site. Our site is indeed unable to resolve any of those hosts, and we wouldn’t expect it to. We expect that Checkmk would not need to worry about that at all; after all, it passively consumes the host and service metadata and check results as-is.

Also, all of those hosts dutifully get their IP addresses imported into our CEE as part of the cmcdump, and neither the original instance nor our instance needs to look anything up, given that all of our hosts on all of our sites have their primary IP address statically defined as a host property - we never do DNS lookups for any of our hosts, save 2 or 3 such as “www.google.com” as reference ICMP targets for WAN connectivity reports.

This “non-resolvable IP address slows down the process” might also explain the high CPU utilization we’ve experienced on yet another node: we cmcdump all 30,000 services of our main instance to another Checkmk VM in our DMZ in order to access Checkmk results without needing dial-in VPN. With 2.0, this was slow but stable; with 2.1 it’s really taxing on the “slave” instance, which is constantly starved of resources, and Nagstamon times out. In this case there are hundreds of occurrences of “Cannot lookup IP address of” every minute, every time we cmcdump the current state of the main instance.

I’m seeing a similar increase in runtime in CRE too, but no non-resolvable IPs in my case.

My log:

OMD[zdv]:~$ cmk --debug -vvR | ts '[%Y-%m-%d %H:%M:%S]'
[2022-08-09 13:41:24] Trying to acquire lock on /omd/sites/zdv/etc/check_mk/main.mk
[2022-08-09 13:41:24] Got lock on /omd/sites/zdv/etc/check_mk/main.mk
[2022-08-09 13:41:24] Generating configuration for core (type nagios)...
[2022-08-09 13:41:24] Trying to acquire lock on /omd/sites/zdv/var/check_mk/core/helper_config/serial.mk
[2022-08-09 13:41:24] Got lock on /omd/sites/zdv/var/check_mk/core/helper_config/serial.mk
[2022-08-09 13:41:24] Releasing lock on /omd/sites/zdv/var/check_mk/core/helper_config/serial.mk
[2022-08-09 13:41:24] Released lock on /omd/sites/zdv/var/check_mk/core/helper_config/serial.mk
[2022-08-09 13:44:03] Trying to acquire lock on /omd/sites/zdv/etc/nagios/conf.d/check_mk_objects.cfg
[2022-08-09 13:44:03] Got lock on /omd/sites/zdv/etc/nagios/conf.d/check_mk_objects.cfg
[2022-08-09 13:44:03] Releasing lock on /omd/sites/zdv/etc/nagios/conf.d/check_mk_objects.cfg
[2022-08-09 13:44:03] Released lock on /omd/sites/zdv/etc/nagios/conf.d/check_mk_objects.cfg
[2022-08-09 13:44:03] Precompiling host checks...Creating precompiled host check config...
[2022-08-09 13:44:03]webstats-01     : Precompiling host checks...
 cmk.base.plugins.agent_based.apt cmk.base.plugins.agent_based.arbor_peakflow_sp cmk.base.plugins.agent_based.arbor_peakflow_tms cmk.base.plugins.agent_based.arbor_pravail cmk.base.plugins.agent_based.check_mk cmk.base.plugins.agent_based.checkmk_agent cmk.base.plugins.agent_based.checkmk_agent_plugins cmk.base.plugins.agent_based.chrony cmk.base.plugins.agent_based.cmk_agent_ctl_status cmk.base.plugins.agent_based.cpu cmk.base.plugins.agent_based.cpu_load cmk.base.plugins.agent_based.cpu_threads cmk.base.plugins.agent_based.datapower_tcp cmk.base.plugins.agent_based.df_section cmk.base.plugins.agent_based.diskstat cmk.base.plugins.agent_based.diskstat_statgrab_section cmk.base.plugins.agent_based.docker_container_diskstat cmk.base.plugins.agent_based.docker_container_diskstat_cgroupv2 cmk.base.plugins.agent_based.docker_container_mem cmk.base.plugins.agent_based.docker_container_mem_cgroupv2 cmk.base.plugins.agent_based.fileinfo cmk.base.plugins.agent_based.hpux_cpu cmk.base.plugins.agent_based.hr_mem cmk.base.plugins.agent_based.kernel cmk.base.plugins.agent_based.kube_uptime cmk.base.plugins.agent_based.lnx_bonding cmk.base.plugins.agent_based.lnx_if cmk.base.plugins.agent_based.local cmk.base.plugins.agent_based.logwatch cmk.base.plugins.agent_based.logwatch_section cmk.base.plugins.agent_based.mcafee_emailgateway_cpuload cmk.base.plugins.agent_based.mem cmk.base.plugins.agent_based.mem_used_sections cmk.base.plugins.agent_based.mobileiron_section cmk.base.plugins.agent_based.mrpe cmk.base.plugins.agent_based.mysql_capacity cmk.base.plugins.agent_based.ntp cmk.base.plugins.agent_based.ovs_bonding cmk.base.plugins.agent_based.ps_check cmk.base.plugins.agent_based.ps_section cmk.base.plugins.agent_based.snmp_uptime cmk.base.plugins.agent_based.statgrab_load cmk.base.plugins.agent_based.systemd_units cmk.base.plugins.agent_based.tcp_conn_stats cmk.base.plugins.agent_based.ucd_cpu_load cmk.base.plugins.agent_based.uptime 
cmk.base.plugins.agent_based.windows_intel_bonding cmk.base.plugins.agent_based.windows_os_bonding cmk.base.plugins.agent_based.winperf_tcp_conn apt burp cfengine3 debian_lts_status df kernel.performance logfiles mem.linux mounts multipath mysql mysql.connections mysql.innodb_io mysql.sessions mysql.status mysql_ping nfsmounts[2022-08-09 13:44:03] Trying to acquire lock on /omd/sites/zdv/var/check_mk/core/helper_config/491/host_checks/webstats-01.py
[2022-08-09 13:44:03] Got lock on /omd/sites/zdv/var/check_mk/core/helper_config/491/host_checks/webstats-01.py
[2022-08-09 13:44:03] Releasing lock on /omd/sites/zdv/var/check_mk/core/helper_config/491/host_checks/webstats-01.py
[2022-08-09 13:44:03] Released lock on /omd/sites/zdv/var/check_mk/core/helper_config/491/host_checks/webstats-01.py
 ==> /omd/sites/zdv/var/check_mk/core/helper_config/491/host_checks/webstats-01.
[...]
[2022-08-09 13:44:07] OK
[2022-08-09 13:44:07] Running '/omd/sites/zdv/bin/nagios -vp /omd/sites/zdv/tmp/nagios/nagios.cfg'
[2022-08-09 13:44:07] Validating Nagios configuration...OK
[2022-08-09 13:44:25] Restarting monitoring core...OK
[2022-08-09 13:44:25] Releasing lock on /omd/sites/zdv/etc/check_mk/main.mk
[2022-08-09 13:44:25] Released lock on /omd/sites/zdv/etc/check_mk/main.mk

What’s happening in those nearly three minutes in the beginning?

Attaching an strace to the running process shows it oscillating between openat calls on ~/etc/password_store.secret and ~/var/check_mk/stored_passwords - maybe that helps?

Probably related to “The password store is now storing passwords obfuscated”?


FWIW, we’re not using the password store at all.

I could verify this problem on my instance using cmcdump - it was running 1.6 until last week. :)
I have a quick workaround, only usable on my cmcdump-only instance: I modified the function “ip_address_of” inside core_config.py so that it returns “0.0.0.0” for every host and performs no lookup.
This is not feasible in a mixed environment.
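For illustration, the short-circuit looks roughly like this (a sketch only - the real ip_address_of() in Checkmk's core_config.py has a different signature and surrounding logic; only the idea of skipping resolution is taken from the post above):

```python
# Illustrative sketch only: the real ip_address_of() in Checkmk's
# core_config.py has a different signature and more context.
# The point is merely the short-circuit: never resolve, return a dummy IP.

def ip_address_of(hostname: str) -> str:
    # Workaround for a cmcdump-only receiver that never contacts hosts
    # itself: skip DNS entirely and return a placeholder address.
    return "0.0.0.0"

print(ip_address_of("HOST-asp"))  # -> 0.0.0.0, no DNS traffic
```

On a purely passive cmcdump receiver this is harmless because the address is never used for actual checks; on any normal site it would break monitoring, hence "not feasible in a mixed environment".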

I will open a separate topic for this problem, as I think it is a real bug.

Thank you, I’ve seen your new topic and @SergejKipnis is already on it.

I suppose the issue this thread is about should also be classified as a “real bug” - either the 2.0->2.1 cmk-update-config is missing a step and leaves behind a config that takes ages for the protobuf code to parse, or the new “incremental compile” is broken under certain circumstances.

As mentioned, I didn’t do anything but update 2.0 → 2.1 according to the documentation, and instead of faster activation I get slower activation (I’ve sometimes seen activation times of 78 seconds, even when no patch release was installed immediately prior). Anything that is not caused by deliberate action or (mis-)configuration on my part is a “real bug” in my understanding.

I understand what you mean. But I don’t have the slow-activation problem on my systems, which tells me it is something specific to your setup and not a general problem.

Hi @bitwiz,

I sent this mail to Checkmk as a bug report regarding this topic.

Possible error with Checkmk 2.1.0p9 (or maybe an earlier version):

Using the new incremental activation of changes for hosts (add, edit, discover, delete) takes the same time as in earlier versions. As suggested in the forum, I ran the command cmk --debug -vvR | ts '[%Y-%m-%d %H:%M:%S]' and found that even if I delete a single host, the agent bakery does something (I don’t know what; it isn’t baking, though) for all hosts. This slows down the process by a factor of 4. Disabling the option “Automatically create monitoring agents” in Global Settings did the trick. Activating changes is now down to a few seconds.

The inline Help there shows: “When this option is enabled then updated monitoring agents are automatically being created whenever configuration changes are activated and the agents are outdated. Agents are being updated if their configuration has changed (due to some change in the rules of the monitoring agents).”

I think Checkmk cannot identify whether or not a rule change actually affects the agents.
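For what it's worth, one way such a check could work - purely a sketch of the idea, not Checkmk's actual implementation, and with made-up rule names - is to fingerprint only the bakery-relevant ruleset and rebake only when that fingerprint changes:

```python
import hashlib
import json

def ruleset_fingerprint(agent_rules: dict) -> str:
    # Stable hash over the agent-related ruleset only; host add/delete
    # operations would not touch this input at all.
    canonical = json.dumps(agent_rules, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def needs_rebake(previous_fp: str, agent_rules: dict) -> bool:
    return ruleset_fingerprint(agent_rules) != previous_fp

# Hypothetical rule data, just to exercise the comparison:
rules = {"agent_port": 6556, "plugins": ["mk_inventory"]}
fp = ruleset_fingerprint(rules)

assert not needs_rebake(fp, rules)                 # host deleted, agent rules unchanged
assert needs_rebake(fp, {**rules, "plugins": []})  # an actual agent rule change
```

With a scheme like this, deleting a host would leave the fingerprint untouched and the bakery would skip all work, which is the behavior one would intuitively expect from "incremental" activation.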

Please correct me if I am wrong, so I can take a look at my rules and check whether it is an error or not.

As stated above, disabling the agent bakery trigger reduced the time by 80%.
Hope this helps :)

regards

Steffen

I think this makes sense. On the systems I maintain, this setting is always turned off.
This has a simple reason - unsigned agents are a very bad thing and produce too many error messages. :D


Unfortunately, the agent baking setting does not noticeably change the behavior for us. We’re still looking at 40 seconds or more for every activation, no matter what changes are made and how many hosts/services are affected by them.

I’m wondering how we would go about zooming in on the issue - back up the entire instance, restore it on another system, move all custom content (custom checks, web/wato, …) out of ~/local/, and run cmk -R to get a “vanilla” Checkmk install?

Why are we talking about cmk -R in this thread, by the way - doesn’t a GUI “Activate” trigger a cmk -O by default, reloading instead of restarting? cmk -O = 26 seconds, cmk -R = 40 seconds (the latter entails shutting down all processes and spawning an entirely new master process that has to read all the filesystem config and state files anew). So apparently a restart happens every time we do a GUI activation, even for typo fixes in host aliases. The reload duration would be more palatable, but is still 2-3x as slow as with 2.0.x and previous versions.

Since the agent has grown in size in 2.1 due to the agent controller, and compressing the package takes some time, the behavior is explainable. There is a setting to control for which platforms the agents are built, which can improve the situation.
But this does not help the original requester.

If you do this with “-vv”, where does it take so long?

For the old Nagios core you have the option of whether a reload or a restart is used at activation time.

What do you see for this option on your system?
→ Parallelize core config creation

Agent baking is already disabled here. Having this activated by default is also a bad design decision.

[2022-10-11 08:40:54]  Creating cmc protobuf configuration...Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/Telekom-Router
[2022-10-11 08:40:54] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/Telekom-Router
[2022-10-11 08:40:54] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/Telekom-Router
[2022-10-11 08:40:54] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/Telekom-Router
...
[2022-10-11 08:41:21] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/WAN-HOST1
[2022-10-11 08:41:21] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/WAN-HOST1
[2022-10-11 08:41:21] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/WAN-HOST1
[2022-10-11 08:41:21] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/core_host_config/WAN-HOST1
[2022-10-11 08:41:21] Trying to acquire lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/helper_cache_info
[2022-10-11 08:41:21] Got lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/helper_cache_info
[2022-10-11 08:41:21] Releasing lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/helper_cache_info
[2022-10-11 08:41:21] Released lock on /omd/sites/MYSITE/var/check_mk/core/helper_config/2760/helper_cache_info
[2022-10-11 08:41:21] OK
[2022-10-11 08:41:22] /omd/sites/MYSITE/var/check_mk/core/config.pb written.
[2022-10-11 08:41:22] Reloading monitoring core...OK
[2022-10-11 08:41:22] Releasing lock on /omd/sites/MYSITE/etc/check_mk/main.mk
[2022-10-11 08:41:22] Released lock on /omd/sites/MYSITE/etc/check_mk/main.mk

This is for cmk -vvO; pretty much the entire duration is spent iterating through the helper_config files for every host - 27 seconds in this example.

Now one could assume this is bottlenecked by the server’s disk IO, but we’re writing to SSDs, not HDDs (and the system is not virtualized but installed on bare metal), so random disk access should be more than fast enough.

And the “incremental” compilation, the innovation new in 2.1, is nowhere to be seen in any case if it needs to touch all of those files.

“Parallelize core config creation” is still at its default, so “on”.

As mentioned, agent baking is not the issue for us; we’ve set up the Agent OS rule to bake only Windows and Linux agents, which takes a few seconds on top and is not the problem here. We’ve always had the auto-baking setting enabled in the past (and intend to re-enable it) because we do not need to sign the agents anyway: the agent bakery does not work for us, since we do not allow access from less trusted networks to trusted networks, so the agents cannot reach the Checkmk server, only the other way around.

@schlarbm was correct in August - the issue was with the password store in our case as well.
Once you store a password in there (even if it is never referenced), both files are opened (and the passwords inside decrypted) once for every active check in your environment during change activation. With several thousand active checks in our case, this quickly added up to many seconds.
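To illustrate why this adds up - a toy model, not Checkmk's actual code, with "decryption" modeled as a plain file read - opening the store once per active check scales linearly with the check count, while reading it once and caching does not:

```python
import os
import tempfile

class PasswordStore:
    """Toy model of a store whose 'decryption' is a plain file read."""

    def __init__(self, path: str):
        self.path = path
        self.reads = 0       # counts file opens ("decryptions")
        self._cache = None

    def load_uncached(self) -> str:
        # One open + read per call: the per-active-check behavior.
        self.reads += 1
        with open(self.path) as f:
            return f.read()

    def load_cached(self) -> str:
        # Read once, then serve from memory.
        if self._cache is None:
            self.reads += 1
            with open(self.path) as f:
                self._cache = f.read()
        return self._cache

# Create a stand-in secrets file.
with tempfile.NamedTemporaryFile("w", suffix=".secret", delete=False) as f:
    f.write("hunter2")
    path = f.name

store = PasswordStore(path)
for _ in range(3000):            # one lookup per active check
    store.load_uncached()

cached = PasswordStore(path)
for _ in range(3000):
    cached.load_cached()

print(store.reads, cached.reads)  # 3000 vs 1
os.unlink(path)
```

With real decryption on top of each open, a few microseconds per call easily becomes tens of seconds at this scale, which matches the strace observation earlier in the thread.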

Back in August, when I stated that we weren’t using the password store, I hadn’t fully verified that we did indeed not use the feature. I knew that I personally, as the main Checkmk administrator, had not introduced it, and that there was no configuration referencing the store in our environment. Unfortunately, a colleague had added a password there in the past “just for testing”, so we were affected without even benefiting from the password store.

The issue has been raised with the Checkmk devs, so I expect it to be resolved in a future release of 2.1.


When will this be fixed, @TLI? Every change activation - even if it is just one rule/host added - takes 35 seconds. Much too long, and not what is promised on your website.