After upgrading Checkmk from version 2.2.0p37 to 2.3.0p24 on January 22, 2025, we have noticed significantly increased CPU usage on all monitoring servers.
Unfortunately this seems to be normal. While the upgrade instructions always talked about a small increase in load (like 10 to 20%), I and several others encountered increases more like 100 to 200%. See this post for example.
Ultimately I ended up splitting up my central site into two sites & increasing the resources of several of my less powerful remote sites.
The latter; that’s what the second paragraph in my reply above was supposed to convey.
I don’t think there is a solution. To me it seems like either the update to Python 3.12, which came with CMK 2.3, or something in CMK’s code itself caused the major performance decrease. That’s nothing we users can do anything about.
I just want to say, there can be tons of configuration errors or issues adding to this. @mbunkus made some valid points towards Python itself, but the 100+% increases are certainly not only down to pure application performance.
To be clear: I am not trying to blame anyone, we are taking this very seriously and strive to improve performance, where we can. But sometimes it is not (only) about the software, but also about the configuration.
@robin.gierse:
Thanks for your comment! Do you have a more detailed approach regarding potential configuration errors that could be contributing to the performance issues?
We have now successfully resolved the issue on our side.
The following measures helped:
Changing the Proxmox CPU type to x86-64-v2-AES (available since Proxmox version 8). We implemented this change yesterday morning, which roughly halved the load (see graphs).
Disabling the “Symmetric encryption” rule. This encryption has already been active by default with agent registration since CMK Agent version 2.2. After disabling it, the load is now even lower than before the upgrade to 2.3. We made this change today (see graphs).
Hopefully, our solution helps others facing similar issues. Good luck!
Just as an aside: CPU type mismatch can easily tank performance, sure; we’ve had that in the past with our video conferencing servers (BigBlueButton). Wasn’t the cause of higher load in our own case.
You could also set the CPU type to “host”, which basically means “no emulation at all”, which should perform the best. Only do this if you don’t have a cluster, or if all cluster members have the same CPU type, of course (or if you don’t care about live migration).
That is a very interesting point, though. Just took a look, it’s still active on our end. I will give turning it off a try.
Preliminary results seem to indicate that it drops CPU usage by about half (meaning down from 60% total to 30% total), which is incredibly nice to see. Very good catch, & thanks for posting about it.
The reason for the performance degradation in Checkmk 2.3 with symmetric encryption is the iteration count of the key derivation function. The Checkmk 2.2 agent uses 10,000 iterations, while Checkmk 2.3 uses 600,000 iterations.
On my development workstation (with an older AMD Ryzen 7 1700 CPU), the cost is:
$ time openssl enc -aes-256-cbc -md sha256 -pbkdf2 -iter 10000 -k 12345678 -P
[...]
real 0m0,022s
user 0m0,017s
sys 0m0,005s
$ time openssl enc -aes-256-cbc -md sha256 -pbkdf2 -iter 600000 -k 12345678 -P
[...]
real 0m0,489s
user 0m0,484s
sys 0m0,005s
The CPU time needed for key derivation increases by a factor of roughly 30. On this machine, monitoring 1000 agent-based hosts with symmetric encryption every minute takes about 8 CPU cores for key derivation alone (1000 × 0.489 s ≈ 489 CPU seconds per 60-second interval).
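The core estimate follows directly from the timings above. Here is a minimal sketch of the arithmetic in Python, using the CPU times measured on this workstation. The function name is made up for illustration, and it assumes exactly one key derivation per host per check interval:

```python
def cores_needed(hosts: int, cpu_seconds_per_derivation: float,
                 interval_seconds: float) -> float:
    """CPU cores kept busy if every host costs one key derivation per interval."""
    return hosts * cpu_seconds_per_derivation / interval_seconds


# user + sys times taken from the two openssl runs above
OLD_COST = 0.017 + 0.005  # 10,000 iterations
NEW_COST = 0.484 + 0.005  # 600,000 iterations

old_cores = cores_needed(1000, OLD_COST, 60)
new_cores = cores_needed(1000, NEW_COST, 60)
user_time_ratio = 0.484 / 0.017  # pure computation, without process startup

print(f"10,000 iterations:  {old_cores:.2f} cores")
print(f"600,000 iterations: {new_cores:.2f} cores")
print(f"user time ratio:    {user_time_ratio:.1f}x")
```

With these numbers the estimate goes from well under half a core to a little over 8 cores. Plug in your own host count and timings to see where your site lands.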
If the Agent Controller with TLS encryption is available, use that instead. The built-in symmetric encryption should only be used if TLS is not available. Moreover, there is no advantage in using both. Disable the symmetric encryption if you can use TLS.
So when TLS is used it is safe to disable the encryption rule.
The commit also includes the increase of the PBKDF2 iteration count, which now matches the OWASP docs.
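To get a feel for what that iteration change costs on your own hardware without the openssl CLI, something like the following Python sketch should do. It is not Checkmk's code. It assumes PBKDF2-HMAC-SHA256 and a 48-byte output (32-byte AES-256 key plus 16-byte IV, as `openssl enc` derives them). The helper names, the secret and the salt length are made up for illustration.

```python
import hashlib
import os
import time


def derive(secret: bytes, salt: bytes, iterations: int) -> bytes:
    # Assumed layout: 32-byte AES-256 key + 16-byte IV = 48 bytes.
    return hashlib.pbkdf2_hmac("sha256", secret, salt, iterations, dklen=48)


def cpu_seconds(iterations: int) -> float:
    """CPU time spent on a single key derivation."""
    secret = b"12345678"   # placeholder secret, same as in the openssl test
    salt = os.urandom(8)   # random 8-byte salt (assumed length)
    start = time.process_time()
    derive(secret, salt, iterations)
    return time.process_time() - start


fast = cpu_seconds(10_000)    # iteration count reported for the 2.2 agent
slow = cpu_seconds(600_000)   # iteration count reported for 2.3

print(f"10,000 iterations:  {fast:.4f} s CPU")
print(f"600,000 iterations: {slow:.4f} s CPU")
if fast > 0:
    print(f"ratio: {slow / fast:.0f}x")
```

Absolute numbers will differ from the openssl measurement because there is no process startup involved, but the ratio between the two runs should point in the same direction.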
Looking at the code in that commit also explains the impact of the Proxmox CPU type: the AES extensions come in handy when using aes-256-cbc. It would be interesting to see how much of the CPU type's impact remains when symmetric encryption is disabled, although depending on the SSL cipher it may still be visible.
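If you are unsure whether your VM actually sees the AES extension, you can check the CPU flags from inside the guest. This is a small sketch that assumes an x86 Linux guest, where `/proc/cpuinfo` has a `flags` line and the extension shows up as `aes`; the function name is made up:

```python
from pathlib import Path


def has_cpu_flag(cpuinfo_text: str, flag: str = "aes") -> bool:
    """Return True if the first 'flags' line in cpuinfo lists the given flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Compare whole words so that e.g. "vaes" does not count as "aes".
            return flag in value.split()
    return False


try:
    text = Path("/proc/cpuinfo").read_text()
except OSError:
    print("/proc/cpuinfo is not readable here (not a Linux system?)")
else:
    print("AES extension visible to this system:", has_cpu_flag(text))
```

If this prints False inside the VM while the host CPU does support AES, the configured CPU type is hiding the extension from the guest.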
The OWASP recommendations refer to password storage. In web-based applications, login passwords are typically short and of low complexity so that humans can memorize them. Here, a high number of iterations is an effective protection against brute force attacks on stolen password hashes.
The Checkmk agent encryption secret is typically a long random string. For example, when created with pwgen -s 48 1, its entropy even exceeds the 2^256 key space of the AES-256 cipher, so it cannot be brute-forced regardless of the iteration count.
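The entropy claim is easy to check. This rough sketch assumes that `pwgen -s` picks each character uniformly from upper-case letters, lower-case letters and digits (62 symbols, no special characters):

```python
import math

ALPHABET_SIZE = 26 + 26 + 10  # assumption: a-z, A-Z, 0-9
LENGTH = 48                   # from "pwgen -s 48 1"
AES_KEY_BITS = 256

# Entropy of a uniformly random string: length * log2(alphabet size)
secret_bits = LENGTH * math.log2(ALPHABET_SIZE)

print(f"secret entropy: {secret_bits:.1f} bits")
print(f"AES-256 key:    {AES_KEY_BITS} bits")
print("secret exceeds AES key space:", secret_bits > AES_KEY_BITS)
```

Under that assumption the secret carries roughly 286 bits, so guessing the secret is harder than guessing the AES key directly, with or without key stretching.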
An easy solution would be for the Checkmk agent from 2.3 onward to offer an additional configuration option in the /etc/check_mk/encryption.cfg file, such as
ENCRYPTION_SCHEME=03
which could be used to explicitly revert to one of the older encryption schemes.