Significant CPU Increase After Upgrading to Checkmk 2.3.0p24

Checkmk Managed Services Edition 2.3.0p24:
Debian 11

Hi everyone,

After upgrading Checkmk from version 2.2.0p37 to 2.3.0p24 on January 22, 2025, we have noticed significantly increased CPU usage on all monitoring servers.

Do you think this behavior is normal?

Setup:

  • 1 Central Instance - 564 hosts - 32GB RAM / 16 vCPUs
  • 1 larger Remote Instance - 152 hosts - 16GB RAM / 8 vCPUs
  • 5 additional Remote Instances - Each monitoring 10-65 hosts

[Screenshot: Core-Statistics]

Best regards
Marcel

Unfortunately, this seems to be normal. The upgrade instructions have always talked about a small increase in load (around 10 to 20%), but several others and I have seen increases closer to 100 to 200%. See this post for example.

Ultimately I ended up splitting up my central site into two sites & increasing the resources of several of my less powerful remote sites.

Hi Moritz,

thanks for your message.

Have you already found a solution to reduce the CPU load a bit? Or have you set up an additional site and offloaded hosts there?

The latter; that’s what the second paragraph in my reply above was supposed to convey.

I don’t think there is a solution. To me it seems like either the update to Python 3.12, which came with CMK 2.3, or something in CMK’s code itself caused the major performance decrease. That’s not something we users can do anything about.

I just want to say that there can be tons of configuration errors or issues adding to this. @mbunkus made some valid points about Python itself, but the 100+% increases are certainly not down to pure application performance alone.

To be clear: I am not trying to blame anyone. We are taking this very seriously and strive to improve performance where we can. But sometimes it is not (only) about the software, but also about the configuration.

@robin.gierse:
Thanks for your comment! Do you have a more detailed approach regarding potential configuration errors that could be contributing to the performance issues?

We have now successfully resolved the issue on our side.

The following measures helped:

  • Changing the Proxmox CPU type to x86-64-v2-AES (available since Proxmox version 8). We implemented this change yesterday morning, which roughly halved the load (see graphs).
  • Disabling the “Symmetric encryption” rule. This encryption has already been active by default with agent registration since CMK Agent version 2.2. After disabling it, the load is now even lower than before the upgrade to 2.3. We made this change today (see graphs).
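
For anyone who wants to check whether their own VM is affected by the CPU type point, here is a minimal Python sketch that looks for the x86-64-v2 feature flags plus AES in /proc/cpuinfo-style text. The function names are made up for illustration, and the flag list assumes the Linux kernel's naming (for example "pni" for SSE3). A guest that does not expose "aes" has no hardware AES acceleration, which would plausibly make any encryption work more expensive; whether that is the cause in your setup is an assumption to verify.

```python
# x86-64-v2 feature names as they appear in the "flags" line of
# /proc/cpuinfo on Linux ("pni" is the kernel's name for SSE3).
X86_64_V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}
AES_FLAG = "aes"


def parse_cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_v2_aes_flags(cpuinfo_text):
    """Return the x86-64-v2 and AES flags the guest does not expose, sorted."""
    flags = parse_cpu_flags(cpuinfo_text)
    return sorted((X86_64_V2_FLAGS | {AES_FLAG}) - flags)


def check_local_cpu(path="/proc/cpuinfo"):
    """Read the local cpuinfo file (Linux only) and return the missing flags."""
    with open(path) as handle:
        return missing_v2_aes_flags(handle.read())
```

Calling `check_local_cpu()` inside the monitoring VM returns an empty list when all of these flags are visible to the guest. If "aes" shows up in the result, the CPU type is worth a look.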

Hopefully, our solution helps others facing similar issues. Good luck!

Just as an aside: CPU type mismatch can easily tank performance, sure; we’ve had that in the past with our video conferencing servers (BigBlueButton). Wasn’t the cause of higher load in our own case.

You could also set the CPU type to “host”, which basically means “no emulation at all”, which should perform the best. Only do this if you don’t have a cluster, or if all cluster members have the same CPU type, of course (or if you don’t care about live migration).
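
A rough way to check the "same CPU type on all cluster members" condition is to compare the CPU model names the nodes report. The Python sketch below assumes you have already collected the /proc/cpuinfo text from each node yourself; the function names and the node-name-to-text mapping are hypothetical. An identical model name is only a conservative proxy, not a guarantee that live migration will work.

```python
def cpu_models(cpuinfo_text):
    """Return the distinct 'model name' values in /proc/cpuinfo-style text."""
    models = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "model name":
            models.add(value.strip())
    return models


def same_cpu_model_everywhere(cpuinfo_by_node):
    """Return True if all nodes together report exactly one CPU model.

    cpuinfo_by_node maps a node name to that node's /proc/cpuinfo text.
    """
    all_models = set()
    for text in cpuinfo_by_node.values():
        all_models |= cpu_models(text)
    return len(all_models) == 1
```

If this returns False, a named CPU type such as x86-64-v2-AES is the safer choice when live migration matters.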

That is a very interesting point, though. Just took a look, it’s still active on our end. I will give turning it off a try.

Preliminary results seem to indicate that it drops CPU usage by about half (meaning down from 60% total to 30% total), which is incredibly nice to see. Very good catch, & thanks for posting about it.
