Check_MK performance

gera83 · February 2, 2023, 4:58pm

Hello guys!
You know, back in October 2019, I asked an honest question: which Check_MK version “works better”? I meant only resource management and efficient monitoring (agent or SNMP), plus the most responsive UI.

Andreas answered me beautifully:

The biggest point is the amount of supported devices.
1.2.6 → 660 checks
1.2.8 → 970 checks
1.4.0 → 1100 checks
1.5.0 → 1330 checks
1.6.0 → 1470 checks
The “work better” depends again on your scenario of usage.
I have some customers who are heavy users of the reporting feature and there are also some improvements over time.
Also the graphing inside the enterprise edition is a huge step forward for better prediction and to visualize your data.
If you only look at the resources needed for your monitoring server, then it is possible that 1.4 looks better than 1.5 or 1.6.

Well, I’m back, 4 years later, because I’m migrating all my Check_MK VMs (CMK servers), with lots of versions and lots of sites, to a PHYSICAL server. Lots of cores, lots of RAM.

So, my question. Can you share minimal experiences?
Forget about the number of checks, because if I need to monitor K8S, I will surely need 2.1.0, which I’M DOING.

I’m talking about simple things:

Server monitoring (with agent). All windows. All linux.
Dell iDracs / Cisco Switches (SNMP).
VMware ESXi monitoring

Focus on the server side, not the clients, because I will never update 3000 agents (servers). I have 1.2.6p16 on 3000 servers. It works perfectly with any server version, even with a 2.1.0 server, thank god.

So, back to the question:
Are you noticing that 2.1.0 is really BETTER / faster?
Or do you notice that you get better monitoring performance with 1.4.0 / 1.5.0 / 1.6.0?
Maybe 2.0?

Minimal experiences guys, like:
“I tried 2.1.0 with iDRACs, but it is really slow, even after lowering the time between checks. I will stick with 1.6.0 for now”
“2.0 is the fastest in everything = UI, CPU and memory consumption. But I keep a 1.5.0 instance for Fortinet firewalls, because…”
“I will never let go of my 1.5.0 sites, they are extremely fast. 2.0.0 is slow as hell”

Everything is useful !!

Thank you all in advance!!

Anders · February 2, 2023, 5:59pm

Impossible to answer, as hardware is getting faster, but so are networks, storage, firewalls, kernels etc.
We have millions of services, so your setup is quite tiny in our world.

We upgrade to stay compliant, so for us having old agents is not acceptable.
We dumped the VM path years ago. Only physical servers. We even dumped SSDs as they are too slow.
Will 2.1 be better? Have no *€%&€ clue. We just have to find out the hard way.

martin.hirschvogel · February 2, 2023, 7:46pm

As Anders says, it is actually impossible to answer, also because the setups and requirements in the Checkmk userbase are so widely different. As you are looking for user experiences, here is something I remember:

In general, we are always working on improving the performance of Checkmk. Every version has to be better than the one before, simply because we have pretty large installations out there with millions of services and hundreds of users. If we didn’t continuously try to improve performance, that wouldn’t work out.

As a recap, there were a couple of major changes in recent versions.
In Checkmk 2.0, a major piece was rearchitected: the helpers were split into fetchers and checkers.
Also activate changes was improved in 2.0 with incremental sync of configs of remote sites (no more full config sync). Activate changes was further reworked in 2.1 to activate typical changes (like adding hosts) much quicker.
Also the search bars in Checkmk 2.0 are leveraging Redis, and thus are pretty quick.
We also went from Python 2.7 to Python 3.8 in Checkmk 2.0, which probably didn’t do much for performance. But rumor has it that Python 3.11, which will ship with Checkmk 2.2, is much faster.

There are many further changes, like adding caches here and there, which can have a lot of impact.

gera83 · February 2, 2023, 11:01pm

No need to get mad :). I was just looking for minimum real user experience.
It’s really really great to read experiences, apart from my own.

gera83 · February 2, 2023, 11:01pm

Fabulous information. Very interesting
Thanks!!

LaSoe · February 7, 2023, 8:55pm

The overall resource consumption and the number of checks you can run on the Core has improved a lot and the activation of certain changes is now also much faster. The performance of activating our daily work changes (rules, thresholds, etc.) has not really improved from our subjective point of view.

If you only need the OS metrics, the agents do a great job. When you have additional local checks, you quickly exceed the 1 min default interval because the Unix-like agents still run everything sequentially (and no, async is not a solution, it’s a workaround). In terms of performance, the Unix-like agents have not improved much in the last 5 years. The Windows agent, on the other hand, has been rebuilt and in my opinion now offers better extensibility and control options than the Unix-like agents.
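To illustrate the point about sequential execution, here is a minimal sketch. It is not agent code: the check names and runtimes are made up for illustration, and the 60 s value is the 1 min default interval mentioned above. When checks run one after another the total runtime is the sum of all runtimes, while an idealized concurrent run would be bounded by the slowest check.

```python
# Hypothetical local checks and their runtimes in seconds (illustrative, not measured).
LOCAL_CHECK_RUNTIMES = {
    "db_backup_age": 25.0,
    "cert_expiry": 12.0,
    "queue_depth": 18.0,
    "app_health": 9.0,
}

CHECK_INTERVAL = 60.0  # the 1 min default interval, in seconds


def sequential_runtime(runtimes):
    """Total runtime when the checks run one after another."""
    return sum(runtimes.values())


def concurrent_runtime(runtimes):
    """Idealized lower bound if all checks could run at the same time."""
    return max(runtimes.values(), default=0.0)


def exceeds_interval(runtime, interval=CHECK_INTERVAL):
    """True if one agent run takes longer than the check interval."""
    return runtime > interval


if __name__ == "__main__":
    seq = sequential_runtime(LOCAL_CHECK_RUNTIMES)
    con = concurrent_runtime(LOCAL_CHECK_RUNTIMES)
    print(f"sequential: {seq:.0f}s, over interval: {exceeds_interval(seq)}")
    print(f"concurrent: {con:.0f}s, over interval: {exceeds_interval(con)}")
```

With these assumed numbers, four checks that each look harmless already add up to more than one interval when run in sequence.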

What bothers us the most in our daily work within Checkmk is working with WATO, which has become slower with each release, especially since the introduction of the new GUI with 2.0. After many bug tickets, some things have improved significantly, but there is still a long way to go before we get a truly responsive GUI.

And finally, of course, it should be mentioned that many new functions like the REST API and check improvements have been added, which speak for an upgrade to 2.x. But before upgrading to a 2.x version, you should test it carefully ;-). We were very grateful that we had a very competent tribe29 employee on call during our major release upgrades.

This is my personal, subjective observation based on our environment (2.1.cme, 2000 hosts, 200,000 services, 30 sites, all running on the same bare metal server).
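For a rough sense of scale, the environment figures above can be turned into load numbers. This is a back-of-the-envelope sketch only: it assumes every service is checked once per 60 s (the 1 min default interval mentioned earlier) and that services are spread evenly across sites, neither of which is stated in the post.

```python
HOSTS = 2000
SERVICES = 200_000
SITES = 30
CHECK_INTERVAL_SECONDS = 60  # assumption: every service is checked once per minute


def services_per_host(services, hosts):
    """Average number of services per monitored host."""
    return services / hosts


def checks_per_second(services, interval_seconds):
    """Average check rate if every service is checked once per interval."""
    return services / interval_seconds


def services_per_site(services, sites):
    """Average services per site, assuming an even distribution (an assumption)."""
    return services / sites


if __name__ == "__main__":
    print(f"services per host: {services_per_host(SERVICES, HOSTS):.0f}")
    print(f"checks per second: {checks_per_second(SERVICES, CHECK_INTERVAL_SECONDS):.0f}")
    print(f"services per site: {services_per_site(SERVICES, SITES):.0f}")
```

Under those assumptions this works out to about 100 services per host and a few thousand check results per second on the one bare metal server.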

gera83 · February 7, 2023, 10:04pm

Hi Lars! Fabulous!

YES, well, that’s why we keep using 1.2.6p16 agents. There is nothing new in the agents, just better feature management (enable/disable stuff). So we run 1.2.6p16 with everything related to event logs totally disabled (we handle server logs with another product). It’s quick as hell. On the server side we have a lot of sites and a lot of versions :).

Then, UI / WATO. I would love to see more responsiveness, but I guess it is not that easy. It’s quite funny, because I LOVE MK files. I know tribe29 encourages people to use WATO, but I prefer MK files 1000%. I’m a Linux person. It’s easier to find / change things and reload with MK files. I hate WATO XD. I’ve been using it since 1.2.6 and I’m still using it in 2.1.0.

New versions = new functions and check improvements / test / review. Yes. That’s why I use specific versions when I have to monitor specific things (2.1.0 with Kubernetes, for example).

Thanks again!
