CPU saturation, OOM, and stale hosts

Good morning everyone,
I installed Checkmk 2.4.0p20 Raw edition on an Ubuntu 24.04 LTS server with 4 vCPUs and 8 GB of RAM.
I have about 320 active hosts with ~7,000 services. My problem is that the services and hosts keep crashing.
Checking the server-side processes, I noticed that many Python 3 scripts are being executed, and the plugin installed on the host isn’t being used. Could you help me?

Since you are using the Raw edition, there is one important thing you should do on your system.

  • Configure the number of parallel checks inside the Nagios core
    • ~/etc/nagios/nagios.d/timing.cfg –> max_concurrent_checks=0 should be changed to a value no higher than twice the number of CPU cores available
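
As a rough illustration of that rule of thumb, here is a minimal Python sketch that derives the upper bound from a core count and prints the matching config line. The function names are made up for this example, and the factor of 2 is just the guideline from this post, not an official formula.

```python
def suggested_max_concurrent_checks(cpu_cores: int, factor: int = 2) -> int:
    """Upper bound from the rule of thumb: at most `factor` x CPU cores.

    `cpu_cores` means real cores (or vCPUs on a VM), not hyper-threading
    threads. Both function names here are hypothetical helpers.
    """
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return cpu_cores * factor


def timing_cfg_line(cpu_cores: int) -> str:
    """Format the setting as it would appear in timing.cfg."""
    return f"max_concurrent_checks={suggested_max_concurrent_checks(cpu_cores)}"


if __name__ == "__main__":
    # The server in this thread has 4 vCPUs.
    print(timing_cfg_line(4))
```

For the 4 vCPU server from the first post this gives max_concurrent_checks=8 as the highest value to try.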

I would say 4 cores is far too few for the Raw edition.

If I do this, don't I risk losing checks or having services go stale?
What would the ideal value be?

If you get stale services, you know that you need more CPU cores.

As I said, no more than twice the number of CPU cores.

Or no higher than the number of hyper-threading threads?

You can also put it that way. But it only applies to the Raw edition ^^

The Ansible facts include processor_cores and processor_nproc. For VMs both are the same (because processor_threads_per_core is 1).

It is easier to just use processor_nproc to set max_concurrent_checks.

Thanks a lot for the tips.

Do you have any more tips for getting the best performance?

A while ago I did some testing on how efficient hyper-threading is by compiling a rather demanding C++ application in parallel on an AMD Ryzen 9 5950X, which has 16 real CPU cores, 32 with HT. Due to how C++ works, each compilation unit (.cpp file) took quite a number of seconds to compile, meaning the overhead from starting processes etc. was dwarfed by the raw computational need. I won’t go into details and methodology too much, but the result was pretty clear:

Going from 8 parallel processes to 16 roughly halved compilation time — as expected, as I’m using real cores here. Going from 16 to 32, though, only resulted in a 20% gain, showing how little HT can effectively achieve in this kind of scenario. I usually only consider real cores when sizing new virtualization hosts, too, for the same reason.
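
To put rough numbers on that, here is a small Python sketch using normalized build times. The times are illustrative values, not the actual measurements, and it assumes "20% gain" means 20% less wall-clock time. The helper name is made up for this example.

```python
def doubling_efficiency(time_before: float, time_after: float) -> float:
    """Fraction of the ideal 2x speedup achieved when doubling the workers."""
    speedup = time_before / time_after
    return speedup / 2.0


# Normalized, illustrative build times (assumptions, not measured values):
t8, t16 = 1.0, 0.5        # 8 -> 16 processes: time roughly halved
t32 = t16 * (1 - 0.20)    # 16 -> 32 processes: assumed 20% less time

print(f"8 -> 16:  {doubling_efficiency(t8, t16):.0%} of ideal")
print(f"16 -> 32: {doubling_efficiency(t16, t32):.1%} of ideal")
```

Under those assumptions, the step onto real cores achieves the full doubling, while the step onto HT threads yields a speedup of about 1.25x, well short of 2x.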

Which interval are you using for service checks? If you are using the default, you can try increasing it from 1 to 5 minutes, which might be enough. Be aware that SNMP devices might require more time and processing power than a standard Checkmk agent.
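
To get a feel for what the longer interval buys, here is a back-of-the-envelope Python sketch. It assumes roughly one active check per host per interval for the 320 hosts mentioned above. That is a simplification, since pings, discovery and other active checks add to the total. The function name is made up for this example.

```python
def executions_per_minute(active_checks: int, interval_minutes: float) -> float:
    """Average check executions the core must start per minute."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return active_checks / interval_minutes


hosts = 320  # from the first post; assumed ~1 active check per host

for interval in (1, 5):
    rate = executions_per_minute(hosts, interval)
    print(f"{interval} min interval: about {rate:.0f} executions per minute")
```

Going from 1 to 5 minutes cuts the average rate to one fifth (320 versus 64 executions per minute under this assumption), at the cost of slower detection.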

Can you tell me the name of the setting for the interval?

Look for “Normal check interval for service checks”