CPU Utilisation and Optimisation on CheckMK Raw

CMK version: 2.2.0p19
OS version: Ubuntu 22.04

Hello,

So I’m getting less only 50% CPU utilisation on my CheckMK raw instance.
Currently I’m running it on AWS from a docker container in an EC2 instance on an m6i.xlarge (4 vCPU and 16 RAM). I know that the RAM is not needed and will probably move to a compute optimised instance. Although I suspect I’ll still face the same issue - High load with less than 50% CPU utilisation (screenshots below). Is it possible to make the checks run in paralel on or in groups on the raw version (Example first 50 hosts run in 1 minute, than on the 2nd minute the next 50 hosts run, not all at once every minute).
Currently I’m running 1500 services and 150 hosts, but I’ll be adding over 4x the amount of services
and probably double the hosts.
Adding more hosts make the system unstable (UI starts to begin slugish) and makes the load go to 20, 25, 30.


Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

This is a Nagios issue. You can extend the check interval for the Check_MK service check (that pulls the agent data) to 2 or 3 minutes.
The general recommendation is to leave Nagios and go with the Enterprise edition with more than 100 hosts.
Anither solution would be to create multiple sites. Each runs its own Nagios core. You would have to distribute your hosts evenly across the sites.

1 Like

Setup | Maintenance | Analyze configuration gives some hints to finetune some settings.
Click on the blue circles to get more info.

GUI being slugish could be an Apache process usage issue.

In my case, I need to look into the number of Apache processes. Probably need to lower them a bit.

More info on Analyze configuration:

3 Likes

Wont extending the check interval only decrease the number of times the hosts and services get checked? This sounds to be beneficial only when I’m getting CPU limited (100% not 50%, I may be completely wrong on how I understand this). Also if I go with running multiple sites, can I specify running one site after another?

No

It looks more like an I/O issue on your machine. A load of 3,5 in average with 4 cores and only 40% utilization means that your processes need to wait for execution.
If it is a disc or network I/O problem, is not possible to say with the shown data.

Opening this make me saw that the life status usage is full :


I’ve also increased the number of apache processes, but I don’t think the sluggishness is being caused by that, more because of no CPU availability.

Don’t increase - you need to decrease it :slight_smile:
The livestatus 100% is normal für RAW edition. There is no real livestatus usage.
This is only usable inside enterprise with CMC core.

2 Likes

Well currently the EC2 instance is running on GP3 volume (which is SSD).
This is what I can see from AWS monitoring and CheckMK for Disk utilisation :



Network doesn’t feel to be a bottleneck as well, the instance should be able to cover 10 or 12.5GBit of network traffic, which by the looks is less than 1% currently. (Also I’m using Wireguard to connect, but I don’t think this should be causing issues.)