Sysload extremely high for no reason

CMK version: 2.4.0p10 (Raw)
OS version: Ubuntu 22.04

Hello,

For a few days now my CMK installation has gone completely wild. Checks time out frequently, hundreds of Discovery timeout errors are shown, the CPU utilization goes crazy and the Sysload is permanently extremely high.
There was no change at all to the infrastructure or to the monitoring server itself.

The CMK server has 10 CPU cores, 8 GB RAM and 4 GB swap. The CPU graph shows a lot of userspace utilization and basically no I/O wait.

When I stop the monitoring site, all of this goes away.
top shows a lot of python3 processes causing the high load.
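One possible starting point is to put the load figure in relation to the core count and to see which process states are behind it. On Linux the load average counts both runnable (R) and uninterruptible (D) tasks, so a high load with almost no I/O wait should show up mostly as R. The following is only a rough sketch, not a Checkmk tool; the function names are made up for illustration and it assumes a Linux `/proc` filesystem.

```python
import os
from collections import Counter


def load_per_core(load: float, cores: int) -> float:
    """Normalise a load average by the number of CPU cores."""
    return load / cores


def parse_state(stat_line: str) -> str:
    """Extract the state letter from the content of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count processes per state letter (R, S, D, ...)."""
    states = Counter()
    if not os.path.isdir(proc_root):
        return states
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                states[parse_state(fh.read())] += 1
        except (OSError, IndexError):
            continue  # process exited while we were reading it
    return states


if __name__ == "__main__":
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load per core (1 min): {load_per_core(one_minute, cores):.2f}")
    states = count_states()
    print(f"runnable (R): {states['R']}, uninterruptible (D): {states['D']}")
```

A load per core well above 1 together with many R processes would fit the picture of too many checks competing for CPU at the same time.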

Where do I start looking for the issue?

I use an older version and a much smaller environment, but had similar issues. I couldn’t find out what it was until I stumbled upon the following post, where limiting concurrent checks made most of my monitoring issues vanish.

Thanks @Yggy! I have limited the concurrent checks to 100, which brought the CPU utilization back to a normal state.

But the Sysload is still way too high and spikes regularly. The green line marks the point of my change.

But as is clearly visible in the 8-day view, this started suddenly without anything having changed on the monitoring server. Nothing was added to the monitoring either.

Hi, are you using checkmk on a virtual machine, e.g. VMware?

Regards, Christian

With 10 cores I would not go higher than 20 here.
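As a rough sketch of that rule of thumb: 20 for 10 cores works out to about two concurrent checks per core. The factor of 2 is only inferred from that one example and is not an official Checkmk recommendation, and the function name is made up for illustration.

```python
import os


def suggested_max_concurrent_checks(cores: int, per_core: int = 2) -> int:
    """Return an upper bound for concurrent checks.

    per_core=2 is an assumption derived from "10 cores -> at most 20".
    """
    return max(1, cores * per_core)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print(f"{cores} cores -> at most {suggested_max_concurrent_checks(cores)} concurrent checks")
```

By that measure a limit of 100 on a 10-core machine would still be about five times too high.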

It runs on VMware ESXi 7. But there hasn’t been any change to the infrastructure in weeks (no VMs added, no resources reallocated).
