CMK version: 2.0.0p18 (CRE)
OS version: Debian 10
Since we upgraded from 2.0.0p5 (CRE) to 2.0.0p17 (CRE), our Checkmk hasn't been running smoothly.
Nagios sometimes crashes (apparently because of CPU spikes), the high CPU usage leads to flapping, and the web UI is generally very slow: it takes one to ten seconds for a page to load when navigating.
It seems to be related to distributed monitoring and the slave sites, because when we disable the slave sites it runs much more smoothly (like normal). Attached is a sample config of our slave connections.
The VM frequently has high CPU peaks, which lead to flapping of all hosts and timeouts to the slave sites.
Attached a screenshot of the CPU usage of the last 6 weeks. We did the upgrade to p17 on the 12th of December.
As you can see, it has not run smoothly since then. At the points where the CPU usage drops to nearly zero, Nagios had crashed and we needed to restart the server.
Also attached the overview of our hosts and services.
We use distributed monitoring with one master site and 3 slaves.
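To put numbers on the timeouts to the slave sites, a small latency check against each site's Livestatus TCP port might help. The following is only a rough sketch under several assumptions: the host names in `SITES` are hypothetical placeholders, port 6557 is assumed, and Livestatus is assumed to be reachable over plain (unencrypted) TCP. If the connections are encrypted, the socket would need to be wrapped with TLS first.

```python
import socket
import time

# Hypothetical slave sites: replace with the real host names and Livestatus ports.
SITES = {
    "slave1": ("slave1.example.com", 6557),
    "slave2": ("slave2.example.com", 6557),
    "slave3": ("slave3.example.com", 6557),
}

# Minimal Livestatus query; the blank line terminates the request.
QUERY = b"GET status\nColumns: program_version\n\n"


def time_livestatus_query(host, port, timeout=5.0):
    """Return (seconds, response) for one query, or (None, error text) on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(QUERY)
            sock.shutdown(socket.SHUT_WR)  # tell the server the request is complete
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closed the connection after answering
                    break
                chunks.append(data)
    except OSError as exc:  # covers timeouts, refused connections, DNS errors
        return None, str(exc)
    elapsed = time.monotonic() - start
    return elapsed, b"".join(chunks).decode().strip()


if __name__ == "__main__":
    for name, (host, port) in SITES.items():
        elapsed, response = time_livestatus_query(host, port)
        if elapsed is None:
            print(f"{name}: FAILED ({response})")
        else:
            print(f"{name}: {elapsed:.3f}s -> {response}")
```

Running this from the master every few seconds during a CPU peak would show whether one particular slave is slow to answer or whether all three time out at once.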
Hardware of the Linux VM:
12 GB RAM
Is there anything we can do to achieve more stability and performance?
We already found the official documentation of distributed monitoring:
There it says: “By reducing the status host’s proof interval from the default of sixty seconds to, e.g. five seconds, you can minimize the duration of a hangup”
Where can we adjust this?