Hi,
I’ve created a RHEL7 VM with 16 GB RAM and 8 CPUs, running version 2.0.0p8. Everything I need is working great apart from the system load. I’ve configured 480 clients monitoring ~19,500 services. The load average fluctuates between 30 and 50, and it went up to 300 when I ran a bulk discovery. This load seems excessive to me. If it is “normal” for this number of services, then I’ll build a distributed setup. All the agents are version 1.6, so I’m looking at updating those anyway. Any thoughts?
With the Raw Edition (Nagios core) and 480 clients at the default check interval of 1 minute, you get 480 calls to the precompiled checks plus 480 check_icmp invocations per minute
→ roughly 1,000 process creations per minute.
Now look at the runtime of a single “Check_MK” service — this is the time the system needs to query one client. I normally assume between 0.5 and 2 seconds for a typical Linux/Windows system.
That means you need between 500 and 2,000 CPU-seconds per minute (≈1,000 checks × 0.5–2 s each). Spread across your 8 CPUs, that is between 62 and 250 wall-clock seconds to get all the work done within the 1-minute interval.
So even in the best case you need more time than you have available.
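The arithmetic above can be sketched in a few lines (the helper name and numbers are illustrative, not part of Checkmk):

```python
def wall_seconds_needed(checks_per_min: int, secs_per_check: float, cpus: int) -> float:
    """CPU-seconds of check work per minute, divided across the available cores."""
    return checks_per_min * secs_per_check / cpus

# ~1,000 process creations per minute (480 check_mk + 480 check_icmp), 8 CPUs
best = wall_seconds_needed(1000, 0.5, 8)   # 62.5 s — already over the 60 s interval
worst = wall_seconds_needed(1000, 2.0, 8)  # 250.0 s
print(best, worst)
```

Both results exceed the 60-second check interval, which is why the scheduler falls behind and the load average climbs.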
For CEE, remove the ping process creation; the rest stays nearly the same.
The overall runtime of the Check_MK service may be lower there, but that too depends mostly on the queried systems.
An extreme example: with a short agent response time (0.1 s), an 8-core system can monitor around 4,000–5,000 servers. But if you only monitor switches (3–5 s response time), you will only be happy with a maximum of about 200 hosts.
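The same logic gives a rough upper bound on host count — a back-of-envelope sketch with assumed response times, not a Checkmk sizing formula:

```python
def max_hosts(cpus: int, interval_s: float, secs_per_check: float) -> int:
    """Rough ceiling: available CPU-seconds per check interval / runtime of one check."""
    return int(cpus * interval_s / secs_per_check)

print(max_hosts(8, 60, 0.1))  # fast agents: 4800
print(max_hosts(8, 60, 4.0))  # slow SNMP switches: 120
```

In practice you want to stay well below these ceilings to leave headroom for discovery runs and other processes on the server.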
Whether such a simple calculation also holds for 2.0, I can’t say yet. Last week I migrated the first bigger systems to 2.0; the next weeks will show whether it works there too.