Any tuning or configuration improvements that I can do to my Check-MK server?

jephilc · April 28, 2021, 12:35pm

Hi everyone, I am using a Dell R640 for my Check-MK server, but it seems to be overkill as it has 20 cores and 256 GB of RAM. Are there any tuning options available, or would I be better off virtualizing it into a smaller VM and making other use of the rest of the server?

I am running v2.0.0p1 Raw

Any thoughts appreciated - thanks

elias.voelker · April 28, 2021, 12:56pm

I bet many of our community members would like to have that problem

jephilc · April 28, 2021, 1:24pm

We wanted to implement a monitoring solution that includes some network switches, up to 100 iDRACs and 100 physical servers, along with a number of VMs. I was asking whether there are any parameters that can be tweaked to improve the performance of the server. My network isn't great and it is busy, but I do notice that a lot of services go stale for a few minutes when I add a new host. I can quite easily virtualize this server and give Check-MK a core and maybe 16 or 32 GB of RAM, which would allow me to create more VMs on the server for other uses. Thanks

rprengel · April 28, 2021, 1:31pm

Hello,
how many sites are planned, and how many SNMP devices should be checked?
Are you planning to use the Event Console and BI?
Ralf

jephilc · April 28, 2021, 3:22pm

Hi Ralf - only 1 site is going to be used, but all the iDRACs (up to 100) and the networking devices (about 10) are using SNMP. I am not sure yet about the Event Console and BI. Thanks for your reply

linux_frickler · April 28, 2021, 4:40pm

I would stay with a single host in your case if possible. Splitting into several sites and a multisite architecture brings new problems (Piggyback, Event Console, …), and your hardware should be capable of handling the load.
Right now we run around 2000 devices per VM before we add another one. One could add more CPU/memory, but activating changes just takes too long with big sites.
Have a look at the helper processes in use and tune them if necessary.

EDIT: Sorry, I just reread your post and saw that you want to use only one site anyway. But if your hosts go stale, it may be that the helper processes are exhausted.

rprengel · April 28, 2021, 5:12pm

Try and cry.
I'm using the perfmeter snap-in to get a quick overview of the site's health.
We had crashes of the VM appliance before the fine-tuning was right.
Ralf

r.sander · April 28, 2021, 6:39pm

When using the Raw edition, I would suggest creating multiple sites with config sync so that multiple Nagios processes are running.

SNMP devices in particular tend to take a long time to query, and Nagios does not scale well with that.

jephilc · April 29, 2021, 7:43am

Hi, many thanks for your response. I thought the helper processes were only in CMC/Enterprise. If Raw has helper processes and they can be tuned, can you please point me to where I can look at their status and how to tune them? Thanks

jplitza · April 29, 2021, 7:46am

I was able to improve performance for the Raw edition substantially by patching the ruleset matching code: see pull request #354 in tribe29/checkmk on GitHub, "ruleset_matcher: Do not look up labels for irrelevant hosts".

andreas-doehler · April 29, 2021, 8:25am

That is correct.
I would also vote for the points from @r.sander as core restart/sync takes some time for RAW systems.

If you want to calculate what your system can handle, I do the following calculation:

Number of cores * 60 seconds → CPU seconds available per minute
Average execution time of your Check_MK services * number of devices → CPU seconds needed to run all checks per minute (if your normal check interval is 1 minute)

If the second value is below the first one then all is fine.
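
A minimal Python sketch of that rule of thumb, for anyone who wants to plug in their own numbers. The function names and the host groups are made up for illustration; the 20 cores and the rough host counts come from this thread, while the per-group average times are assumed values, not measurements.

```python
from typing import Iterable, Tuple

# Each group is (number_of_hosts, average Check_MK execution time in seconds).
HostGroup = Tuple[int, float]


def cpu_seconds_available(cores: int, interval_seconds: float = 60.0) -> float:
    """CPU seconds the machine offers during one check interval."""
    return cores * interval_seconds


def cpu_seconds_needed(groups: Iterable[HostGroup]) -> float:
    """CPU seconds needed to check every host once per interval."""
    return sum(count * avg_seconds for count, avg_seconds in groups)


def has_headroom(cores: int, groups: Iterable[HostGroup],
                 interval_seconds: float = 60.0) -> bool:
    """True if the needed seconds stay below the available seconds."""
    return cpu_seconds_needed(groups) < cpu_seconds_available(cores, interval_seconds)


if __name__ == "__main__":
    # Illustrative assumptions only: replace with averages from your own site.
    groups = [
        (100, 10.0),  # iDRACs via SNMP
        (100, 1.0),   # physical servers
        (10, 3.0),    # network devices
    ]
    print("available:", cpu_seconds_available(20))
    print("needed:   ", cpu_seconds_needed(groups))
    print("headroom: ", has_headroom(20, groups))
```

Note that the rule treats the execution time of the Check_MK service as if it were pure CPU time, so it is a conservative estimate for SNMP hosts that spend most of that time waiting on the network.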

jephilc · April 29, 2021, 8:47am

Thanks Andreas, I will try this and check. One of the issues I had, which really hasn’t helped, is that some of the Dell server iDRACs are about 5 or 6 years old (iDRAC 8 with older firmware). I have had service checks on these SNMP hosts occasionally run for over 13 minutes and then come back with “item not found in monitoring data” for things such as CPU temperatures. When I go to the iDRAC itself and check the values, they are not shown there either, so Check-MK is not wrong; it is only acting on the data it receives from the iDRAC. Part of this issue is therefore most likely caused by the old iDRACs themselves. I had to remove these old iDRAC hosts because they were causing bottlenecks in the checking. It seems to run much better on newer iDRAC9 servers.

I have quite a powerful server to run check-mk on, so I am confident it should be able to handle the number of hosts - the reason for this post was that I was looking for a way to be able to increase the number of processes that check-mk runs in order to do more in parallel when it comes to running these checks. Thanks again for the info.

andreas-doehler · April 29, 2021, 10:56am

As you are using the RAW edition, it runs as many parallel checks as it can.
The maximum number of parallel checks is the number of your host objects.
But as you said, if some hosts are blocking single cores, you will run into a bottleneck of available cores for the workload.
That’s why you should pay attention to the execution times of the Check_MK service.

Normal server (Linux/Win): 0.5-2 seconds
Normal switch: 0.5-5 seconds
Bigger switch/stack: 5-20 seconds
Special agents like NetApp and HP MSA: depends mostly on the size of the device, anything from 1 to 60 seconds is possible
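
As a rough illustration, these typical ranges can be used to flag hosts whose Check_MK service runs longer than expected. The sketch below is only an example: the category keys, the host names, and the idea of filing an iDRAC under the "switch" range are assumptions, and the execution times would have to come from your own site.

```python
from typing import Dict, List, Tuple

# Typical Check_MK execution time ranges in seconds (lower, upper),
# taken from the list above. The dictionary keys are made-up labels.
TYPICAL_RANGES: Dict[str, Tuple[float, float]] = {
    "server": (0.5, 2.0),
    "switch": (0.5, 5.0),
    "big_switch": (5.0, 20.0),
    "special_agent": (1.0, 60.0),
}

# One measurement is (host_name, category, execution_time_in_seconds).
Measurement = Tuple[str, str, float]


def slow_hosts(measurements: List[Measurement],
               ranges: Dict[str, Tuple[float, float]] = TYPICAL_RANGES) -> List[str]:
    """Return the names of hosts that exceed the upper bound for their category."""
    return [
        name
        for name, category, seconds in measurements
        if seconds > ranges[category][1]
    ]


if __name__ == "__main__":
    # Hypothetical sample data; 780 s corresponds to a 13-minute check run.
    sample = [
        ("srv01", "server", 1.2),
        ("idrac07", "switch", 780.0),
        ("core-stack", "big_switch", 12.0),
    ]
    print(slow_hosts(sample))
```

Hosts that show up in such a list are the ones worth looking at first, since each of them ties up a core for as long as its check runs.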

jephilc · April 29, 2021, 11:02am

Thanks, that is useful information. For the majority of the iDRAC service discoveries, especially the newer iDRAC9 servers, the discovery/service checks take around 10 seconds which is fine (using snmp). I am hoping to tech refresh the old iDRAC servers this year with new servers. Thanks again.
