Checkmk fetcher helper usage at almost 100%

CMK Version 2.2.0p35cee
OS version: RHEL 7.9

Error message: None

Hi All,
I have been facing this issue since I updated to 2.1, and we are on 2.2.0p35cee now. Agent versions are 2.0.0p33 on the Linux hosts and 2.2.0p35 on the Windows hosts.
I have no problems with the Checkmk fetcher helpers on my servers (distributed setup) except for two of them. These two servers have fewer services and fewer hosts than the others, with the same configuration. I tried a lot of things and searched the forums. I raised the number of helpers, even to 500, but it made no difference. I get no errors in the debug logs.
I am stuck here. Monitoring seems to work reasonably well. We tried adding memory and CPU, but I don’t want to add any more, because the problem is limited to these two servers, and they surely have enough resources if the others have no problems at all.
I took one of them out of the distributed setup to troubleshoot (again…), but nothing I do seems to have an impact.



Any help would be appreciated.

With kind regards,

Cornelis

Hi @cornelvis,

in this KB article you can find helpful hints to adjust the performance of checkmk:

https://checkmk.atlassian.net/wiki/spaces/KB/pages/9473747/How-to+adjust+Checkmk+performance

Regards,
Petra

Hi Petra,
Thanks for the quick reply, I will dig into it.

I read the KB article, but it doesn’t seem to help much; I had already checked a lot of it.
I ran the query for the slow-running active checks, but I am not sure what counts as long.

lq "GET services\nColumns: execution_time host_name display_name" | awk -F';' '{ printf("%.2f %s %s\n", $1, $2, $3)}' | sort -rn | head -20
16.82 <hostname> Check_MK
12.97 <hostname> Check_MK HW/SW Inventory
12.59 <hostname> Check_MK
12.27 <hostname> Check_MK
11.29 <hostname> Check_MK
11.09 <hostname> Check_MK
10.66 <hostname> Check_MK Discovery
10.52 <hostname> Check_MK HW/SW Inventory
10.51 <hostname> Check_MK
10.26 <hostname> Check_MK
10.22 <hostname> Check_MK
9.61 <hostname> Check_MK HW/SW Inventory
9.51 <hostname> Check_MK
9.42 <hostname> Check_MK
9.32 <hostname> Check_MK HW/SW Inventory
9.15 <hostname> Check_MK Discovery
9.08 <hostname> Check_MK
9.05 <hostname> Check_MK
8.92 <hostname> Check_MK
8.86 <hostname> Check_MK Discovery

Hi @cornelvis,

can you please share how much memory and how many CPUs your affected monitoring server currently has?

Thanks in advance!

Regards
Norm

Hi @Norm
Of course I can.
16 CPU
40GB RAM
They are VMs on VMware vSphere.

Regards,

Cornelis

Two points

  • 16 cores should be more than enough for 300 hosts
  • if you see “Check_MK Discovery” and “Check_MK HW/SW Inventory” checks running all the time, please check the “check interval” for these services. Normally they run only every 2, 4 or 6 hours or so, not at every normal check interval.

Your system looks like it runs these checks more often than needed.
I would also inspect the “Check_MK” service runtime for hosts that are not SNMP devices.
Agent-based hosts should not take longer than 1-3 seconds. Your screen shows much longer runtimes for the “Check_MK” service, which can also lead to problems.
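
As a rough illustration of the 1-3 second rule of thumb above, the output of the `lq` query earlier in the thread could be filtered for slow “Check_MK” services. This is only a sketch: the host names in the sample are made up, and the 3-second threshold is simply the upper end of the range mentioned in the reply.

```python
# Threshold taken from the reply above: agent-based hosts should finish
# the "Check_MK" service in roughly 1-3 seconds.
SLOW_SECONDS = 3.0


def slow_check_mk(lines, threshold=SLOW_SECONDS):
    """Return (seconds, host) pairs whose Check_MK service exceeds threshold.

    Each line is expected in the form printed by the awk command in the
    thread: "<execution_time> <host_name> <display_name>".
    """
    slow = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        seconds, host, service = parts
        # Only the main "Check_MK" service is checked here; Discovery and
        # HW/SW Inventory are expected to be slower and to run far less often.
        if service.strip() == "Check_MK" and float(seconds) > threshold:
            slow.append((float(seconds), host))
    return slow


# Hypothetical sample data; real input would be the query output.
sample = [
    "16.82 host-a Check_MK",
    "12.97 host-b Check_MK HW/SW Inventory",
    "2.10 host-c Check_MK",
    "10.66 host-d Check_MK Discovery",
]
for seconds, host in slow_check_mk(sample):
    print(f"{host}: {seconds:.2f}s")
```

With the sample above, only host-a is reported, because its “Check_MK” run takes longer than 3 seconds.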


This made my day!
I created two rules now:


Now the OMD performance check is green again.

If I did something wrong or you have concerns about the above, please let me know. I know my way around, but I sometimes get lost in all the options in check_mk, and I still see myself as a rookie, although I work with it every day :wink:

Best wishes already for 2025!

With kind regards,

Cornelis

A 15-minute check interval for the “Check_MK” service itself is not good, because all your services that are passive checks will then only be checked every 15 minutes.
As I said, I would only check the interval for the “Check_MK Discovery” and “Check_MK HW/SW Inventory” services.
For the HW/SW inventory you should have a rule in your system like the one here.


The discovery check interval is normally defined by the rule “Periodic service discovery”.

In your last screenshot you have a completely different number of hosts than before: 796 vs. 289.
With around 800 hosts, 16 cores can also be a little low.
With a Check_MK service runtime of around 3 seconds: 3 * 800 = 2400 seconds needed, / 16 cores → 150 seconds for a full check cycle. With an average runtime of 1 second (800 / 16 = 50 seconds per cycle), the 16 cores would be enough.

Conclusion: more CPU cores are needed for a practical check interval of your services.
I would normally not raise the interval to more than 2 minutes for the services.
For special services you can make a different configuration, but the Check_MK service should be in the range of 1-3 minutes.
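
The back-of-the-envelope estimate above can be written down as a small function. It is a sketch of that simplification only: it assumes each core handles one check at a time and that nothing else competes for CPU, which is the model used in the reply, not a statement about how Checkmk actually schedules its helpers.

```python
def full_cycle_seconds(hosts, avg_runtime_s, cores):
    """Seconds needed to run the Check_MK service once on every host.

    Simplified model from the reply above: one check per core at a time,
    no other load on the machine.
    """
    return hosts * avg_runtime_s / cores


# Figures from the thread: about 800 hosts on 16 cores.
print(full_cycle_seconds(800, 3, 16))  # 150.0
print(full_cycle_seconds(800, 1, 16))  # 50.0
```

Under this model, a 3-second average runtime cannot fit into a 1-minute check interval on 16 cores, while a 1-second average can.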

Thanks for that. It’s weird that it shows so many hosts on the OMD performance check and something different on the main dashboard of that site… [edit] The screenshot is from the other site I have with the same problem. I will add CPU.

I already had the feeling that the 15 minutes wasn’t really correct. I removed the rule now and the fetcher usage is almost 100% again…

I already had a rule set up for “Check_MK HW/SW Inventory”, and that is set to 1 day (not my doing; it has been like that since before I stepped in).

In “Periodic service discovery” I have a couple of rules, but none of them directly in Main.
I think this one has the most impact, but I didn’t set this one up myself either.

I am not sure what to do now.

Regards,

Cornelis

Hi Cornelis,

Could you check the ‘Time spent waiting for Checkmk agent’ metric of your Check_MK service?
The longer a fetcher needs to fetch the data from your agents, the more fetchers are occupied → the higher the ‘Fetcher Helper Usage’.

I see you have quite a high number of services compared to the number of hosts. Perhaps you have some long-running plugins?
WAN speed can also affect the run time of the fetchers.

regards

Michael

Thanks for that!
I will dig into that.

With kind regards,

Cornelis van Ginkel

Hi Michael,

I tried to find this ‘Time spent waiting for Checkmk agent’ setting, but I can’t find it (or it has not been configured).
Can you please point out where to find or set it?

Thanks in advance.

Regards,

Cornelis

It’s a metric in the service Check_MK.

Ah sorry for that.
On one of the sites I have problems with, it looks like this:


On the other:

What’s the problem? 2.7 sec is quite good, I would say.

It’s still the same fetcher problem. I was just responding to the question, so I posted the metrics as asked.

Sorry for the misunderstanding. Have you checked the time spent waiting for all agents on this site?
The problem is the sum of all agent times on the site.

regards

Mike
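
The point about summing agent times can be sketched roughly as follows. The example parses Nagios-style performance data strings (as returned by the Livestatus `perf_data` column) and estimates how many helpers are busy on average. Several things here are assumptions to verify on your own site: the metric name `cmk_time_agent` for ‘Time spent waiting for Checkmk agent’, the 60-second check interval, and the estimate itself, which is a simple queueing approximation and not the formula Checkmk uses for ‘Fetcher helper usage’.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def metric_value(perf_data, metric):
    """Extract one metric from a perf_data string like "a=1.5;2;3 b=0.2".

    Returns 0.0 if the metric is missing. Labels containing spaces
    (quoted labels) are not handled in this sketch.
    """
    for item in perf_data.split():
        label, _, rest = item.partition("=")
        if label == metric:
            match = _NUMBER.match(rest)
            if match:
                return float(match.group())
    return 0.0


def average_busy_helpers(perf_data_rows, check_interval_s=60.0,
                         metric="cmk_time_agent"):
    """Estimate the average number of helpers occupied waiting for agents.

    Approximation: total waiting time per check cycle divided by the
    length of the cycle. "cmk_time_agent" is an assumed metric name.
    """
    total_wait = sum(metric_value(row, metric) for row in perf_data_rows)
    return total_wait / check_interval_s


# Illustrative input: 800 hosts, each waiting 2.7 s for its agent.
rows = ["execution_time=2.8 cmk_time_agent=2.7"] * 800
print(f"{average_busy_helpers(rows):.1f} helpers busy on average")
```

If the estimate is close to or above the number of fetcher helpers actually in effect on the site, a usage near 100% is expected even when every single host looks fine on its own.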

After sitting down with our new support partner, we found the culprit. The global setting for the Checkmk fetchers gets overridden by the site-specific settings in the distributed monitoring setup. We configured more fetchers there and all is OK now. I couldn’t find this in the documentation; maybe it is there somewhere, but…

Thanks all for the responses.

With kind regards,

Cornelis
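
The precedence described in the solution can be pictured as a simple dictionary merge. This is only a model of the behaviour reported above, not Checkmk code: the key name `max_concurrent_fetchers` and the site-specific value are hypothetical, while 500 is the global value mentioned at the start of the thread.

```python
def effective_settings(global_settings, site_specific):
    """Return the settings a site ends up with.

    Site-specific values win over global values for the same key,
    as reported in the solution above.
    """
    return {**global_settings, **site_specific}


def overridden_keys(global_settings, site_specific):
    """List the global settings that a site silently overrides."""
    return sorted(global_settings.keys() & site_specific.keys())


# Hypothetical key name and site-specific value, for illustration only.
global_settings = {"max_concurrent_fetchers": 500}
site_specific = {"max_concurrent_fetchers": 13}

print(effective_settings(global_settings, site_specific))
print(overridden_keys(global_settings, site_specific))
```

In this model, raising the global value has no effect on a site that carries its own value for the same setting, which matches the symptom that increasing the helpers to 500 made no difference.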
