CPU usage after upgrade to 2.0.0p4

After upgrading CEE from 1.6.0p24 to 2.0.0p4, my CPU usage is much higher.

Fetcher helper usage is also constantly at 90-100%.

The server runs CentOS 7.9.2009 with 4 CPUs and 8 GB RAM, monitoring 400 hosts / 23000 services.
The agents on the monitored servers are versions 1.6.0p9-18.

In the output of top I see:
/omd/sites/chechmk/bin/cmk --checker
/omd/sites/chechmk/bin/fetcher (multiple times)

Is this normal?
What does the fetcher helper do, and how can I help?

With this version you have two different helper processes: fetchers and checkers.
Fetchers only collect the data from the agents/devices and pass it to the checkers, which then evaluate it.
Can you check how many of each helper type are configured on your system?
The global settings are → Maximum concurrent Checkmk fetchers/checkers
You can also switch back to the old model by turning off “Use separate fetchers and checkers”.
That lets you test whether the load differs between the two models.
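
To make the split between the two helper types more concrete, here is a minimal toy model in Python. It is not Checkmk's implementation: `run_pipeline`, the thread-based workers, and the 10 ms "fetch" are all hypothetical stand-ins. It only illustrates that fetchers mostly wait on I/O, so their number limits throughput when fetches are slow.

```python
import queue
import threading
import time


def run_pipeline(hosts, n_fetchers, n_checkers, fetch_seconds=0.01):
    """Toy model: fetchers wait on I/O, checkers evaluate what was fetched."""
    todo, fetched = queue.Queue(), queue.Queue()
    results, lock = {}, threading.Lock()

    def fetcher():
        # Pull host names until the None sentinel arrives.
        while (host := todo.get()) is not None:
            time.sleep(fetch_seconds)  # stands in for waiting on an agent or SNMP device
            fetched.put((host, f"raw data from {host}"))

    def checker():
        while (item := fetched.get()) is not None:
            host, raw = item
            with lock:
                results[host] = "OK" if raw else "UNKNOWN"  # placeholder evaluation

    fetchers = [threading.Thread(target=fetcher) for _ in range(n_fetchers)]
    checkers = [threading.Thread(target=checker) for _ in range(n_checkers)]
    for t in fetchers + checkers:
        t.start()
    for host in hosts:
        todo.put(host)
    for _ in fetchers:
        todo.put(None)  # one stop sentinel per fetcher
    for t in fetchers:
        t.join()
    for _ in checkers:
        fetched.put(None)  # stop the checkers once all fetches are done
    for t in checkers:
        t.join()
    return results


# Hypothetical demo: the same 20 hosts with few and with many fetchers.
hosts = [f"host{i}" for i in range(20)]
for n in (2, 10):
    start = time.perf_counter()
    run_pipeline(hosts, n_fetchers=n, n_checkers=2)
    print(f"{n} fetchers: {time.perf_counter() - start:.2f} s")
```

In this sketch the run with more fetchers should finish sooner, because the waiting overlaps. The real helpers are separate processes, so the memory cost per helper is much higher than for these threads.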

In your case, with fetcher usage at nearly 100%, you need more fetchers.

These are my settings; I have never changed them:
[image]

When I switched back to the old model (“Use separate fetchers and checkers” turned off), CPU usage dropped by 10%, but stale services rose to 20%.

My statistics are:
[image]

Latency looks OK. I would decrease the checker helpers a little and increase the fetcher helpers.
You can also look at the graphs from “OMD nagios performance” to see how high your helper usage was with 1.6.

Thank you for your help with helpers, I will try it.

Have you also noticed that CPU usage grows this much when going from 1.6 to 2.0?

We are facing the same problem with 2.0.0p3 Enterprise.

CPU-wise we are OK. Is the solution to increase the number of concurrent fetcher helpers?

What is the solution here? Is there a consensus? Lukasz, have you solved your case?

Thank you,
Oskar

Moving from the combined helpers of 1.6 to separate helpers for fetching and checking can cause problems. You need to test and find the sweet spot between both.
If you only have agents with a short runtime, you need fewer fetchers than for the same number of hosts all checked via SNMP.
I don't think there is a general solution or universal advice.
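
As a rough starting point for that testing, the dependence on fetch runtime can be estimated with Little's law: the average number of fetches in flight is hosts × fetch time / check interval. The sketch below is only a back-of-the-envelope estimate, not a Checkmk formula; `fetchers_needed`, the 60-second interval, the 80% target usage, and the 1 s and 5 s fetch times are all assumptions for illustration.

```python
import math


def fetchers_needed(hosts, fetch_seconds, interval_seconds=60.0, target_usage=0.8):
    """Little's law estimate of concurrent fetchers for a given target usage."""
    # Average number of fetches that are in flight at any moment.
    busy = hosts * fetch_seconds / interval_seconds
    # Leave headroom so the helpers are not pinned at 100%.
    return math.ceil(busy / target_usage)


# Hypothetical fetch times: fast agents vs. slow SNMP devices, 400 hosts each.
print(fetchers_needed(400, fetch_seconds=1.0))
print(fetchers_needed(400, fetch_seconds=5.0))
```

Under these assumptions the same 400 hosts need about 9 fetchers with 1-second fetches and about 42 with 5-second fetches. Measured fetch times from your own site should replace the guesses before you trust the result.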

Yes, I increased “Maximum concurrent Checkmk fetchers” in the global settings from 13 to 20, and it works.

Same issue for me after updating to Checkmk CEE 2.0.0p12, with 1.3K SNMP hosts and 24K services. I had to gradually raise the number of fetcher helpers from 13 to 200!
Fetcher helper usage has now dropped from 99% to 54% :grinning:
Meanwhile, the monitoring core's service check rate rose from 33 checks/sec to 340/sec :grin:
Not a big surprise: now my Checkmk VM's memory is critical :laughing:
But I know the sysadmin guys well :wink:
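
To put those check rates in perspective, a quick calculation shows how long one full pass over all services takes at each rate. This is a simplification: it treats "24K" as exactly 24000 services and assumes every service is checked once per pass; `full_cycle_seconds` is a hypothetical helper name.

```python
def full_cycle_seconds(services, checks_per_second):
    """Time needed to check every service once at a steady check rate."""
    return services / checks_per_second


for rate in (33, 340):
    seconds = full_cycle_seconds(24_000, rate)
    print(f"{rate} checks/sec -> {seconds:.0f} s per full pass")
```

At 33 checks/sec a full pass takes about 727 seconds (roughly 12 minutes), while at 340 checks/sec it takes about 71 seconds, which is close to a one-minute check interval.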