High CPU after 2.1.0p11/13 upgrade from 2.0.0p23

jamesatcamp · October 6, 2022, 2:30pm

I’m pretty sure there’s a bug in the cmc core. I changed the host check interval to 5 minutes (from the 6-second default) and the service check interval to 10 minutes (from the 1-minute default), and cranked up the logging on the core.

After restarting the site, cmc CPU usage is near zero until the service checks run. From that point on it stays near 100 percent, even through the next 10-minute block when nothing is being checked. Watching the process with strace and perf top, it appears to get stuck in some kind of infinite polling loop.
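For anyone who wants to log the same near-zero to near-100-percent jump over time instead of watching top, here is a rough sketch that samples a process's CPU usage from /proc on Linux. The function names and the 5-second window are my own choices for illustration, not anything from Checkmk, and you would have to look up the cmc PID yourself.

```python
import os
import time


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 in proc(5)), which puts
    # utime (field 14) and stime (field 15) at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                seconds: float, ticks_per_second: int) -> float:
    """Convert a tick delta over a time window into percent of one core."""
    return 100.0 * (ticks_after - ticks_before) / ticks_per_second / seconds


def sample(pid: int, seconds: float = 5.0) -> float:
    """Measure CPU usage of `pid` over `seconds` (hypothetical helper)."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, seconds, os.sysconf("SC_CLK_TCK"))
```

Calling `sample(pid)` in a loop and printing the result with a timestamp should make it easy to line the CPU jump up against the check intervals. The counters cover all threads of the process, so values above 100 are possible.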

I’ve watched the logs for a bit and have seen 3-4 repetitions of this:

2022-10-06 09:51:12 [3] [fetcher pool] [service "hostname;Check_MK"] [helper 1614055] [log] [cycle 46, command "130;hostname;checking;60"] memory usage increased from 42.90 MB to 118.02 MB, exiting
2022-10-06 09:51:16 [4] [fetcher pool] cannot send request: Broken pipe
2022-10-06 09:51:19 [3] [fetcher pool] [helper 1614055] exited with status 14
2022-10-06 09:51:19 [5] [fetcher pool] [helper 1631292] started, commandline: /omd/sites/my_site/bin/fetcher

… during each check interval. I watched for a bit and determined that all of the hosts involved are physical hosts with IPMI monitoring enabled. I disabled IPMI monitoring for them, and while the fetcher pool is no longer crashing, the cmc process still reaches 100% CPU. So while there’s a memory leak or other issue in the IPMI implementation, it’s not the cause of the high CPU.
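In case it helps someone check whether their own fetcher helpers are dying the same way, here is a minimal sketch that pulls the "memory usage increased" exits out of log lines in the format pasted above. The regex is based only on those lines, so treat it as an assumption about the log format in general; the function name is made up for this example.

```python
import re
from collections import Counter

# Built from the single log line quoted in this post; other versions
# may format the message differently.
MEMORY_EXIT = re.compile(
    r'\[service "(?P<service>[^"]+)"\] \[helper (?P<pid>\d+)\]'
    r".*memory usage increased from (?P<before>[\d.]+) MB"
    r" to (?P<after>[\d.]+) MB"
)

SAMPLE = [
    '2022-10-06 09:51:12 [3] [fetcher pool] [service "hostname;Check_MK"] '
    '[helper 1614055] [log] [cycle 46, command "130;hostname;checking;60"] '
    "memory usage increased from 42.90 MB to 118.02 MB, exiting",
    "2022-10-06 09:51:16 [4] [fetcher pool] cannot send request: Broken pipe",
    "2022-10-06 09:51:19 [3] [fetcher pool] [helper 1614055] exited with status 14",
    "2022-10-06 09:51:19 [5] [fetcher pool] [helper 1631292] started, "
    "commandline: /omd/sites/my_site/bin/fetcher",
]


def memory_exits(lines):
    """Yield (service, helper_pid, before_mb, after_mb) per matching line."""
    for line in lines:
        match = MEMORY_EXIT.search(line)
        if match:
            yield (
                match["service"],
                int(match["pid"]),
                float(match["before"]),
                float(match["after"]),
            )


if __name__ == "__main__":
    # Tally helper exits per service to see which hosts trigger them.
    per_service = Counter(service for service, *_ in memory_exits(SAMPLE))
    for service, count in per_service.most_common():
        print(f"{count:3d}  {service}")
```

To run it against a real log, replace `SAMPLE` with the lines of the core log file. If every service in the tally belongs to a host with IPMI enabled, that would match what I saw here.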

I’ll continue to try and narrow down what checks/active checks/services might be causing it, and then figure out how to open a ticket.

Thanks