CMK version: Checkmk Raw 2.3.0p2
OS version: Debian 12
After the update to 2.3, none of our graphs are displaying. Updating to p2 has not fixed the problem.
Fragments of graphs do show up intermittently, which seems to depend on how far process_perfdata.pl gets before it times out.
So far I have tried the following:
- Increased CPU cores from 16 to 32 and then to 64.
- Checked that disk activity is not close to saturation.
- Increased TIMEOUT to 59 in ~/etc/pnp4nagios/process_perfdata.cfg; much higher values don't work either.
- Increased npcd_max_threads to 30 in ~/etc/pnp4nagios/npcd.cfg.
The NPCD log is full of:
[05-16-2024 11:07:11] NPCD: ERROR: Command line was '/omd/sites/nlhaa1/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/nlhaa1/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850335'
[05-16-2024 11:07:11] NPCD: ERROR: Executed command exits with return code '7'
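To see how far behind the spool processing is, it may help to compare the NPCD log timestamp with the age of the spool file itself. The following is a minimal Python sketch, assuming the numeric suffix of the spool file name is the Unix time at which the file was written (that is my assumption, not something confirmed by the log). The helper name `spool_file_time` is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Assumption: "perfdata.<digits>" carries the Unix time the spool file was created.
SPOOL_RE = re.compile(r"perfdata\.(\d+)")

def spool_file_time(line):
    """Return the UTC datetime encoded in a spool file name, or None if absent."""
    match = SPOOL_RE.search(line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)

# Path taken from the NPCD error line quoted above.
line = "-b /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850335"
print(spool_file_time(line))
```

The printed time is in UTC, so it has to be converted to the server's local time zone before comparing it with the NPCD log timestamp.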
perfdata.log contains many lines such as:
rrdtool update returns 256
Sometimes the return code is 512.
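If these values are raw wait statuses as returned by Perl's system() (an assumption on my part, I have not confirmed it in the process_perfdata.pl source), the actual exit code sits in the high byte. A small Python sketch of that decoding, with a hypothetical helper name:

```python
def decode_wait_status(status):
    """Split a 16-bit POSIX wait status into exit code and terminating signal."""
    return {
        "exit_code": (status >> 8) & 0xFF,  # high byte: exit code of the child
        "signal": status & 0x7F,            # low 7 bits: signal number, 0 if none
    }

for raw in (256, 512):
    print(raw, decode_wait_status(raw))
```

Under that assumption, 256 would correspond to exit code 1 and 512 to exit code 2, with no signal involved in either case.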
When the timeout occurs we get this:
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Timeout after 59 secs. ***
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850380-PID-3272374 deleted
2024-05-16 11:08:25 [3272374] [0] *** Timeout while processing Host: "####" Service: "Postfix_Queue"
2024-05-16 11:08:25 [3272374] [0] *** process_perfdata.pl terminated on signal ALRM
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Timeout after 59 secs. ***
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850395-PID-3272376 deleted
2024-05-16 11:08:25 [3272376] [0] *** Timeout while processing Host: "####" Service: "fs__hispeed-storage"
2024-05-16 11:08:25 [3272376] [0] *** process_perfdata.pl terminated on signal ALRM
2024-05-16 11:08:25 [3272379] [0] *** TIMEOUT: Timeout after 59 secs. ***
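To check whether the timeouts cluster on particular hosts or services, the "Timeout while processing" lines can be tallied. This is a rough Python sketch based only on the line format shown above; the function name `count_timeouts` is made up, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches the host and service in lines like:
#   *** Timeout while processing Host: "####" Service: "Postfix_Queue"
TIMEOUT_RE = re.compile(r'Timeout while processing Host: "([^"]*)" Service: "([^"]*)"')

def count_timeouts(lines):
    """Count timeout lines per (host, service) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

sample = [
    '2024-05-16 11:08:25 [3272374] [0] *** Timeout while processing Host: "####" Service: "Postfix_Queue"',
    '2024-05-16 11:08:25 [3272374] [0] *** process_perfdata.pl terminated on signal ALRM',
    '2024-05-16 11:08:25 [3272376] [0] *** Timeout while processing Host: "####" Service: "fs__hispeed-storage"',
]

for (host, service), hits in count_timeouts(sample).most_common():
    print(f"{hits:5d}  {host}  {service}")
```

The service named in each line is only the one being processed when the alarm fired, so an even spread across services would point to general slowness rather than to one problematic RRD.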
We have been using CheckMK for nearly 10 years and there hasn't been any recent increase in the number of services monitored.
Is there anything else I can try?