rrdcached crashing regularly after upgrade to 2.3.0p1

CheckMK version: 2.3.0p1.cre RAW edition
OS version: RHEL 8.9
Hosts: 160
Services: 4400
CPU: 8 cores initially, later upgraded to 12 cores
RAM: 16 GB

We are seeing multiple symptoms after upgrading from 2.2.0p24.cre. The first thing we noticed was:

  1. A massive increase in CPU load. Before the upgrade the 15-minute load average was 4 - 8 (when we had 8 cores); after the upgrade it is 9 - 14.

top showed many, many entries of this type:

2971476 MAM       20   0   94904  62184  19080 S  23.8   0.4   0:00.72 /omd/sites/MAM/bin/python3 /omd/sites/MAM/var/check_mk/core/helper_config/latest/host_checks/akvcsxcode03                                          
2971988 MAM       20   0   94768  62116  19256 S  22.8   0.4   0:00.69 /omd/sites/MAM/bin/python3 /omd/sites/MAM/var/check_mk/core/helper_config/latest/host_checks/akvcsxcode04                                          
2971989 MAM       20   0   94688  62196  19308 S  22.5   0.4   0:00.68 /omd/sites/MAM/bin/python3 /omd/sites/MAM/var/check_mk/core/helper_config/latest/host_checks/akvcsxcode05
  2. Graphing regularly stops after being up for maybe 10 - 15 minutes. When it is working, we see only one rrdcached process for each of the two sites on this server, like this:
broadca+    2681  0.0  0.1 786220 26744 ?        Ssl  May10   1:41 /omd/sites/broadcast/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 -s broadcast -m 660 -l unix:/omd/sites/broadcast/tmp/run/rrdcached.sock -p /omd/sites/broadcast/tmp/rrdcached.pid -j /omd/sites/broadcast/var/rrdcached -o /omd/sites/broadcast/var/log/rrdcached.log

When it dies, we also see hundreds of child processes belonging to rrdcached.

The only way we are keeping it even slightly stable is by letting it run with rrdcached stopped, so we do not get any graphs but do get alerts and can see the current values being returned.
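For anyone wanting to watch for this condition, here is a minimal sketch that counts rrdcached processes per site from the text of a `ps aux` listing. It assumes the extra processes show the same `/omd/sites/<site>/bin/rrdcached` command line as the healthy one quoted above, which may not hold on every system. The function name `count_rrdcached` is made up for illustration.

```python
import re
from collections import Counter

# Matches the rrdcached binary of an OMD site in the command column,
# e.g. "/omd/sites/broadcast/bin/rrdcached -t 4 ...".
# Assumption: child processes carry the same command line as the parent.
RRDCACHED_RE = re.compile(r"/omd/sites/([^/\s]+)/bin/rrdcached(?:\s|$)")


def count_rrdcached(ps_output):
    """Return a Counter mapping site name -> number of rrdcached lines."""
    counts = Counter()
    for line in ps_output.splitlines():
        match = RRDCACHED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Shortened version of the healthy process line from the post.
    sample = (
        "broadca+    2681  0.0  0.1 786220 26744 ?        Ssl  May10   1:41 "
        "/omd/sites/broadcast/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 "
        "-s broadcast -l unix:/omd/sites/broadcast/tmp/run/rrdcached.sock"
    )
    for site, n in count_rrdcached(sample).items():
        status = "ok" if n == 1 else "suspicious"
        print(f"{site}: {n} rrdcached process(es) [{status}]")
```

Going by the behaviour described above, one process per site is the healthy state, so a count well above one would be the signal to look at.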

  3. After reading some other threads, I see people looking at /opt/omd/sites/MAM/var/pnp4nagios/log/npcd.log. There are some worrying entries in there as well:
[05-10-2024 13:57:05] NPCD: Logfile rotated!
[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'
[05-10-2024 13:57:07] NPCD: ERROR: Command line was '/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306175'
[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'
[05-10-2024 13:57:07] NPCD: ERROR: Command line was '/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306145'
[05-10-2024 13:58:02] NPCD: ERROR: Executed command exits with return code '25'
[05-10-2024 13:58:02] NPCD: ERROR: Command line was '/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306235'
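To see how often this happens and which spool files are affected, log lines like the ones above can be summarized with a short script. This is only a sketch that assumes the exact line format quoted here; other NPCD versions may log differently. The function name `summarize_npcd_errors` is hypothetical.

```python
import re
from collections import Counter

# Patterns are taken from the npcd.log lines quoted in this post.
RETURN_CODE_RE = re.compile(
    r"NPCD: ERROR: Executed command exits with return code '(\d+)'"
)
SPOOL_FILE_RE = re.compile(r"NPCD: ERROR: Command line was '.* -b ([^'\s]+)'")


def summarize_npcd_errors(lines):
    """Return (Counter of return codes, list of spool files that failed)."""
    codes = Counter()
    spool_files = []
    for line in lines:
        match = RETURN_CODE_RE.search(line)
        if match:
            codes[int(match.group(1))] += 1
            continue
        match = SPOOL_FILE_RE.search(line)
        if match:
            spool_files.append(match.group(1))
    return codes, spool_files


if __name__ == "__main__":
    sample = [
        "[05-10-2024 13:57:05] NPCD: Logfile rotated!",
        "[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'",
        "[05-10-2024 13:57:07] NPCD: ERROR: Command line was "
        "'/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n "
        "-c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg "
        "-b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306175'",
    ]
    codes, spool_files = summarize_npcd_errors(sample)
    print(dict(codes))
    print(spool_files)
```

Passing it the lines of the real npcd.log (for example via `open(path)`) gives a quick count per return code without scrolling through the file.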

I tried to roll back, but it said that downgrading to a lower minor version is not allowed unless you know what you are doing. So I was not game to try that unless there is some good documentation about it. I did do a site backup from the GUI a few days ago while on the old version. What will happen if I try to restore that? Will it also roll back to the earlier version?

There is a workaround for this, described in:

Thanks. I managed to restore the server I was talking about in the post above from an earlier backup, back to v2.2. I just lost a week or so of data, no big deal. However, at the same time we stood up a second, new server for another team, again with about 150 hosts, and it has only ever had v2.3 on it. Would it work if I copied the files between servers, from 2.2 on my old site to 2.3 for the other team, or are those files specific to the OMD sites?

Just update to 2.3.0p3.
