CheckMK version: 2.3.0p1.cre RAW edition
OS version: RHEL 8.9
Hosts: 160
Services: 4400
CPU: 8 cores initially, then upgraded to 12 cores
RAM: 16 GB
We are seeing multiple symptoms after upgrading from 2.2.0p24.cre. The first things we noticed were:
- A massive increase in CPU load. Before the upgrade the 15-minute load average was 4 - 8 (when we had 8 cores); after the upgrade it is 9 - 14.
This showed up as many entries of this type in top:
2971476 MAM 20 0 94904 62184 19080 S 23.8 0.4 0:00.72 /omd/sites/MAM/bin/python3 /omd/sites/MAM/var/check_mk/core/helper_config/latest/host_checks/akvcsxcode03
2971988 MAM 20 0 94768 62116 19256 S 22.8 0.4 0:00.69 /omd/sites/MAM/bin/python3 /omd/sites/MAM/var/check_mk/core/helper_config/latest/host_checks/akvcsxcode04
2971989 MAM 20 0 94688 62196 19308 S 22.5 0.4 0:00.68 /omd/sites/MAM/bin/python3 /omd/sites/MAM/var/check_mk/core/helper_config/latest/host_checks/akvcsxcode05
- Graphing would regularly stop after being up for maybe 10 - 15 minutes. When it is working we see only one rrdcached process for each of the two sites on this server, like this:
broadca+ 2681 0.0 0.1 786220 26744 ? Ssl May10 1:41 /omd/sites/broadcast/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 -s broadcast -m 660 -l unix:/omd/sites/broadcast/tmp/run/rrdcached.sock -p /omd/sites/broadcast/tmp/rrdcached.pid -j /omd/sites/broadcast/var/rrdcached -o /omd/sites/broadcast/var/log/rrdcached.log
When it dies we see hundreds of child processes belonging to rrdcached as well.
The only way we are keeping it even slightly stable is by running with rrdcached stopped, so we do not get any graphs but still get alerts and can see the current values being returned.
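For anyone wanting to watch for the runaway rrdcached state described above, here is a minimal sketch that counts rrdcached processes per user from `ps aux` style output. The function name `count_rrdcached` and the sample text are illustrative only, and it assumes the standard 11-column `ps aux` layout seen in the line pasted above.

```python
from collections import Counter

def count_rrdcached(ps_aux_output):
    """Count processes per user whose executable is rrdcached.

    Assumes `ps aux` layout: 10 fixed columns followed by the command line.
    """
    counts = Counter()
    for line in ps_aux_output.splitlines():
        fields = line.split(None, 10)  # keep the command line as one field
        if len(fields) < 11:
            continue  # blank or malformed line
        user, command = fields[0], fields[10]
        executable = command.split()[0]
        if executable.endswith("/rrdcached"):
            counts[user] += 1
    return counts

# Sample built from the healthy-state line shown in this post.
SAMPLE_PS = (
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
    "broadca+ 2681 0.0 0.1 786220 26744 ? Ssl May10 1:41 "
    "/omd/sites/broadcast/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 "
    "-s broadcast -m 660 -l unix:/omd/sites/broadcast/tmp/run/rrdcached.sock "
    "-p /omd/sites/broadcast/tmp/rrdcached.pid -j /omd/sites/broadcast/var/rrdcached "
    "-o /omd/sites/broadcast/var/log/rrdcached.log\n"
)

for user, n in count_rrdcached(SAMPLE_PS).items():
    print(f"{user}: {n} rrdcached process(es)")
```

On a live system the input could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. A count well above one per site would correspond to the bad state.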
- After reading some other threads I see people looking at /opt/omd/sites/MAM/var/pnp4nagios/log/npcd.log. There are some worrying entries in there as well:
[05-10-2024 13:57:05] NPCD: Logfile rotated!
[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'
[05-10-2024 13:57:07] NPCD: ERROR: Command line was '/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306175'
[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'
[05-10-2024 13:57:07] NPCD: ERROR: Command line was '/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306145'
[05-10-2024 13:58:02] NPCD: ERROR: Executed command exits with return code '25'
[05-10-2024 13:58:02] NPCD: ERROR: Command line was '/omd/sites/MAM/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/MAM/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/MAM/var/pnp4nagios/spool//perfdata.1715306235'
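To get a feel for how often these errors occur, a minimal sketch that tallies the return codes in npcd.log might look like this. The regular expression assumes exactly the line format shown above, and `count_return_codes` is just an illustrative name.

```python
import re
from collections import Counter

# Matches lines like:
# [05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'
RC_RE = re.compile(
    r"^\[[^\]]+\] NPCD: ERROR: Executed command exits with return code '(\d+)'"
)

def count_return_codes(lines):
    """Tally NPCD error return codes from an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RC_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Sample lines copied from the log excerpt in this post.
SAMPLE_LOG = [
    "[05-10-2024 13:57:05] NPCD: Logfile rotated!",
    "[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'",
    "[05-10-2024 13:57:07] NPCD: ERROR: Executed command exits with return code '25'",
    "[05-10-2024 13:58:02] NPCD: ERROR: Executed command exits with return code '25'",
]

for code, n in count_return_codes(SAMPLE_LOG).most_common():
    print(f"return code {code}: {n} occurrence(s)")
```

To run it against the real file, pass an open file handle instead of the sample list, for example `count_return_codes(open(path, errors="replace"))` with `path` pointing at the npcd.log location mentioned above.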
I tried to roll back, but it said that downgrading to a lower minor version is not allowed unless you know what you're doing. So I was not game to try that unless there is some good documentation about it. I did do a site backup from the GUI a few days ago while on the old version. What will happen if I try to restore that? Will it also roll back to the earlier version?