CheckMK 2.3 Raw Graphs Fail, rrdtool errors and NPCD timeout

CMK version: Checkmk Raw 2.3.0p2
OS version: Debian 12

After the update to 2.3, none of our graphs were displaying, and updating to p2 has not fixed the problem.
We do see partial graphs intermittently, but that seems related to how far process_perfdata.pl gets before it times out.
So far I have tried the following:

  • Increased CPU cores from 16 to 32 to 64.
  • Checked that disk activity is nowhere near saturation.
  • Set TIMEOUT = 59 in ~/etc/pnp4nagios/process_perfdata.cfg; much higher values don’t help either.
  • Set npcd_max_threads = 30 in ~/etc/pnp4nagios/npcd.cfg.
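For reference, the two config tweaks from the list above look like this in the respective files (values exactly as tried here, not recommendations):

```
# ~/etc/pnp4nagios/process_perfdata.cfg
TIMEOUT = 59

# ~/etc/pnp4nagios/npcd.cfg
npcd_max_threads = 30
```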

NPCD log is full of:

[05-16-2024 11:07:11] NPCD: ERROR: Command line was '/omd/sites/nlhaa1/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/nlhaa1/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850335'
[05-16-2024 11:07:11] NPCD: ERROR: Executed command exits with return code '7'

perfdata.log contains many lines such as:

rrdtool update returns 256

Sometimes the return code is 512.
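These values look like raw 16-bit wait statuses as Perl’s system() leaves them in $?: the child’s real exit code sits in the high byte, so 256 corresponds to rrdtool exiting with code 1 and 512 with code 2. A minimal Python sketch of the decoding (the helper name is illustrative, not from CheckMK):

```python
# Decode a 16-bit wait status (as reported by Perl's $? after system())
# into (exit_code, signal). Helper name is illustrative, not CheckMK code.

def decode_wait_status(status: int) -> tuple[int, int]:
    exit_code = (status >> 8) & 0xFF  # high byte: child's exit code
    signal = status & 0x7F            # low 7 bits: terminating signal, if any
    return exit_code, signal

print(decode_wait_status(256))  # (1, 0): rrdtool exited with code 1
print(decode_wait_status(512))  # (2, 0): rrdtool exited with code 2
```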
When the timeout occurs we get this:

2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Timeout after 59 secs. ***
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850380-PID-3272374 deleted
2024-05-16 11:08:25 [3272374] [0] *** Timeout while processing Host: "####" Service: "Postfix_Queue"
2024-05-16 11:08:25 [3272374] [0] *** process_perfdata.pl terminated on signal ALRM
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Timeout after 59 secs. ***
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850395-PID-3272376 deleted
2024-05-16 11:08:25 [3272376] [0] *** Timeout while processing Host: "####" Service: "fs__hispeed-storage"
2024-05-16 11:08:25 [3272376] [0] *** process_perfdata.pl terminated on signal ALRM
2024-05-16 11:08:25 [3272379] [0] *** TIMEOUT: Timeout after 59 secs. ***

We have been using CheckMK for nearly 10 years and there hasn’t been any recent increase in number of services monitored.

Is there anything else I can try?

Hi,

there seems to be a bug affecting a lot of users currently.

@martin.hirschvogel Are you guys aware of the issue?

I ended up restoring from backup to 2.2; 2.3 was more or less unusable for us because of this.

We were originally running on 4 cores, and even increasing to 16 was not enough - similar to your 4x increase.

The increased load was generally quite ‘bonkers’, causing various checks to time out, missing graphs, etc.

Back on 2.2 now, everything is fine again.

Yeah, I was trying to avoid going back to a snapshot :slight_smile: Other than the issue mentioned by @aeckstein, I wasn’t confident that many people were having the problem.

Also this (linked report), albeit I’m on Raw, not Enterprise.

But I just cannot justify the 16-core Linode cost to run this system, and it didn’t seem to help with the graphs anyway; even then, the load increase was crazy.

Aside from the various points you note in your original post, I also tried removing all perf data and starting afresh, in case there was some form of erroneous or corrupted data - which didn’t help either.

That will not help. I would only try to disable performance data processing inside the “Master switch” and then have a look at the CPU performance.

Yep tried emptying backlogs too. Load is only at 9 with 64 cores and disk activity is fine.

Just tried that and it doesn’t seem to have had a noticeable effect. Maybe slightly lower. Load is around 8 now.

Now do you see only the Python processes consuming the CPU?

rrdcached is using a fair bit, followed by nagios.

perf top gives me this:

  11.99%  libpython3.12.so.1.0                              [.] _PyEval_EvalFrameDefault.cold
   3.32%  libpython3.12.so.1.0                              [.] gc_collect_main
   2.95%  libpython3.12.so.1.0                              [.] _Py_dict_lookup
   2.32%  libpython3.12.so.1.0                              [.] deduce_unreachable
   1.76%  [kernel]                                          [k] _compound_head
   1.75%  ld-linux-x86-64.so.2                              [.] __tls_get_addr
   1.30%  libpython3.12.so.1.0                              [.] visit_reachable
   1.17%  [kernel]                                          [k] clear_page_rep
   1.04%  [kernel]                                          [k] next_uptodate_page

We will also look into it. But keep it coming here, any real data helps us to understand more about the issue.


rrdcached should only consume noticeable CPU when someone is using the GUI.
Nagios being busy is fine, since it is executing the configured checks.

Do the return codes from rrdtool mean anything? I can’t find much about them. Do non-zero return codes suggest an error?

Hi,
could you please share the following log files with us:
var/pnp4nagios/log/npcd.log
var/pnp4nagios/log/perfdata.log
var/log/rrdcached.log

Also, if you check the contents of var/pnp4nagios/spool/, are there any files that look suspiciously old or suspiciously big? Sorry, I can’t be more specific right now, since I am not sure either what exactly we are looking for.
Best
Jörg


Thanks for sending the files. Please increase the log level of process_perfdata.pl in etc/pnp4nagios/process_perfdata.cfg:

#
# Loglevel 0=silent 1=normal 2=debug
#
LOG_LEVEL = 2

Then please restart the site. We should then get much more output in process_perfdata.log. What I am interested in: Do you get lines such as this one:

2024-05-17 09:09:21 [199994] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.

Thanks!


Yes! Every call looks like this:

2024-05-17 13:41:43 [262158] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.
2024-05-17 13:41:43 [262158] [2] /omd/sites/nlhaa1/bin/rrdtool update --daemon=unix:/omd/sites/nlhaa1/tmp/run/rrdcached.sock /omd/sites/nlhaa1/var/pnp4nagios/perfdata/####/Memory_commit_limit.rrd 1715946037:8392101888
2024-05-17 13:41:43 [262160] [1] rrdtool update returns 0

Ok. I think we have a workaround. Please follow these steps:

  1. Stop your site.
  2. Copy the following files from a current 2.2 installation to your 2.3 installation:
$ sudo cp -r /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp
$ sudo cp -r /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs
$ sudo cp /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm
$ sudo cp /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/RRDp.pm /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/RRDp.pm

Note that the file layout might be different on your system, particularly with regard to the folder x86_64-linux-gnu-thread-multi. The following command tells you which files and folders you need to copy:

$ sudo find /opt/omd/versions/<2.2.0-version>/ -name "*RRD*"
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs/RRDs.so
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/RRDp.pm

Afterwards, your 2.3 version should have the following files:

$ sudo find /opt/omd/versions/<2.3.0-version>/ -name "*RRD*"
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs/RRDs.so
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/RRDp.pm
  3. Add this to /opt/omd/versions/<2.3.0-version>/lib/pnp4nagios/process_perfdata.pl at line 20, which should be empty:
use lib '/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5';
  4. Start your site.

Afterwards, the log lines warning about the RRDs Perl Modules should be gone:

2024-05-17 13:41:43 [262158] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.

Instead, you should see something like this:

2024-05-17 13:07:17 [532808] [2] RRDs::update --daemon=unix:/omd/sites/old/tmp/run/rrdcached.sock /omd/sites/old/var/pnp4nagios/perfdata/old/Check_MK_cmk_time_agent.rrd 1715944032:1.340

If this works, don’t forget to reduce the log level in etc/pnp4nagios/process_perfdata.cfg back to 0.
Best
Jörg


@joerg.herbel Thanks, this has fixed the problem. Everything has returned to normal. The errors have been replaced with the RRDs::update --daemon=unix lines as you expected.
I guess this means rrdcached is being used again?

It means that things are working again as they should :slight_smile: To be clear: rrdcached was used before as well. The issue was that process_perfdata.pl, which is called by the NPCD to feed performance data to rrdcached, was missing the bindings to interact directly with rrdtool. These bindings are the missing files that you copied. So process_perfdata.pl instead launched storms of subprocesses to talk to rrdcached. This caused the performance issues and the gaps in the graphs. process_perfdata.pl was not able to keep up with the incoming spool data, so more and more data piled up, hence more and more subprocesses, and so on.
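The cost gap described here can be sketched outside CheckMK: one fork/exec per metric update is vastly more expensive than an in-process library call. A rough, self-contained Python timing sketch (illustrative only; a no-op function stands in for the RRDs binding and a no-op interpreter launch stands in for the rrdtool subprocess):

```python
import subprocess
import sys
import time

N = 20  # pretend we have 20 perfdata updates to process

# Path 1: in-process "binding" - each update is just a function call.
def update_inprocess():
    pass  # stands in for RRDs::update via the shared library

t0 = time.perf_counter()
for _ in range(N):
    update_inprocess()
inproc = time.perf_counter() - t0

# Path 2: fallback - each update spawns a fresh subprocess,
# like process_perfdata.pl shelling out to rrdtool per data point.
t0 = time.perf_counter()
for _ in range(N):
    subprocess.run([sys.executable, "-c", ""], check=True)
subproc = time.perf_counter() - t0

print(f"in-process: {inproc:.6f}s  subprocess: {subproc:.6f}s")
```

Even at this tiny scale the subprocess path is orders of magnitude slower, which is why the spool backlog snowballed.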


I have done as per the above, but it doesn’t yet seem to be working for me.

rrdtool update returns 0
2024-05-17 20:53:44 [4168079] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.
2024-05-17 20:53:44 [4168079] [2] /omd/sites/master/bin/rrdtool update /omd/sites/prod/var/pnp4nagios/perfdata/redacted/Filesystem__export_home_igrace_growth.rrd 1715951208:0
2024-05-17 20:53:44 [4168039] [0] /omd/sites/master/bin/rrdtool update /omd/sites/prod/var/pnp4nagios/perfdata/redacted/Filesystem__export_home_rfrylinck_fs_used_percent.rrd 1715951139:0.000111
find /opt/omd/versions/2.3.0p2.cre/ -name "*RRD*"
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/RRDp.pm
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/RRDs.pm
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs/RRDs
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs/RRDs/RRDs.so
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs/RRDs.so
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDp
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDp/RRDp

However, I still have a huge spool backlog, so I am hoping that once this clears I will see the errors lessen. I will update in the morning.