CheckMK 2.3 Raw Graphs Fail, rrdtool errors and NPCD timeout

CMK version: Checkmk Raw 2.3.0p2
OS version: Debian 12

After the update to 2.3, none of our graphs were displaying, and updating to p2 has not fixed the problem.
We do see partial graphs intermittently, but that seems related to how far process_perfdata.pl gets before it times out.
So far I have tried the following:

  • Increased CPU cores from 16 to 32 to 64.
  • Checked that disk activity is nowhere near saturation.
  • Set TIMEOUT = 59 in ~/etc/pnp4nagios/process_perfdata.cfg; much higher values don’t help either.
  • Set npcd_max_threads = 30 in ~/etc/pnp4nagios/npcd.cfg.
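For reference, the two config tweaks from the list above look like this in the respective files (values exactly as tried here, not recommendations):

```
# ~/etc/pnp4nagios/process_perfdata.cfg
TIMEOUT = 59

# ~/etc/pnp4nagios/npcd.cfg
npcd_max_threads = 30
```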

NPCD log is full of:

[05-16-2024 11:07:11] NPCD: ERROR: Command line was '/omd/sites/nlhaa1/lib/pnp4nagios/process_perfdata.pl -n -c /omd/sites/nlhaa1/etc/pnp4nagios/process_perfdata.cfg -b /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850335'
[05-16-2024 11:07:11] NPCD: ERROR: Executed command exits with return code '7'

perfdata.log contains many lines such as:

rrdtool update returns 256

Sometimes the return code is 512.
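These values look like raw 16-bit wait statuses as Perl’s system() leaves them in $?: the child’s real exit code sits in the high byte, so 256 corresponds to rrdtool exiting with code 1 and 512 with code 2. A minimal Python sketch of the decoding (the helper name is illustrative, not from CheckMK):

```python
# Decode a 16-bit wait status (as reported by Perl's $? after system())
# into (exit_code, signal). Helper name is illustrative, not CheckMK code.

def decode_wait_status(status: int) -> tuple[int, int]:
    exit_code = (status >> 8) & 0xFF  # high byte: child's exit code
    signal = status & 0x7F            # low 7 bits: terminating signal, if any
    return exit_code, signal

print(decode_wait_status(256))  # (1, 0): rrdtool exited with code 1
print(decode_wait_status(512))  # (2, 0): rrdtool exited with code 2
```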
When the timeout occurs we get this:

2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Timeout after 59 secs. ***
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2024-05-16 11:08:25 [3272374] [0] *** TIMEOUT: /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850380-PID-3272374 deleted
2024-05-16 11:08:25 [3272374] [0] *** Timeout while processing Host: "####" Service: "Postfix_Queue"
2024-05-16 11:08:25 [3272374] [0] *** process_perfdata.pl terminated on signal ALRM
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Timeout after 59 secs. ***
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2024-05-16 11:08:25 [3272376] [0] *** TIMEOUT: /omd/sites/nlhaa1/var/pnp4nagios/spool//perfdata.1715850395-PID-3272376 deleted
2024-05-16 11:08:25 [3272376] [0] *** Timeout while processing Host: "####" Service: "fs__hispeed-storage"
2024-05-16 11:08:25 [3272376] [0] *** process_perfdata.pl terminated on signal ALRM
2024-05-16 11:08:25 [3272379] [0] *** TIMEOUT: Timeout after 59 secs. ***

We have been using CheckMK for nearly 10 years and there hasn’t been any recent increase in number of services monitored.

Is there anything else I can try?

Hi,

there seems to be a bug affecting a lot of users currently.

@martin.hirschvogel Are you guys aware of the issue?

I ended up restoring from backup to 2.2; 2.3 was more or less unusable for us because of this.

We were originally running on 4 cores, and even increasing to 16 was not enough - similar to your 4x increase.

The increased load was generally quite ‘bonkers’, causing various checks to time out, missing graphs, etc.

Back on 2.2 now, everything is fine again.

Yeah, I was trying to avoid going back to a snapshot :slight_smile: Other than the issue mentioned by @aeckstein, I wasn’t confident that many people were having the problem.

Also this (linked report), albeit I’m on Raw, not Enterprise.

But I just cannot justify the 16-core Linode cost to run this system, and it didn’t seem to help with the graphs anyway; even then, the load increase was crazy.

Aside from the various points you note in your original post, I also tried removing all perf data and starting afresh, in case there was some form of erroneous or corrupted data - which didn’t help either.

That will not help. I would only try to disable performance data processing inside the “Master switch” and then have a look at the CPU performance.

Yep tried emptying backlogs too. Load is only at 9 with 64 cores and disk activity is fine.

Just tried that and it doesn’t seem to have had a noticeable effect. Maybe slightly lower. Load is around 8 now.

Now do you see only the Python processes consuming the CPU?

rrdcached is using a fair bit, followed by nagios.

perf top gives me this:

  11.99%  libpython3.12.so.1.0                              [.] _PyEval_EvalFrameDefault.cold
   3.32%  libpython3.12.so.1.0                              [.] gc_collect_main
   2.95%  libpython3.12.so.1.0                              [.] _Py_dict_lookup
   2.32%  libpython3.12.so.1.0                              [.] deduce_unreachable
   1.76%  [kernel]                                          [k] _compound_head
   1.75%  ld-linux-x86-64.so.2                              [.] __tls_get_addr
   1.30%  libpython3.12.so.1.0                              [.] visit_reachable
   1.17%  [kernel]                                          [k] clear_page_rep
   1.04%  [kernel]                                          [k] next_uptodate_page

We will also look into it. But keep it coming here, any real data helps us to understand more about the issue.


rrdcached should only consume noticeable CPU when someone is using the GUI.
Nagios being busy is fine, since it is executing the configured checks.

Do the return codes from rrdtool mean anything? I can’t find much about them. Do non-zero return codes suggest an error?

Hi,
could you please share the following log files with us:
var/pnp4nagios/log/npcd.log
var/pnp4nagios/log/perfdata.log
var/log/rrdcached.log

Also, if you check the contents of var/pnp4nagios/spool/, are there any files that look suspiciously old or suspiciously big? Sorry, I can’t be more specific right now, since I am not sure either what exactly we are looking for.
Best
Jörg


Thanks for sending the files. Please increase the log level of process_perfdata.pl in etc/pnp4nagios/process_perfdata.cfg:

#
# Loglevel 0=silent 1=normal 2=debug
#
LOG_LEVEL = 2

Then please restart the site. We should then get much more output in process_perfdata.log. What I am interested in: Do you get lines such as this one:

2024-05-17 09:09:21 [199994] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.

Thanks!


Yes! Every call looks like this:

2024-05-17 13:41:43 [262158] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.
2024-05-17 13:41:43 [262158] [2] /omd/sites/nlhaa1/bin/rrdtool update --daemon=unix:/omd/sites/nlhaa1/tmp/run/rrdcached.sock /omd/sites/nlhaa1/var/pnp4nagios/perfdata/####/Memory_commit_limit.rrd 1715946037:8392101888
2024-05-17 13:41:43 [262160] [1] rrdtool update returns 0

Ok. I think we have a workaround. Please follow these steps:

  1. Stop your site.
  2. Copy the following files from a current 2.2 installation to your 2.3 installation:
$ sudo cp -r /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp
$ sudo cp -r /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs
$ sudo cp /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm
$ sudo cp /opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/RRDp.pm /opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/RRDp.pm

Note that the file layout might be different on your system, particularly with regard to the folder x86_64-linux-gnu-thread-multi. The following command tells you which files and folders you need to copy:

$ sudo find /opt/omd/versions/<2.2.0-version>/ -name "*RRD*"
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs/RRDs.so
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm
/opt/omd/versions/<2.2.0-version>/lib/perl5/lib/perl5/RRDp.pm

Afterwards, your 2.3 version should have the following files:

$ sudo find /opt/omd/versions/<2.3.0-version>/ -name "*RRD*"
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDp
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/auto/RRDs/RRDs.so
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/RRDs.pm
/opt/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5/RRDp.pm
  3. Add this to /opt/omd/versions/<2.3.0-version>/lib/pnp4nagios/process_perfdata.pl at line 20, which should be empty:
use lib '/omd/versions/<2.3.0-version>/lib/perl5/lib/perl5';
  4. Start your site.

Afterwards, the log lines warning about the RRDs Perl Modules should be gone:

2024-05-17 13:41:43 [262158] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.

Instead, you should see something like this:

2024-05-17 13:07:17 [532808] [2] RRDs::update --daemon=unix:/omd/sites/old/tmp/run/rrdcached.sock /omd/sites/old/var/pnp4nagios/perfdata/old/Check_MK_cmk_time_agent.rrd 1715944032:1.340

If this works, don’t forget to reduce the log level in etc/pnp4nagios/process_perfdata.cfg back to 0.
Best
Jörg


@joerg.herbel Thanks, this has fixed the problem. Everything has returned to normal. The errors have been replaced with the RRDs::update --daemon=unix lines as you expected.
I guess this means rrdcached is being used again?

It means that things are working again as they should :slight_smile: To be clear: rrdcached was used before as well. The issue was that process_perfdata.pl, which is called by the NPCD to feed performance data to rrdcached, was missing the bindings to interact directly with rrdtool. These bindings are the missing files that you copied. So process_perfdata.pl instead launched storms of subprocesses to talk to rrdcached. This caused the performance issues and the gaps in the graphs. process_perfdata.pl was not able to keep up with the incoming spool data, so more and more data piled up, hence more and more subprocesses, and so on.
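The cost gap described here can be sketched outside CheckMK: one fork/exec per metric update is vastly more expensive than an in-process library call. A rough, self-contained Python timing sketch (illustrative only; a no-op function stands in for the RRDs binding and a no-op interpreter launch stands in for the rrdtool subprocess):

```python
import subprocess
import sys
import time

N = 20  # pretend we have 20 perfdata updates to process

# Path 1: in-process "binding" - each update is just a function call.
def update_inprocess():
    pass  # stands in for RRDs::update via the shared library

t0 = time.perf_counter()
for _ in range(N):
    update_inprocess()
inproc = time.perf_counter() - t0

# Path 2: fallback - each update spawns a fresh subprocess,
# like process_perfdata.pl shelling out to rrdtool per data point.
t0 = time.perf_counter()
for _ in range(N):
    subprocess.run([sys.executable, "-c", ""], check=True)
subproc = time.perf_counter() - t0

print(f"in-process: {inproc:.6f}s  subprocess: {subproc:.6f}s")
```

Even at this tiny scale the subprocess path is orders of magnitude slower, which is why the spool backlog snowballed.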


I have done as per the above, but it doesn’t yet seem to be working for me.

rrdtool update returns 0
2024-05-17 20:53:44 [4168079] [2] RRDs Perl Modules are not installed. Falling back to rrdtool system call.
2024-05-17 20:53:44 [4168079] [2] /omd/sites/master/bin/rrdtool update /omd/sites/prod/var/pnp4nagios/perfdata/redacted/Filesystem__export_home_igrace_growth.rrd 1715951208:0
2024-05-17 20:53:44 [4168039] [0] /omd/sites/master/bin/rrdtool update /omd/sites/prod/var/pnp4nagios/perfdata/redacted/Filesystem__export_home_rfrylinck_fs_used_percent.rrd 1715951139:0.000111
find /opt/omd/versions/2.3.0p2.cre/ -name "*RRD*"
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/RRDp.pm
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/RRDs.pm
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs/RRDs
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs/RRDs/RRDs.so
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDs/RRDs.so
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDp
/opt/omd/versions/2.3.0p2.cre/lib/perl5/lib/perl5/x86_64-linux-thread-multi/auto/RRDp/RRDp

However, I still have a huge spool backlog, so I am hoping that once this clears I will see the errors lessen. I will update in the morning.