Well, it turned out things were not so smooth once the Nagios checks ramped up. After a few hours, even with rrdcached, graph processing lagged behind the checks. Eventually Nagios seems to have stopped scheduling checks altogether.
Next I'm going to try putting the spool directory on tmpfs and turning off journaling in rrdcached.
--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -
-----Original Message-----
From: Mäkelä, Antti
Sent: 7 January 2013 17:03
To: 'Sebastian Grewe'; checkmk-en@lists.mathias-kettner.de
Subject: RE: Process_perfdata.pl works slowly
OK, setting up rrdcached seems to have helped. I raised the flush time from the default 300 seconds to 1800 and now things are smooth again. Thanks for the tip.
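For readers who want to try the same change, here is a minimal sketch of what the daemon's command line might look like. It assumes the "flush time" mentioned above is rrdcached's -w write timeout (whose default is 300 seconds); the socket path and the helper name rrdcached_cmdline are hypothetical, and the base directory is taken from the perfdata path that appears in the error log in this thread. The -z value (a random per-file write delay, which should not exceed -w) is an illustrative choice, not something the author reported using.

```shell
#!/bin/sh
# Print (do not run) an rrdcached command line for a given write timeout.
# Arguments: write timeout in seconds, UNIX socket path, RRD base directory.
# The socket path is an assumption; adjust it to your installation.
rrdcached_cmdline() {
    write_timeout=$1
    socket=$2
    base_dir=$3
    # -w: write cached values to disk after this many seconds
    # -z: spread writes over a random delay of up to half that time
    # -l: address to listen on, -b: base directory for RRD files
    echo "rrdcached -w $write_timeout -z $((write_timeout / 2)) -l unix:$socket -b $base_dir"
}

# Example:
#   rrdcached_cmdline 1800 /var/run/rrdcached.sock /usr/local/pnp4nagios/var/perfdata
```

The function only prints the command so it can be reviewed before being put into an init script.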
I'm still wondering about the "illegal attempt to update using time..." errors in the logs, though. There is this bug, https://dev.icinga.org/issues/2964, but as I said, I'm using Nagios. Oh, and I'm certain there is only one Nagios process.
--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -
-----Original Message-----
From: checkmk-en-bounces@lists.mathias-kettner.de [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of Sebastian Grewe
Sent: 7 January 2013 15:27
To: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Process_perfdata.pl works slowly
Use find . -type f -name "*.rrd" | wc -l, or you will count the XML files too (which, AFAIK, also get one write per update even with rrdcached enabled). I estimate you have about 12500 RRD files; if that's on a single disk, that's your bottleneck.
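To see the split between RRD and XML files directly, the find command above can be wrapped in a small helper. This is only a sketch: the function name count_files_by_ext is made up for illustration, and the directory in the usage comment is the perfdata path seen in the error log in this thread, which may differ on your system.

```shell
#!/bin/sh
# Count regular files under a directory whose names end in ".<extension>".
# Arguments: directory to search, extension without the leading dot.
count_files_by_ext() {
    # tr strips the padding some wc implementations add around the number
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# Example (path is an assumption taken from the log line in this thread):
#   count_files_by_ext /usr/local/pnp4nagios/var/perfdata rrd
#   count_files_by_ext /usr/local/pnp4nagios/var/perfdata xml
```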
MULTIPLE would just spread the writes across more files; the number of writes is probably still the same.
NB: If anyone has good solutions other than throwing RAIDs or SSDs into the mix, let me know. Not being able to run graphs for ALL our services is our only bottleneck right now.
Cheers,
Sebastian
________________________________________
From: Mäkelä, Antti [Antti.Makela@vintor.fi]
Sent: Monday, 7 January 2013 13:43
To: Sebastian Grewe; checkmk-en@lists.mathias-kettner.de
Subject: RE: Process_perfdata.pl works slowly
OK, find . -type f | wc shows about 25000 files, so I guess that's just too many. I'll try rrdcached.
Another approach might be to change the RRD format to SINGLE instead of MULTIPLE, but I guess that would bring a bunch of new issues.
Thanks!
--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -
-----Original Message-----
From: checkmk-en-bounces@lists.mathias-kettner.de [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of Sebastian Grewe
Sent: 7 January 2013 14:28
To: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Process_perfdata.pl works slowly
How many RRD files are being written? We hit our cap at around 13000 RRD files updated on a 4-disk RAID10. Beyond that, only in-memory storage could keep up with the write requests. We did manage to boost performance with rrdcached, which holds updates in memory and periodically writes them to disk in larger chunks, rather than updating a file EACH time an update comes in.
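A rough back-of-the-envelope calculation shows why batching matters. The sketch below assumes every RRD file receives one update per check interval and that rrdcached writes each file about once per write timeout; both are simplifications. The 60-second check interval is an assumption (the thread does not state it), while the figure of roughly 12500 RRD files and the 300 and 1800 second timeouts come from other messages in this thread. The function name updates_per_second is made up for illustration.

```shell
#!/bin/sh
# Approximate file writes per second (integer division, rounded down).
# Arguments: number of RRD files, seconds between writes to each file.
updates_per_second() {
    echo $(( $1 / $2 ))
}

# Without caching, assuming a 60 s check interval (an assumption):
#   updates_per_second 12500 60     # about 208 file writes per second
# With rrdcached batching for 300 s, then 1800 s:
#   updates_per_second 12500 300    # about 41
#   updates_per_second 12500 1800   # about 6
```

Under these assumptions, going from per-check writes to an 1800-second batch cuts the number of file writes by a factor of about 30, although each write then carries more data.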
You could try iostat or iotop, but I am sure the disk is the bottleneck. Writing RRD files is VERY I/O expensive.
Cheers,
Sebastian
________________________________________
From: checkmk-en-bounces@lists.mathias-kettner.de [checkmk-en-bounces@lists.mathias-kettner.de] on behalf of Mäkelä, Antti [Antti.Makela@vintor.fi]
Sent: Monday, 7 January 2013 12:49
To: checkmk-en@lists.mathias-kettner.de
Subject: [Check_mk (english)] Process_perfdata.pl works slowly
Hi,
We have set up pnp4nagios with npcd + bulk for processing performance statistics.
The thing is, processing seems to take a *really* long time. Even with the process timeout set to something like half an hour, I occasionally get timeouts. Furthermore, some services have gaps in their graphs (while others don't). In "top", the process_perfdata.pl processes appear to spend a lot of time in iowait. fsck reports the file system (ext4) as clean, though, and the HDD should be reasonably fast.
There are several errors like this in the perfdata.log file besides notifications of timeouts:
2013-01-07 09:16:44 [22575] [0] RRDs::update ERROR /usr/local/pnp4nagios/var/perfdata/switch_s01/_HOST__rtmin.rrd: illegal attempt to update using time 1357541160 when last update time is 1357541887 (minimum one second step)
From googling around, there apparently was an issue with Icinga that could produce these errors, but we are using Nagios 3.4.3. Maybe they are related to the slowness?
Is it possible that these errors occur simply because npcd spawns several process_perfdata.pl processes at once, so that performance data for later events gets inserted into an RRD before data for earlier events? The errors seem most prevalent in the _HOST*.rrd files, which are processed reasonably quickly.
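One way to probe the out-of-order hypothesis is to measure how far behind each rejected update was. The following sketch reads perfdata.log lines on standard input and, for every "illegal attempt to update" error, prints the difference in seconds between the last accepted timestamp and the rejected one. The function name rrd_update_lag is hypothetical, and the pattern is based only on the single log line quoted above.

```shell
#!/bin/sh
# For each "illegal attempt to update" line on stdin, print how many seconds
# the rejected timestamp lies before the last update already in the RRD.
rrd_update_lag() {
    # Extract the two epoch timestamps: "<attempted> <last>"; other lines are dropped
    sed -n 's/.*illegal attempt to update using time \([0-9][0-9]*\) when last update time is \([0-9][0-9]*\).*/\1 \2/p' |
    while read -r attempted last; do
        echo $((last - attempted))
    done
}

# Example (adjust the path to wherever your perfdata.log lives):
#   rrd_update_lag < perfdata.log | sort -n | tail
```

For the log line quoted above, the result is 1357541887 - 1357541160 = 727 seconds, so the rejected sample was about 12 minutes older than the newest one in the file. Lags of that size would be consistent with a backlog being processed out of order, though they do not prove it.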
Any ideas on how to look into the slowness issue?
--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -
_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en