[Check_mk (english)] Process_perfdata.pl works slowly

Antti_Makela1 · January 7, 2013, 11:49am

Hi,

We have set up pnp4nagios with npcd + bulk for processing performance statistics.

The thing is, processing takes a *really* long time. Even with the process timeout set to about half an hour, I occasionally get timeouts. Furthermore, some services have gaps in their graphs (but others don't). In "top", the process_perfdata.pl processes appear to spend a lot of time in IOWait. Fsck reports the file system (ext4) as clean, though, and the HDD should be reasonably fast.

There are several errors like this in the perfdata.log file besides notifications of timeouts:

2013-01-07 09:16:44 [22575] [0] RRDs::update ERROR /usr/local/pnp4nagios/var/perfdata/switch_s01/_HOST__rtmin.rrd: illegal attempt to update using time 1357541160 when last update time is 1357541887 (minimum one second step)

From googling around, there apparently was an issue with Icinga which might have resulted in these, but we are using Nagios 3.4.3 instead. Maybe these are related to the slowness?

Is it possible that the above errors are simply caused by npcd spawning several process_perfdata.pl processes at once, so that performance data for later events gets inserted into the RRD before data for earlier events? The errors seem most prevalent in the _HOST*.rrd files, which are processed reasonably quickly.
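
The ordering rule behind that error can be illustrated with a small sketch. This is not RRDtool code; it only models the one constraint stated in the log message, namely that an update must carry a later timestamp than the last accepted one. The function name `find_rejected_updates` is made up for illustration.

```python
def find_rejected_updates(timestamps, last_update=0):
    """Return the timestamps an RRD-like store would refuse.

    Models only the rule quoted in the log message: an update is accepted
    only if its time is strictly greater than the last accepted update time.
    """
    rejected = []
    for ts in timestamps:
        if ts > last_update:
            last_update = ts      # accepted: becomes the new "last update"
        else:
            rejected.append(ts)   # out of order (or duplicate): refused
    return rejected


# The two timestamps from the log line, with the newer one applied first.
arrival_order = [1357541887, 1357541160]
rejected = find_rejected_updates(arrival_order)
print(rejected)                         # the older sample is refused
print(arrival_order[0] - rejected[0])   # seconds by which it was too old
```

If several workers process spool files in parallel, any worker that finishes a newer file first would put an older sample in the position of the refused one here.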

Any ideas on how to look into the slowness issue?

--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -

Sebastian_Grewe · January 7, 2013, 12:28pm

How many RRD files are being written? We hit our limit at around 13,000 RRD files being updated on a RAID10 with 4 disks. Beyond that, only in-memory storage was able to keep up with the write requests. We did manage to boost performance by using rrdcached, which keeps updates in memory and periodically writes them to disk (in larger chunks, rather than updating a file EACH time an update comes in).

You could try iostat or iotop, but I am sure the disk is the bottleneck. Writing RRD files is VERY I/O-expensive.
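
The batching idea described above can be sketched with a toy model. It is not how rrdcached is implemented; the class `ToyUpdateCache`, the file names, and the counts are all hypothetical and serve only to show why buffering reduces the number of write operations.

```python
from collections import defaultdict


class ToyUpdateCache:
    """Buffers updates per file and writes each file once per flush."""

    def __init__(self):
        self.pending = defaultdict(list)  # file name -> buffered updates
        self.disk_writes = 0              # counts simulated write operations

    def update(self, filename, timestamp, value):
        # Nothing touches the "disk" here; the update only sits in memory.
        self.pending[filename].append((timestamp, value))

    def flush(self):
        # One simulated write per file, however many updates it has queued.
        self.disk_writes += len(self.pending)
        self.pending.clear()


cache = ToyUpdateCache()
files = [f"service_{i}.rrd" for i in range(1000)]
for minute in range(5):                    # five rounds of updates ...
    for name in files:
        cache.update(name, 60 * minute, 1.0)
cache.flush()                              # ... written out in one pass

direct_writes = 5 * len(files)             # one write per update, no cache
print(direct_writes, cache.disk_writes)
```

In this model the cached run touches each file once instead of five times. The trade-off is that buffered updates exist only in memory until the flush.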

Cheers,

Sebastian

_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en

Antti_Makela1 · January 7, 2013, 12:43pm

Ok, find . -type f | wc shows about 25000 files, so I guess it's just too many. I'll try rrdcached.

Another approach might be to change the RRD format to SINGLE instead of MULTIPLE, but I guess that would result in a bunch of new issues.

Thanks!

Sebastian_Grewe · January 7, 2013, 1:26pm

Use find -type f -name "*.rrd" | wc -l, or you will count the XML files too (which, AFAIK, still get one write per update even with rrdcached enabled). I estimate you have about 12,500 RRD files. If they sit on a single disk, that is your bottleneck.
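
The same count can be broken down by file type in one pass. The following is a minimal sketch with a hypothetical helper, `count_by_suffix`; the demo uses a throwaway directory with made-up file names so that it runs anywhere.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_by_suffix(root):
    """Count regular files under root, grouped by file extension."""
    return Counter(p.suffix for p in Path(root).rglob("*") if p.is_file())


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    host_dir = Path(tmp) / "switch_s01"
    host_dir.mkdir()
    for name in ("_HOST__rta.rrd", "_HOST__rtmin.rrd", "_HOST_.xml"):
        (host_dir / name).touch()
    demo_counts = count_by_suffix(tmp)

print(demo_counts[".rrd"], "RRD files,", demo_counts[".xml"], "XML files")
```

For the installation in this thread, the directory to pass would be the perfdata path shown in the log line, /usr/local/pnp4nagios/var/perfdata.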

MULTIPLE files would just spread the writes over more files; the number of writes is probably still the same.

NB: If anyone has good solutions other than throwing RAIDs or SSDs into the mix, let me know. Not being able to run graphs for ALL our services is the only bottleneck right now.

Cheers,
Sebastian

Antti_Makela1 · January 7, 2013, 3:03pm

Ok, setting up rrdcached seems to have an effect. I adjusted the default flush time from 300 seconds to 1800 and now things are smooth again. Thanks for the tip.
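
A rough back-of-the-envelope calculation shows why the longer interval helps. It assumes the estimate of 12,500 RRD files from earlier in the thread, and it assumes that every file is written exactly once per interval with the writes spread evenly. Both are simplifications, and the function name is made up.

```python
def average_writes_per_second(rrd_files, write_interval_s):
    """Average file writes per second if every file is written once per interval."""
    return rrd_files / write_interval_s


rrd_files = 12500  # estimate given earlier in the thread
for interval in (300, 1800):
    rate = average_writes_per_second(rrd_files, interval)
    print(f"{interval:>4} s interval: about {rate:.1f} file writes per second")
```

Under these assumptions the average write rate drops by a factor of six, at the cost of more unwritten data sitting in memory.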

I'm still wondering about the "illegal attempt to update using time..." errors in the logs, though. There is this bug, https://dev.icinga.org/issues/2964, but as I said, I'm using Nagios. Oh, and I'm certain there is only one Nagios process.

Antti_Makela1 · January 8, 2013, 7:50am

Well, it turned out things were not so smooth once the Nagios checking ramped up. After a few hours, even with rrdcached, the graph processing lagged behind the checks. Eventually it seems that Nagios stopped scheduling checks altogether.

I'm going to try setting up tmpfs for the spool directory and remove the journaling from rrdcached next.

Sebastian_Grewe · January 8, 2013, 8:01am

Yes, journaling should be disabled, since it generates a lot of I/O by itself. Keep in mind that graph data still held in memory is lost if rrdcached is not shut down cleanly, e.g. due to an unexpected reboot or a killed process.

Putting the spool on a tmpfs did not improve things by a whole lot. When experimenting with performance, we saw the best results from putting the RRD files on a tmpfs with periodic backups to persistent storage. Because of the size of the RRD files, we even tried sharing a tmpfs via NFS so that they could live on a machine with enough memory to hold them. Network traffic increased a bit, but it worked.
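
The periodic backup mentioned above could look roughly like the following sketch. It is not the setup actually used in this thread; the function `backup_rrd_files` and the directory layout are assumptions, and the demo uses two temporary directories standing in for the tmpfs and the disk.

```python
import shutil
import tempfile
from pathlib import Path


def backup_rrd_files(ram_dir, backup_dir):
    """Copy *.rrd files that are new or changed from ram_dir to backup_dir."""
    ram_dir, backup_dir = Path(ram_dir), Path(backup_dir)
    copied = 0
    for src in ram_dir.rglob("*.rrd"):
        dst = backup_dir / src.relative_to(ram_dir)
        # Skip files whose backup copy is already at least as recent.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied += 1
    return copied


# Demo with two throwaway directories standing in for tmpfs and disk.
with tempfile.TemporaryDirectory() as ram, tempfile.TemporaryDirectory() as disk:
    (Path(ram) / "host1").mkdir()
    (Path(ram) / "host1" / "load.rrd").write_bytes(b"dummy")
    first_run = backup_rrd_files(ram, disk)
    second_run = backup_rrd_files(ram, disk)  # nothing changed in between

print(first_run, second_run)
```

A real setup would run something like this on a schedule and would also need to restore the files into the tmpfs after a reboot; neither part is shown here.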

In the end we simply used clustered Icinga with Check_MK distributed monitoring to split the disk load across more machines. That solved our problems. Other solutions included large disk arrays (which was a huge waste of disk space) or simply using an SSD.

Cheers,
Sebastian

Antti_Makela1 · January 8, 2013, 8:47am

Yeah, as a matter of fact we have recently split our existing Nagios setup to two using check_mk multisite. Looks like it needs to be distributed even further.

If I do a 'du' on the perfdata directory, there's about 11 gigabytes of data. Our other site, which seems to work OK (after installing rrdcached), has only 3.5 gigabytes, so it looks like the "manageable" size for the RRD database on our current hardware is somewhere between those two.

Time to split things up even further I guess and get one more server.

···

--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -

-----Original Message-----
From: Sebastian Grewe [mailto:S.Grewe2@bigpoint.net]
Sent: 8. tammikuuta 2013 10:02
To: Mäkelä, Antti
Cc: Sebastian Grewe; checkmk-en@lists.mathias-kettner.de
Subject: Re: Process_perfdata.pl works slowly

Yes, journaling should be disabled since it does a lot of IO by itself. Keep in mind that graph data is lost if rrdcached is not shutdown cleanly, e.g. Due to an unexpected reboot or killed process.

Putting spool on a tmpfs did not improve things by a whole lot. When playing with performance we have seen the best results putting rrd files on a tmpfs with periodic backups to a storage. We even tried shared tmpfs via NFS due to the size of the rrd files to relocate them on a machine with lots of memory to hold them. Network traffic increased a bit but it worked.

In the end we simply used clustered Icinga with Check_MK distributed monitoring to split the disk load on more machines. That solved our problems. Other solutions included large disk arrays - which was a huge waste of disks pace - or simply using a SSD.

Cheers,
Sebastian

On 08.01.2013, at 08:51, "Mäkelä, Antti" <Antti.Makela@vintor.fi> wrote:

Well, turned out things were not so smooth after the Nagios checking ramped up. After a few hours, even with rrdcached, the graph processing lagged behind checks. Eventually it seems that Nagios stopped scheduling checks altogether.

I'm going to try setting up tmpfs for the spool directory and remove the journaling from rrdcached next.

--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -

-----Original Message-----
From: Mäkelä, Antti
Sent: 7. tammikuuta 2013 17:03
To: 'Sebastian Grewe'; checkmk-en@lists.mathias-kettner.de
Subject: RE: Process_perfdata.pl works slowly

Ok, setting up rrdcached seems to have an effect. I adjusted the default flush time from 300 seconds to 1800 and now things are smooth again. Thanks for the tip.

I'm still wondering about the "illegal attempt to update using time..." errors in the logs though. I mean, there's this bug, https://dev.icinga.org/issues/2964 - but like I said, I'm using Nagios. Oh, and I'm certain there is only one Nagios process.
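The rule behind that error message can be modelled in a few lines: an update is only accepted if its timestamp is later than the file's last update. The sketch below does not call RRDtool at all. The function name and the sample values are made up for illustration; only the two timestamps come from the log line quoted in this thread.

```python
from typing import List, Tuple

Update = Tuple[int, str]  # (unix timestamp, value as text)


def split_updates(last_update: int, updates: List[Update]):
    """Sort pending updates by timestamp and split them into those that
    would be accepted and those rejected as not newer than the last update."""
    accepted, rejected = [], []
    for ts, value in sorted(updates, key=lambda u: u[0]):
        if ts > last_update:
            accepted.append((ts, value))
            last_update = ts
        else:
            rejected.append((ts, value))
    return accepted, rejected


# Timestamps from the logged error; the values "0.1" and "0.2" are invented.
accepted, rejected = split_updates(
    1357541887,
    [(1357541947, "0.2"), (1357541160, "0.1")],
)
print("accepted:", accepted)
print("rejected:", rejected)
```

In the logged case the rejected sample is 727 seconds older than the last update. That is consistent with perfdata being processed out of order, though the sketch itself does not prove that this is the cause.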

--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -

-----Original Message-----
From: checkmk-en-bounces@lists.mathias-kettner.de [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of Sebastian Grewe
Sent: 7 January 2013 15:27
To: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Process_perfdata.pl works slowly

Use find -type f -name "*.rrd" | wc -l instead, or you would count the XML files too (which, AFAIK, also get one write per update even with rrdcached enabled). I am estimating you have about 12500 RRD files; if that's on a single disk, that's your bottleneck.
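For a quick breakdown of what is actually in the perfdata directory, the same count can be done per file extension. This is a minimal sketch; the function name is hypothetical, and the path in the usage note is the one from the error message quoted in this thread.

```python
import os
from collections import Counter


def count_by_extension(perfdata_dir: str) -> Counter:
    """Walk perfdata_dir recursively and count regular files per
    lower-cased extension, e.g. '.rrd' and '.xml'."""
    counts = Counter()
    for _root, _dirs, files in os.walk(perfdata_dir):
        for name in files:
            counts[os.path.splitext(name)[1].lower()] += 1
    return counts
```

Calling count_by_extension("/usr/local/pnp4nagios/var/perfdata") and reading the ".rrd" entry should give the same number as the find command above.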

MULTIPLE files would just spread the writes across more files; the number of writes is probably still the same.

NB: If anyone has good solutions other than throwing RAIDs or SSDs into the mix, let me know. Not being able to run graphs for ALL our services is the only bottleneck right now.

Cheers,
Sebastian

________________________________________
From: Mäkelä, Antti [Antti.Makela@vintor.fi]
Sent: Monday, 7 January 2013 13:43
To: Sebastian Grewe; checkmk-en@lists.mathias-kettner.de
Subject: RE: Process_perfdata.pl works slowly

Ok, find . -type f | wc shows about 25000 files, so I guess it's just too many. I'll try rrdcached.

Another approach might be to change the RRD format to SINGLE instead of MULTIPLE, but I guess that would result in a bunch of new issues.

Thanks!

--
- Antti Mäkelä | Senior Architect | CCIE #20962 -
- Vintor Oy, Itsehallintokuja 6, 02600 Espoo | www.vintor.fi -

-----Original Message-----
From: checkmk-en-bounces@lists.mathias-kettner.de [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of Sebastian Grewe
Sent: 7 January 2013 14:28
To: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Process_perfdata.pl works slowly

How many RRD Files are written? We have had our cap at somewhere around 13000 RRD files updated on a RAID10 with 4 disks. Beyond that only in-memory storage was able to keep up with the write requests. We did manage to boost performance by using rrdcached which stores updates in memory and periodically writes them to disk (in larger chunks than simply updating a file EACH time an update comes in).

You could try using iostat or iotop, but I am sure the disk is the bottleneck. Writing RRD files is VERY IO expensive.
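If neither tool is installed, the IOWait share that top reports can be approximated from two samples of the first line of /proc/stat on Linux. This is a rough sketch with hypothetical function names. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard one for the aggregate "cpu" line.

```python
def parse_cpu_line(line: str) -> list:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(x) for x in fields[1:]]


def iowait_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two samples."""
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(b, a)]
    # Sum user..steal only; guest time is already included in user/nice.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()
```

To use it, call read_cpu_line(), wait a few seconds, call it again, and pass both lines to iowait_share(). A consistently high share while process_perfdata.pl is running would point at the disk, in line with the suspicion above.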

Cheers,

Sebastian

_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en

Ashish_Jain · January 9, 2013, 7:29pm

Have you tried using a ramdisk for the spool and temp locations used by Nagios? Also, put the rrdcached journal on the ramdisk too.

-Ashish

···

-----Original Message-----
From: checkmk-en-bounces@lists.mathias-kettner.de [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of Mäkelä, Antti
Sent: Tuesday, January 08, 2013 12:47 AM
To: Sebastian Grewe
Cc: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Process_perfdata.pl works slowly