[Check_mk (english)] how to read the speed-o-meter and general perf talk

Jason_Humes · August 2, 2012, 12:26pm

Hi

Just looking at how well our server is performing…perhaps you could give me some thoughts/feedback.

We currently have 436 hosts with a total of 20276 services being monitored. The server seems pretty responsive, but we’re starting to see a bit of lag in the popup graphs, and a couple of our special templates only half draw on the first click; the second click draws the full dataset. I’m wondering if we’re starting to hit a performance ceiling. The speed-o-meter always reads full right (100), and hovering over it pops up ‘Scheduled check-rate 338.8/s, current rate 492.6/s, that is 145% of the scheduled rate’.

Is that 145% of the scheduled rate a bad thing or a good thing? It seems to mean that we’re currently checking faster than scheduled, which would be good and would mean we’ve still got some breathing room, but the server has still felt a little disk I/O bound recently.

Are there any disk tweaks anyone uses to speed up RRD performance? On past systems we’ve tuned things such as the mount options on the RRD partition (nodiratime, noatime, dir_index and data=writeback) and also reduced block-level read-ahead (blockdev --setra 64 /dev/xxxxx).
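To see which of these options are already active on a partition, the options column of /proc/mounts can be compared against a wish list. The following is only a rough Python sketch: the function name, the device and the mount point /var/lib/rrd are hypothetical, and the sample line stands in for real /proc/mounts content. dir_index is an ext filesystem feature rather than a mount option, so it is left out here.

```python
def missing_mount_options(mounts_text, mount_point, wanted):
    """Return the wanted options that are not active on mount_point."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            active = set(fields[3].split(","))
            return [opt for opt in wanted if opt not in active]
    raise LookupError(f"{mount_point} not found in mount table")


# Hypothetical sample line; on a real system, read the text from /proc/mounts.
sample = "/dev/sdb1 /var/lib/rrd ext3 rw,noatime,data=writeback 0 0\n"
wanted = ["noatime", "nodiratime", "data=writeback"]
print(missing_mount_options(sample, "/var/lib/rrd", wanted))
```

With the sample line above, only nodiratime is reported as missing.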

Thoughts?

Thanks J

Jason D. Humes

Security Engineer

Applied Computer Solutions Inc.

3020 St. Etienne Blvd

Windsor, Ontario, N8W 5E6

Phone: (519) 944-4300x211

Florian_Heigl1 · August 21, 2012, 1:21pm

Just found this post, that would have answered quite a few of the
questions.
Answering "for the record".

On Thu, 2 Aug 2012 12:26:01 +0000, Jason Humes wrote:

We've currently got 436 hosts with a total of 20276 services being
monitored. The server seems pretty responsive but we're starting to
see a bit of lag in the popup graphs and a couple of our special
templates only half draw on the first click, the second click will
draw the full dataset...wondering if we're starting to hit a
performance ceiling. The speed-o-meter always reads full right/100
and hovering it pops up 'Scheduled check-rate 338.8/s, current rate
492.6/s, that is 145% of the scheduled rate'

It should normally stay around 100%.
Reasons for it exceeding 100% are, for example, on-demand host checks or
unstable Nagios performance (which I think is what you were experiencing).
Then it will sometimes drop to 200ish and then needs to run much faster
to make up for it, unsteadily.
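For reference, the percentage in the hover text is simply the current check rate divided by the scheduled check rate. A minimal Python sketch using the numbers quoted above (the function name is made up for illustration and is not part of Check_MK):

```python
def rate_percent(scheduled_per_s: float, current_per_s: float) -> float:
    """Return the current check rate as a percentage of the scheduled rate."""
    if scheduled_per_s <= 0:
        raise ValueError("scheduled rate must be positive")
    return 100.0 * current_per_s / scheduled_per_s


# Values taken from the hover text quoted above.
print(f"{rate_percent(338.8, 492.6):.0f}%")  # prints 145%
```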

If you see anything like that, it's also very enlightening to run
"service cpuspeed stop" and see if everything starts running much more
smoothly. From what I saw, this is a common issue.

Disk tuning, fiddling with the I/O flush interval, or (solid-state) disk
hardware has less effect than an RRDCached + ramdisk based setup like the
one we use.

A good indicator for "is my server keeping up" is to make sure that
your Nagios process doesn't go over 70% CPU for more than 0.5-2s per
minute.
<iterate>Anything longer and the poor thing will have to scramble and
reschedule.
While it does that, it will not check.
And when it wants to resume checking, it will all of a sudden
find out it needs to reschedule some checks because they
missed their moment of fame.</iterate>
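The rule of thumb above can be turned into a small check. This is only a sketch under assumptions: the 70% threshold and the upper 2 s budget come from the text above, the percentage is read as CPU usage of the Nagios process, and the sampling (one reading every 0.5 s, collected by whatever tool you prefer) and all function names are hypothetical.

```python
from typing import Sequence


def seconds_over(samples: Sequence[float], interval_s: float,
                 threshold: float = 70.0) -> float:
    """Total seconds during which the sampled CPU% was above the threshold."""
    return sum(interval_s for cpu in samples if cpu > threshold)


def keeping_up(samples: Sequence[float], interval_s: float,
               threshold: float = 70.0, budget_s: float = 2.0) -> bool:
    """True if time spent above the threshold stays within the budget."""
    return seconds_over(samples, interval_s, threshold) <= budget_s


# One minute of made-up samples taken every 0.5 s (120 readings):
# six readings at 85% add up to 3.0 s above the 70% threshold.
minute = [35.0] * 114 + [85.0] * 6
print(seconds_over(minute, 0.5))  # 3.0
print(keeping_up(minute, 0.5))    # False
```

Feed it one minute of readings at a time; a minute that returns False is one where, by the rule above, Nagios would probably have to reschedule.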

Greets,
Florian


--
Mathias Kettner GmbH
Registergericht: Amtsgericht München, HRB 165902
Firmensitz: Preysingstraße 74, 81667 München
Geschäftsführer: Mathias Kettner

Tel. 089 / 1890 4210
Fax 089 / 1890 4211
http://mathias-kettner.de

Jason_Humes · August 21, 2012, 1:25pm

Great, thanks for the info. Can you share some insight into RRDCached/RAMDisk?

Thanks

J


Jason_Humes · August 21, 2012, 2:00pm

Since installing our SSD, our performance issues have gone away, except for the half-drawn PNP graphs/special templates. The speed-o-meter actually reads an even higher percentage, usually around 180-190%, but we see no issues, missed checks, gaps or anything else wrong... so I'm not sure I understand the logic, but OK.

This is running OMD nightly/check_mk nightly.

Thanks J
