[Check_mk (english)] High system load on check_mk host

I’m having a huge problem with the check_mk installation at my current employer. The system is hovering around 35-40% load constantly, and the graphs we need for trending take up around 160 MB per host. Not much in isolation, but when you scale that out to encompass just our customer-facing VMs, leaving all of our infrastructure out of it, we run out of disk space. The amount of disk I/O the monitoring system needs has exceeded what our SAN is capable of. From what I’ve seen there isn’t a way to really combat this problem.

Is it possible to drop .rrd graphing and go with something like Cacti instead? Can all of this data be sent to a database? Is something possibly configured wrong, or is there something else I need to be looking at to figure out what’s going on?

As it stands right now, while check_mk is a great plugin for Nagios, I can’t scale our monitoring with this kind of load on the system. I’ve only got about half of our production in, and I refuse to add the rest of our environment since I’m worried it will take the entire box down (as it is, I already have to restart apache/check_mk 2-3 times a day). Any help and/or guidance would be greatly appreciated!
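To put the RRD footprint in perspective, a rough back-of-the-envelope calculation (the 500-host count is only an assumed example, not our actual number):

    160 MB of RRDs per host x 500 hosts = 80,000 MB ≈ 78 GB of RRD data,
    all of it rewritten in small random writes on every check interval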

How many hosts are you monitoring? Are you using check_mk exclusively, or with SNMP and legacy/active checks as well? What's the ratio between the check types? We had similar performance issues on a host monitoring roughly 400 hosts. We enabled livecheck and that reduced the load considerably. I/O was not an issue in our case, and we are using local 10K SAS disks.
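For reference, livecheck is switched on via options on the Livestatus broker module line in nagios.cfg. The paths and the helper count below are assumptions from our setup; check the Check_MK documentation for your version, since livecheck is still experimental:

    # nagios.cfg (paths assumed, adjust to your install)
    broker_module=/usr/lib/check_mk/livestatus.o /var/lib/nagios/rw/live livecheck=/usr/lib/check_mk/livecheck num_livecheck_helpers=20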

···

On Aug 2, 2012, at 9:00 AM, Dan Babb <cerberus@dividebyfail.org> wrote:

[ original message quoted above ]

Hi,

Is that system based on OMD or a manual install?
With the latter you might be missing a lot of optimizations (rrdcached, tmpfs for the most volatile data) that reduce I/O load by a factor of around 50.
You'll have to build in the same changes, and it should get a lot better!
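To illustrate what those optimizations look like outside OMD (all paths, sizes and timer values here are assumptions to adapt to your layout): run rrdcached so RRD updates are batched in RAM and flushed in bulk, and keep Nagios' volatile files on tmpfs:

    # batch RRD updates: flush each RRD only once an hour (-w), spread the
    # flushes over 30 minutes (-z), keep a journal for crash safety (-j)
    rrdcached -w 3600 -z 1800 -j /var/lib/rrdcached/journal \
        -l unix:/var/run/rrdcached.sock -b /var/lib/pnp4nagios/perfdata

    # /etc/fstab: status.dat, checkresults etc. on tmpfs instead of disk
    tmpfs  /var/lib/nagios/tmp  tmpfs  size=256m,mode=755  0  0

PNP4Nagios then has to be pointed at the same socket (RRD_DAEMON_OPTS in process_perfdata.cfg) so the updates actually go through the daemon.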

I don't think a database-based solution would do better, at least not the ones I've seen :wink:
Disabling the graphs / RRDs is of course possible, but even during benchmarking I rarely did that.
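(If you want to try that anyway, the Nagios side is a single directive in nagios.cfg; depending on the setup you may also have to drop the perfdata processing commands:)

    # nagios.cfg: stop perfdata processing, and with it all RRD updates
    process_performance_data=0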

So are you using OMD and how many services does the setup have?

Flo

···

On 02.08.2012, at 08:00, Dan Babb <cerberus@dividebyfail.org> wrote:

[ original message quoted above ]

Hi Dan,

Cacti uses RRDs as well, so it won't solve the problem you're facing right now.

The only other thing I could suggest is using multisite to spread the load across multiple hosts, so you don't have just one server doing everything.

Of course, if they're all on the same SAN, your I/O problems won't be helped.
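For the record, a distributed setup is mostly a matter of listing the remote sites in multisite.mk on the central server; the site names and hostnames below are made up for illustration:

    # multisite.mk (assumed example)
    sites = {
        "central": {
            "alias": "Central site",
        },
        "slave1": {
            "alias": "Second monitoring server",
            "socket": "tcp:monitor2.example.com:6557",
            "url_prefix": "http://monitor2.example.com/slave1/",
        },
    }

Each remote site then only needs its Livestatus socket exposed over TCP (OMD does this with a small xinetd service on port 6557).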

How many hosts are you monitoring?
Steve

···

Sent from a mobile device.

-----Original Message-----
From: Florian Heigl <fh@mathias-kettner.de>
Sender: checkmk-en-bounces@lists.mathias-kettner.de
Date: Thu, 2 Aug 2012 09:05:23
To: Dan Babb <cerberus@dividebyfail.org>
Cc: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] High system load on check_mk host

[ Florian's reply and the original message quoted above ]

Hi Dan,

[ I/O problem with bigger check_mk deployments ]

If you have not already applied all the performance tuning to the monitoring stack, you should either do that or use OMD to get the best performance.
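(OMD ships Nagios, Check_MK, Livestatus and PNP4Nagios with rrdcached and the tmpfs layout preconfigured; a fresh site takes two commands, the site name being whatever you choose:)

    omd create mysite
    omd start mysite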

If it's still too slow...
Virtualization is not appropriate for this kind of workload, just as it isn't for high-throughput DB servers.
For bigger deployments get a dedicated box and some SSDs for the RRD data, and be done with it.
(Those SSDs don't have to be fully "server grade"; we are using Intel 520s with great success, and you simply won't encounter anything near an I/O performance problem again.)
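Moving the RRDs onto the SSD is just a mount (or a symlink); the device and path below are assumptions for a PNP4Nagios layout:

    # /etc/fstab: put the perfdata/RRD tree on the SSD
    /dev/sdb1  /var/lib/pnp4nagios/perfdata  ext4  noatime  0  2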

kind regards,
Michael Gebetsroither

···

On 2012-08-02 08:00, Dan Babb wrote:

I'd like to add a few things.
First of all, if you look at the CPU time used by the Python processes only, you'll probably see Check_MK is using something like 3-5%.
The rest is usually Nagios: Nagios thrashing the disk with status.dat updates, Nagios forking causing a lot of SYS time, or RRD updates thrashing the disk.

That's why livecheck brings a lot of benefit, since it takes the
forking out of the picture. It's still experimental, but this is the
best thing you can do to make a single "core" scale higher.

Just please keep in mind the whole server would be on fire if you tried
the same w/o Check_MK.

Here is something about the RRD updates; it should help you check whether you are hitting unexpected performance issues.

http://mathias-kettner.de/checkmk_checkmk_benchmarks.html

Look at the graphs at the end of the article, and the "last" values in them: 4k hosts / 160k services with RRDs equal around 300 IOPS at under 2 MB/s (after the initial spike for creating the RRDs).
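To see why that is so low: 300 IOPS spread over 160k services (roughly one RRD per service assumed) means each RRD is touched only about every nine minutes, which is exactly the batching effect of rrdcached:

    160,000 services / 300 IOPS ≈ 533 s ≈ 9 min between writes per service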

If you see more IOPS / MB/s for fewer services, then it's something in your setup, and you can use e.g. iotop or blktrace to find it.
We used to cover that in our advanced Nagios class, but dropped it since it's no longer relevant thanks to the optimizations in OMD.
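For example (standard tools; the device name is an assumption):

    # show only processes actually doing I/O, accumulated per process
    iotop -o -P -a

    # trace block-layer activity on the disk holding the RRDs
    blktrace -d /dev/sda -o rrdtrace
    blkparse -i rrdtrace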

You also see the CPU usage was quite high in that test. I think
on a current Xeon you can take 20% off that number.

Greetings,
Florian

···

On Thu, 2 Aug 2012 00:00:30 -0600 Dan Babb <cerberus@dividebyfail.org> wrote:

[ original message quoted above ]

--
Mathias Kettner GmbH
Registergericht: Amtsgericht München, HRB 165902
Firmensitz: Preysingstraße 74, 81667 München
Geschäftsführer: Mathias Kettner

Tel. 089 / 1890 4210
Fax 089 / 1890 4211
http://mathias-kettner.de