I understand your concerns. This is what I did in our environments.
First I used the agent only for the standard checks and ran the special ones as classic Nagios checks. Then I ported most of the classic checks to Check_MK.
I paid attention to time-consuming checks like the old check_esx3.pl; those I leave on the 5/1 minute interval.
In the end I have a system where every host is checked once per minute by the host check and by one active Check_MK check. In your setup this would mean
about 1500 host checks and 1500 service checks per minute on 16 cores, which should be no problem. I would also recommend leaving the classic checks on the longer check
interval until you know whether your setup can handle it.
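The load estimate above can be turned into a rough checks-per-second figure. This is only back-of-the-envelope arithmetic using the numbers quoted in this thread; the helper name is made up for illustration. It also ignores that one active Check_MK check (which fetches the whole agent output) probably costs more than a typical classic check, so the two rates are not directly comparable.

```python
def checks_per_second(checks, interval_minutes):
    """Average check executions per second when each check runs once per interval."""
    return checks / (interval_minutes * 60.0)

HOSTS = 1500   # rounded host count used in this thread
CORES = 16     # 2 x 8-core CPUs

# One host check plus one active Check_MK check per host, every minute.
checks_per_minute = HOSTS + HOSTS
new_rate = checks_per_second(checks_per_minute, 1)

# For comparison: 15121 classic active service checks on a 5 minute interval.
old_rate = checks_per_second(15121, 5)

print(f"1 minute Check_MK setup: {new_rate:.1f} checks/s, "
      f"{new_rate / CORES:.1f} per core")
print(f"5 minute classic setup:  {old_rate:.1f} checks/s")
```

Under these assumptions the number of active check executions per second stays in the same range as before, which is why the one minute interval looks feasible on paper.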
br
Andreas
2014-03-14 2:23 GMT+01:00 Gavin Grieve [DATACOM] <Gavin.Grieve@datacom.co.nz>:
I guess we worry about scalability. We're not exclusively a
Check_MK shop ... yet - we still have some classic-style checks
(which we can do the 5m/1m polling with) although I'm slowly
working my way through them and redoing them as Check_MK style
tests so they can be integrated properly with Check_MK and
inventorised.
Our largest standalone monitoring server is polling 1477 hosts
with 15121 active service checks using Nagios 3.4.4 on a physical
HP DL360 G8 server (2x Intel E5-2690 8 core Xeon CPUs, 24GB RAM,
15k RPM disks). When we first put this server in we weren't
getting through our normal 5 minute poll cycle (far from it
actually). Installing Mod-Gearman solved that problem and quite
spectacularly too. Adding Livestatus and Thruk as a replacement
frontend solved UI problems the standard Nagios interface was
having (taking forever to do anything which I assume was just the
time spent processing the status.dat file). We moved a lot of the
random I/O off to ramdisk too.
The biggest concern we have with moving to 1 minute checks is what
impact this will have on our servers.
My initial thought was that if I could get the Check_MK service
to represent the worst state of all the passively monitored
services, we could retain our 5 minute poll interval and 1
minute fault-condition retry interval. Then I thought of roughly
what Doug mentioned: when any bad state is found, have the active
Check_MK check submit a command to reschedule its own next check
for now+60 seconds.
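A minimal sketch of that rescheduling idea, assuming the check can write to the Nagios external command file. SCHEDULE_FORCED_SVC_CHECK is a standard Nagios external command; the command file path, the function names, and the way the passive states are passed in are assumptions for illustration and are not part of Check_MK.

```python
import time

# Assumption: location of the Nagios external command pipe; it differs per install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

def reschedule_command(host, service="Check_MK", delay=60, now=None):
    """Build a SCHEDULE_FORCED_SVC_CHECK line for now + delay seconds."""
    now = int(time.time()) if now is None else int(now)
    return f"[{now}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{now + delay}\n"

def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the external command pipe."""
    with open(command_file, "w") as pipe:
        pipe.write(line)

def maybe_reschedule(host, service_states, submit_fn=submit, now=None):
    """Reschedule the active check early if any passive state is not OK (0)."""
    if any(state != 0 for state in service_states):
        submit_fn(reschedule_command(host, now=now))
        return True
    return False
```

For example, `maybe_reschedule("web01", [0, 2])` would queue a forced check of the Check_MK service on the hypothetical host web01 one minute from now. If checks run on remote workers (for example via Mod-Gearman), the worker may not be able to reach the command pipe, so this would need another submission path.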
So at the moment we're going to go ahead with 1 minute polls but I
have to think about what we're going to do if we find
performance/scalability issues on our monitoring servers -
especially for larger deployments or ones where our server is some
sort of virtual machine that doesn't have the performance of our
physical servers.
--
Gavin Grieve
From: Andreas Döhler [mailto:andreas.doehler@gmail.com]
Sent: Friday, 14 March 2014 4:40 a.m.
To: Gavin Grieve [DATACOM]
Cc: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Check_MK and Polling Intervals
Hi,
You can only define the poll interval for active checks. As you
already mentioned, this only affects the Check_MK service.
Is there a problem with polling every minute? In my installations
the standard setting of one minute causes no problems.
br
Andreas
2014-03-11 22:28 GMT+01:00 Gavin Grieve [DATACOM]
<Gavin.Grieve@datacom.co.nz>:
Hi,
Our normal monitoring configuration involved polling at 5
minute intervals, and then reducing the poll interval to 1
minute when a problem state is discovered on any of the
monitored hosts/services.
With the move to Check_MK (which we love and are extending as
much as we can with our own checks - which I hope to pass back
or release sometime) we've found that we can't do this anymore
as the only active check against hosts is the Check_MK service
- and when any of the monitored services are in a bad state
Check_MK is still OK so will still poll at 5 minute intervals.
I've been hunting around to see if it's possible to have the
Check_MK service state represent the worst severity of the
monitored services to force it into a faster poll interval if
anything being monitored on the server is in a bad state but
haven't found anything as yet. I did see mention of
aggregate_check_mk however this appears to hook into Check_MK
BI - which I don't think will do what we want.
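The "worst severity" idea can be sketched as a small helper. This is not an existing Check_MK option, only an illustration of the aggregation being asked for. The severity ranking (CRITICAL worse than UNKNOWN) is an assumption; some sites may prefer to rank UNKNOWN highest.

```python
# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: severity ranking with CRITICAL treated as worse than UNKNOWN.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}

def worst_state(states):
    """Return the most severe state in `states`, or OK for an empty list."""
    return max(states, key=SEVERITY.__getitem__, default=OK)
```

If the active Check_MK service reported `worst_state(...)` of its passive services, the usual Nagios retry interval would apply to it whenever anything on the host is in a bad state.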
Does anybody know if this sort of normal poll interval vs
retry poll interval is possible with Check_MK? If not I
suspect I may need to get my code on. :-)
--
Gavin Grieve
_______________________________________________
checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en