[Check_mk (english)] Check_MK and Polling Intervals

Hi,

Our normal monitoring configuration involved polling at 5-minute intervals, then reducing the poll interval to 1 minute whenever a problem state was discovered on any of the monitored hosts/services.

With the move to Check_MK (which we love and are extending as much as we can with our own checks, which I hope to pass back or release sometime) we’ve found that we can’t do this anymore: the only active check against each host is the Check_MK service, and when any of the monitored services is in a bad state the Check_MK service itself is still OK, so it keeps polling at 5-minute intervals.

I’ve been hunting around to see if it’s possible to have the Check_MK service state represent the worst severity of the monitored services, to force it into a faster poll interval if anything being monitored on the server is in a bad state, but haven’t found anything as yet. I did see mention of aggregate_check_mk; however, this appears to hook into Check_MK BI, which I don’t think will do what we want.

Does anybody know if this sort of normal poll interval vs retry poll interval is possible with Check_MK? If not I suspect I may need to get my code on. :)

···

Gavin Grieve

Hi,

you can only define the poll interval for active checks. As you already mentioned, this only affects the Check_MK service.

Is there a problem with polling every minute? In my installations the standard setting of one minute causes no problems.

br

Andreas

···

2014-03-11 22:28 GMT+01:00 Gavin Grieve [DATACOM] Gavin.Grieve@datacom.co.nz:

checkmk-en mailing list

checkmk-en@lists.mathias-kettner.de

http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en

You could probably set up an event handler to reschedule an active check of the Check_MK service when a critical result is found. You could calculate the time in the future at which you want it to run (60 seconds later? 120?) and then tell Nagios to schedule it accordingly.
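A minimal sketch of that idea in Python, under some assumptions: Nagios accepts external commands written to its command file, and SCHEDULE_FORCED_SVC_CHECK is the external command that forces a service check at a given time. The function names, the 60-second default and the command file path are illustrative only and not from this thread; the path in particular depends on the installation.

```python
import time


def forced_check_command(host, service="Check_MK", delay=60, now=None):
    """Build a Nagios external command line that forces a check of
    `service` on `host`, `delay` seconds after `now`."""
    now = int(time.time()) if now is None else int(now)
    check_time = now + delay
    return f"[{now}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{check_time}\n"


def submit(command, command_file):
    """Write one command line to the Nagios command file.
    `command_file` is site-specific (assumption: passed in by the caller)."""
    with open(command_file, "w") as pipe:
        pipe.write(command)


def handle_event(host, service_state, command_file, delay=60):
    """Event handler entry point: reschedule only on a CRITICAL result.
    Returns True if a command was submitted."""
    if service_state != "CRITICAL":
        return False
    submit(forced_check_command(host, delay=delay), command_file)
    return True
```

In an event handler command definition, the host name and state would typically be passed in through Nagios macros such as $HOSTNAME$ and $SERVICESTATE$.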

–Doug

···

On Thu, Mar 13, 2014 at 11:39 AM, Andreas Döhler andreas.doehler@gmail.com wrote:


I guess we worry about scalability. We’re not exclusively a Check_MK shop … yet - we still have some classic style checks (which we can do the 5m/1m polling with) although I’m slowly working my way through them and redoing them as Check_MK style tests so they can be integrated properly with Check_MK and inventorised.

Our largest standalone monitoring server is polling 1477 hosts with 15121 active service checks using Nagios 3.4.4 on a physical HP DL360 G8 server (2x Intel E5-2690 8-core Xeon CPUs, 24GB RAM, 15k RPM disks). When we first put this server in we weren’t getting through our normal 5-minute poll cycle (far from it, actually). Installing Mod-Gearman solved that problem, and quite spectacularly too. Adding Livestatus and Thruk as a replacement frontend solved the UI problems the standard Nagios interface was having (taking forever to do anything, which I assume was just the time spent processing the status.dat file). We moved a lot of the random I/O off to ramdisk too.

The biggest concern we have with moving to 1 minute checks is what impact this will have on our servers.

My initial thought was that if I could get the Check_MK service to represent the worst state of all the passively monitored services, then we could retain our 5-minute poll interval and 1-minute fault-condition retry interval. Then I thought of roughly what Doug mentioned: when any bad state is found, submit a command as part of the active Check_MK check to reschedule its own next check for now+60 seconds.
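To make the "worst state" idea concrete, here is a rough Python sketch. It assumes the usual Nagios plugin state codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and an assumed severity ordering in which CRITICAL outranks UNKNOWN; the names and the 300/60 second values are illustrative and simply mirror the 5m/1m scheme described above.

```python
# Nagios plugin state codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Assumed severity ranking (low to high): OK < WARNING < UNKNOWN < CRITICAL.
SEVERITY = {0: 0, 1: 1, 3: 2, 2: 3}

NORMAL_INTERVAL = 300  # seconds between polls while everything is OK
RETRY_INTERVAL = 60    # seconds between polls while anything is in a bad state


def worst_state(states):
    """Return the most severe state code in `states` (OK if the list is empty)."""
    return max(states, key=lambda s: SEVERITY[s], default=0)


def next_check_delay(states):
    """Pick the delay until the next active check from the passive service states."""
    return NORMAL_INTERVAL if worst_state(states) == 0 else RETRY_INTERVAL
```

With something like this, the active check would only need to know the states of the host's passive services to decide whether to ask for an earlier recheck.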

So at the moment we’re going to go ahead with 1 minute polls but I have to think about what we’re going to do if we find performance/scalability issues on our monitoring servers - especially for larger deployments or ones where our server is some sort of virtual machine that doesn’t have the performance of our physical servers.

···

Gavin Grieve


I understand your concerns. What I did in our environments is the following.
First I used the agent for only the standard checks and ran the special ones as classic Nagios checks. Then I ported most of the classic checks to Check_MK.

I paid attention to time-consuming checks like the old check_esx3.pl - these checks I leave on a 5/1 minute interval.

In the end I get a system where I check every host once per minute with the host check and one active Check_MK check. In your setup this would mean 1500 host checks and 1500 service checks per minute on 16 cores. That should be no problem. I would also recommend leaving the classic checks on the longer check interval until you know whether your setup can handle it.
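As a back-of-the-envelope comparison using only the figures from this thread, the arithmetic can be sketched in Python. It counts check executions only and assumes every check costs the same, which is a simplification: it ignores host checks in the current setup, any classic checks that remain, and the fact that one Check_MK run processes many services.

```python
def checks_per_second(checks, interval_seconds):
    """Average number of check executions per second for a given poll interval."""
    return checks / interval_seconds


# Current setup: 15121 active service checks on a 5-minute cycle.
current = checks_per_second(15121, 5 * 60)

# Proposed setup: roughly 1500 host checks plus 1500 Check_MK checks per minute.
proposed = checks_per_second(1500 + 1500, 60)

print(f"current:  {current:.1f} checks/s")
print(f"proposed: {proposed:.1f} checks/s")
```

On these numbers the two rates come out close to each other, at roughly 50 executions per second.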

br

Andreas

···

2014-03-14 2:23 GMT+01:00 Gavin Grieve [DATACOM] Gavin.Grieve@datacom.co.nz:


*replies from home*
Yep, that's what we were planning to do for now. We have a number of in-house service checks we've written over the years and I'm slowly replacing them with Check_MK versions as I can, which should also reduce the load on our servers down the track.

Thanks for the assist.

···

On 14/03/2014 7:28 p.m., Andreas Döhler wrote:
