[Check_mk (english)] Check_MK 0 values and long execution time for stacked Juniper EX2200 switches

Hi All,

We are still working on this issue. It got to the point where I had to set up OMD on a virtual instance at JTAC to prove we are having issues. After a lot of emails back and forth, I’ve had the reply below from Juniper, and was wondering if someone with more in-depth knowledge of check_mk and SNMP could comment?

—snip—

The way they are polling is not optimized, please have a
look at the following 2 documents, share these with customer. They will
understand if we give reason and explain what is wrong with this kind of column
based bulk walking. Their polling pattern is wasting a lot of CPU cycles, and
causing unnecessary communication between mib2d and kernel.

Optimizing the Network Management System Configuration for
the Best Results

http://www.juniper.net/techpubs/en_US/junos12.2/topics/task/configuration/snmp-best-practices-nms-optimizing…html

~ Change the polling pattern to Row-by-Row basis, i.e. poll
all the required table attributes for a given "Index" before moving on to next
index.

Example:

Query 1 OIDs: → ifHCInOctets.1 ifHCInUcastPkts.1
ifHCInMulticastPkts.1 ifHCInBroadcastPkts.1

Query 2 OIDs: → ifHCInOctets.2 ifHCInUcastPkts.2
ifHCInMulticastPkts.2 ifHCInBroadcastPkts.2

is much much better than doing the following [this is what
the customer is doing right now]:

Query 1 OIDs: → ifHCInOctets.1 ifHCInOctets.2
ifHCInOctets.3 ifHCInOctets.4

Query 2 OIDs: → ifHCInUcastPkts.1 ifHCInUcastPkts.2
ifHCInUcastPkts.3 ifHCInUcastPkts.4

They are fetching the same variable of all interfaces before
moving to the next variable. So to serve their 1 snmp request, the router has to get
ALL data for ALL interfaces on each query.

Out of ALL data fetched from the kernel in 1 snmp request they
use just 1 variable; then later they send a new request which results in
fetching ALL data for ALL interfaces again. This is highly unoptimized polling.

—snip—
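
For anyone following along, the two polling patterns JTAC describes can be sketched in a few lines of Python. This is only an illustration of the grouping, not how Check_MK builds its queries; the function names are made up for this example, and the column names are the four from the JTAC reply.

```python
# Column names as given in the JTAC example above.
COLUMNS = [
    "ifHCInOctets",
    "ifHCInUcastPkts",
    "ifHCInMulticastPkts",
    "ifHCInBroadcastPkts",
]


def column_queries(columns, indexes):
    """Column-based: one query per variable, covering every interface."""
    return [[f"{col}.{idx}" for idx in indexes] for col in columns]


def row_queries(columns, indexes):
    """Row-based: one query per interface index, covering every variable."""
    return [[f"{col}.{idx}" for col in columns] for idx in indexes]


if __name__ == "__main__":
    # Prints the row-by-row grouping JTAC recommends for indexes 1 and 2.
    for n, oids in enumerate(row_queries(COLUMNS, [1, 2]), start=1):
        print(f"Query {n} OIDs:", " ".join(oids))
```

Each inner list is the set of OIDs that would go into a single request. Net-SNMP's snmpget accepts several OIDs in one invocation, so a row-based poll could be tried by hand that way, assuming the IF-MIB names resolve on the monitoring server.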

They are essentially saying we need to change the way we are polling, which is done via check_mk/nagios. I’m not even sure where to start looking, or whether there is any function within check_mk which would help us.

Appreciate any help in this matter.

Kind Regards,

William


On 9 June 2015 at 13:22, Gary Herbstman garyh@bytesolutions.com wrote:

https://www.google.com/search?q=ex2200+snmp&rls=com.microsoft:en-US&ie=UTF-8&oe=UTF-8&startIndex=&startPage=1&gws_rd=ssl#q=ex2200+snmp+timeout

From: William [mailto:willay@gmail.com]
Sent: Tuesday, June 09, 2015 02:21

To: Gary Herbstman
Cc: Andreas Döhler; checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Check_MK 0 values and long execution time for stacked Juniper EX2200 switches

Hi Gary,

Can you please share the search terms/links you’ve found related to version update and snmp issues? Are you talking about switch or check_mk?

FYI, we are running JUNOS 12.3R9.4

Kind Regards,

William

On 8 June 2015 at 23:16, Gary Herbstman garyh@bytesolutions.com wrote:

It sounds like that switch may not be up to the task when it comes to SNMP.

A quick search shows others having issues and a version update that addresses several known snmp issues.

From: William [mailto:willay@gmail.com]
Sent: Monday, June 08, 2015 17:24
To: Gary Herbstman
Cc: Andreas Döhler; checkmk-en@lists.mathias-kettner.de

Subject: Re: [Check_mk (english)] Check_MK 0 values and long execution time for stacked Juniper EX2200 switches

Hi Gary,

The hosts we are trying to contact are one hop away on the local network. At the moment I’m only having issues with EX2200 stacks (up to 4 members); we have a bunch of ex4200/500 that don’t have any issues. I have an EX2200 which isn’t stacked, with the same simple SNMP config, and it responds within the time allowed. However, I put it on 5-minute intervals because it was reporting high CPU usage on the 1-minute interval.

Kind Regards,

William

On 8 June 2015 at 22:08, Gary Herbstman garyh@bytesolutions.com wrote:

We have also seen timeout issues with a large 48 x 4 stack. Eventually we added a local monitoring server, which took care of the issue. Trying to monitor and inventory over a pretty fast WAN was not working well.

From: checkmk-en-bounces@lists.mathias-kettner.de [mailto:checkmk-en-bounces@lists.mathias-kettner.de] On Behalf Of William
Sent: Monday, June 08, 2015 16:07
To: Andreas Döhler
Cc: checkmk-en@lists.mathias-kettner.de
Subject: Re: [Check_mk (english)] Check_MK 0 values and long execution time for stacked Juniper EX2200 switches

Hi Andreas,

Thanks for the reply.

Inventory: from watching the console, the if64 inventory seems to take the longest. These entries were the slowest, to name a few:

Running snmpbulkwalk -v2c -c 'removed' -m '' -M '' -Cc -OQ -OU -On -Ot 1.2.3.4 .1.3.6.1.2.1.2.2.1.19

Running snmpbulkwalk -v2c -c 'removed' -m '' -M '' -Cc -OQ -OU -On -Ot 1.2.3.4 .1.3.6.1.2.1.2.2.1.20

Running snmpbulkwalk -v2c -c 'removed' -m '' -M '' -Cc -OQ -OU -On -Ot 1.2.3.4 .1.3.6.1.2.1.2.2.1.21

Running snmpbulkwalk -v2c -c 'removed' -m '' -M '' -Cc -OQ -OU -On -Ot 1.2.3.4 .1.3.6.1.2.1.31.1.1.1.18

Total running time was 2m11s
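
To see how that total splits across the individual walks, each one can be run and timed on its own. Below is a rough Python sketch under a few assumptions: snmpbulkwalk is on the PATH, the flags are copied from the lines above, and the helper names (build_walk, time_walk) are invented for this example.

```python
import subprocess
import sys
import time

# OIDs copied from the snmpbulkwalk lines quoted above.
OIDS = [
    ".1.3.6.1.2.1.2.2.1.19",
    ".1.3.6.1.2.1.2.2.1.20",
    ".1.3.6.1.2.1.2.2.1.21",
    ".1.3.6.1.2.1.31.1.1.1.18",
]


def build_walk(host, community, oid):
    """Return the argument list for one walk, mirroring the flags cmk printed."""
    return ["snmpbulkwalk", "-v2c", "-c", community, "-m", "", "-M", "",
            "-Cc", "-OQ", "-OU", "-On", "-Ot", host, oid]


def time_walk(argv, runner=subprocess.run, clock=time.monotonic):
    """Run one walk, discard its output, and return the elapsed seconds."""
    start = clock()
    runner(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return clock() - start


# Only runs when called with exactly two arguments: host and community.
if __name__ == "__main__" and len(sys.argv) == 3:
    host, community = sys.argv[1], sys.argv[2]
    for oid in OIDS:
        print(f"{oid}: {time_walk(build_walk(host, community, oid)):.1f}s")
```

Saved under a name of your choosing (say walktime.py, a placeholder), it would be run as `python3 walktime.py 1.2.3.4 removed`. Comparing the per-OID times between a stacked and a standalone EX2200 should show whether particular columns are slow or every walk is.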

Check (-n): it looks like the if64 parts were again taking the longest; sadly I haven’t worked out how to timestamp each line. Also, interfaces are being reported with 0 traffic, which is not the case; there is plenty of background noise:

Interface ae0 OK - [581] (up) MAC: 44:f4:77:ad:30:43, 2.00GBit/s, in: 0.00B/s, out: 0.00B/s

Interface ae0.0 OK - [582] (up) MAC: 44:f4:77:ad:30:43, 2.00GBit/s, in: 0.00B/s, out: 0.00B/s

Interface bme0 OK - [37] (up) MAC: 00:0b:ca:fe:00:01, speed unknown, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/0 OK - [stuff] (up) MAC: 84:b5:9c:8a:68:03, 1GBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/1 OK - [503] (up) MAC: 84:b5:9c:8a:68:04, 100MBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/10 OK - [522] (up) MAC: 84:b5:9c:8a:68:0d, 1GBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/11 OK - [524] (up) MAC: 84:b5:9c:8a:68:0e, 10MBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/12 OK - [526] (up) MAC: 84:b5:9c:8a:68:0f, 1GBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/13 OK - [528] (down) MAC: 84:b5:9c:8a:68:10, speed unknown

Interface ge-0/0/14 OK - [530] (up) MAC: 84:b5:9c:8a:68:11, 1GBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/15 OK - [532] (up) MAC: 84:b5:9c:8a:68:12, 1GBit/s, in: 0.00B/s, out: 0.00B/s

Interface ge-0/0/16 OK - [534] (up) MAC: 84:b5:9c:8a:68:13, 10MBit/s, in: 0.00B/s, out: 0.00B/s

OK - execution time 121.7 sec|execution_time=121.667 user_time=0.730 system_time=0.080 children_user_time=0.030 children_system_time=0.050

The stack I queried is 2x 48p.
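
On timestamping each line of the output: one option is a small wrapper that runs the command and prefixes every line with the time since start and since the previous line. This is a sketch only. The function names and the file name stamp.py are placeholders, and if the wrapped command buffers its output when writing to a pipe, the timestamps will be coarser than the real delays.

```python
import subprocess
import sys
import time


def stamp_lines(lines, clock=time.monotonic):
    """Yield each line prefixed with seconds since start and since the previous line."""
    start = previous = clock()
    for line in lines:
        now = clock()
        yield f"[{now - start:8.2f}s +{now - previous:7.2f}s] {line.rstrip()}"
        previous = now


def run_stamped(argv):
    """Run a command and print its combined stdout/stderr with timestamps."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for stamped in stamp_lines(proc.stdout):
        print(stamped, flush=True)
    return proc.wait()


# Only runs when a command to wrap is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_stamped(sys.argv[1:]))
```

Usage would be along the lines of `python3 stamp.py cmk --debug -vv -n hostname`. A large value in the second column marks the line that arrived after a long wait, so the walk printed just before it is the slow one.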

Where do I go from here? It appears to be taking more than 60 seconds and returning incorrect results.

Thanks for your time.

Kind Regards,

William

On 8 June 2015 at 17:52, Andreas Döhler andreas.doehler@gmail.com wrote:

Hi William,

The first step should be to run both the inventory and the check from the command line, to see which part takes so long.

cmk --debug -vv -II hostname
and
cmk --debug -vv -n hostname

Now you can see where the switch takes so long to answer. Besides this, you have all the executed snmpwalk commands, so you can check the individual steps yourself.

Best regards
Andreas

William willay@gmail.com wrote on Mon, 8 June 2015 at 10:56:

Morning list,

I’m running OMD with Check_MK 1.2.4p5, monitoring over 100 hosts without an issue until last week, when we brought up 3 pairs of stacked Juniper EX2200 switches. When monitoring these switches using check_mk I suffer from timeouts, bandwidth values being returned as 0, and very long inventory times (approx. 260 seconds) when doing a Live Scan.

I have standalone Juniper EX2200s which have no issues with being monitored, along with various bits of network equipment.

I have already gone down the route of contacting Juniper; they are unable to find anything wrong with my switch configuration, and nothing has come up when checking their bug lists. The stacks vary from 2 to 4 members. The smallest stacks consist of roughly 96 ports; however, I have other switch stacks which do not experience the same issue (2x Juniper 4200 48p, for example).

What would be the best way to troubleshoot this issue from my check_mk server? I’d appreciate any help on the matter.

Kind Regards,

William


checkmk-en mailing list
checkmk-en@lists.mathias-kettner.de
http://lists.mathias-kettner.de/mailman/listinfo/checkmk-en
