JunOS SNMP performance hints

This is a follow-up to threads like this one:

The core problem is that JunOS uses a different SNMP daemon than most operating systems, and it handles bulk-getting of the tables very badly. This can lead to high CPU load and very slow processing of the requests made by Checkmk, and the issue scales with the number of active interfaces.
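You can usually see this on the device itself while a poll is running: on JunOS the SNMP work is mostly done by the snmpd and mib2d daemons, so checking their CPU usage tells you whether this is your problem (the prompt is a placeholder):

user@switch> show system processes extensive | match "snmpd|mib2d"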

Here’s some advice on how to create filters for the SNMP tables that will lessen the pain.

  • disabling services on the Checkmk server has no effect (the full tables still get walked)
  • you can use filter-interfaces on the switch to simply not report some interfaces
  • you can also use it to not report subinterfaces

That way the number of interfaces, and with it the number of polled OIDs, stops being a scaling problem.

Here’s an example configuration showing a minimal SNMP setup:

xx@sw-da-ug-1# show snmp 
/* https://www.juniper.net/documentation/us/en/software/junos/network-mgmt/topics/topic-map/configuring-basic-snmp.html#id-filtering-interface-information-out-of-snmp-get-and-getnext-output */
filter-interfaces {
    interfaces {
        ipip;
        gre;
        jsrv;
        mtun;
        lsi;
        pimd;
        pime;
        tap;
        fti0;
        dsc;
        vme;
        "(a|t|x|g)e-[0-9]\/[0-9]+\/[0-9]+\.0$";
        gr-0/0/0;
        "irb$";
    }
    all-internal-interfaces;
}
community public {
    authorization read-only;
}

What this is doing:

  • filter out all internal stuff (OK on switches, but you need to modify this if you run BGP or similar things that use some of those interfaces)
  • filter out the subinterfaces/units (.0) on ethernet ports
  • filter out the main L3 interface (irb) but not any subinterfaces on it (irb.0, …)
  • filter out the vme interface, but keep the vme.0 subinterface (please check this in your environment, especially depending on whether you run a Virtual Chassis)

Here’s the same thing in set format for copy-pasting; it only shows the filters.

set snmp filter-interfaces interfaces ipip
set snmp filter-interfaces interfaces gre
set snmp filter-interfaces interfaces jsrv
set snmp filter-interfaces interfaces mtun
set snmp filter-interfaces interfaces lsi
set snmp filter-interfaces interfaces pimd
set snmp filter-interfaces interfaces pime
set snmp filter-interfaces interfaces tap
set snmp filter-interfaces interfaces fti0
set snmp filter-interfaces interfaces dsc
set snmp filter-interfaces interfaces vme
set snmp filter-interfaces interfaces "(a|t|x|g)e-[0-9]\/[0-9]+\/[0-9]+\.0$"
set snmp filter-interfaces interfaces gr-0/0/0
set snmp filter-interfaces interfaces "irb$"
set snmp filter-interfaces all-internal-interfaces
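To verify the filter after the commit, you can walk ifDescr on the box itself and from the Checkmk server and check that the filtered interfaces are gone (community and address are placeholders for your environment):

user@switch> show snmp mib walk ifDescr

# from the Checkmk site, count the remaining entries
snmpbulkwalk -v2c -c public 192.0.2.10 IF-MIB::ifDescr | wc -l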

effect

Maybe this helps some of you. I had it on my TODO list since 2018!
Now I finally got to try it, and it works well for me: the time for a full scan of this EX4300-48P went down from 59s to 31s.
I haven’t tried it against other platforms yet, but removing a whole class of problems is nice.

It is a lot easier than creating custom views, and it probably also hooks into a more efficient spot of the OS’s counter handling.
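If you want comparable before/after numbers for your own devices, timing the interface tables from the Checkmk server is a rough but quick way to get them (address and community are placeholders; cmk --snmpwalk stores a complete walk under var/check_mk/snmpwalks in the site):

# just the two big interface tables
time snmpbulkwalk -v2c -c public 192.0.2.10 IF-MIB::ifTable > /dev/null
time snmpbulkwalk -v2c -c public 192.0.2.10 IF-MIB::ifXTable > /dev/null

# or a full walk, run as the site user
time cmk --snmpwalk sw-da-ug-1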

debugging

You can look at statistics on how you’re polling information from the device. Note that v1/v2c and v3 are handled differently; only v3 has support for custom views per requesting agent.
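The output below is from this operational-mode command (prompt is a placeholder):

user@switch> show snmp statistics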

SNMP statistics:
  Input:
    Packets: 2424567, Bad versions: 0, Bad community names: 2,
    Bad community uses: 0, ASN parse errors: 0,
    Too bigs: 0, No such names: 0, Bad values: 0,
    Read onlys: 0, General errors: 0,
    Total request varbinds: 24002714, Total set varbinds: 0,
    Get requests: 16270, Get nexts: 10726, Set requests: 0,
    Get responses: 0, Traps: 0,
    Silent drops: 0, Proxy drops: 0, Commit pending drops: 0,
    Throttle drops: 0, Duplicate request drops: 0
  V3 Input:
    Unknown security models: 0, Invalid messages: 0
    Unknown pdu handlers: 0, Unavailable contexts: 0
    Unknown contexts: 0, Unsupported security levels: 0
    Not in time windows: 0, Unknown user names: 0
    Unknown engine ids: 0, Wrong digests: 0, Decryption errors: 0
  Output:
    Packets: 2424565, Too bigs: 0, No such names: 0,
    Bad values: 0, General errors: 0,
    Get requests: 0, Get nexts: 0, Set requests: 0,
    Get responses: 2424565, Traps: 0
  Performance:
    Average response time(ms): 55.94
Number of requests dispatched to subagents in last:
      1 minute:0, 5 minutes:535, 15 minutes:929
Number of responses dispatched to NMS in last:
      1 minute:0, 5 minutes:535, 15 minutes:929

You can find more debugging help hidden away at the very end of the JunOS SNMP FAQ.
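If the statistics are not enough, you can also enable tracing for the SNMP process itself; a minimal sketch (file name and sizes are arbitrary, and don’t forget to delete the traceoptions again once you’re done, they are not free):

set snmp traceoptions file snmpd-trace size 1m files 3
set snmp traceoptions flag all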

limiting factors

Further tuning would be done on the Checkmk server side by modifying the SNMP polling settings or disabling bulk requests.
It would be best to measure the impact, but that would need to be done on a device with a single-core CPU (EX2200-C, EX2300) to really find the optimum.
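Before touching the Checkmk rulesets you can get a feel for what different bulk sizes do with plain net-snmp, which lets you set the max-repetitions value per walk (address and community are placeholders):

# -Cr<N> sets max-repetitions for the GETBULK requests
for r in 1 5 10 25 50; do
    echo "max-repetitions: $r"
    time snmpbulkwalk -v2c -c public -Cr$r 192.0.2.10 IF-MIB::ifXTable > /dev/null
done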

If someone has input, just add or edit it.

On some vendors (e.g. Cisco, Mellanox) you can also influence the interval at which the counters in the switch hardware are polled. I’m not sure whether that is possible on JunOS, but IF it is possible and doesn’t mess with your accounting, you should definitely do that.

Most of the time the copy of the hardware registers into the OS/SNMP side is triggered when the OID is polled, especially if the system falls behind on the SNMP requests. If the data is found to be outdated, it is fetched on the spot, and that is what slows down the SNMP response so much. So you always want the interval for the hardware counter polls/updates to be higher than the runtime of the Check_MK service on the device.

But with JunOS the problem revolves largely around the fetch implementation:
it’s slow if you fetch table after table, and fast if you fetch all entries from all tables for one interface, then the same for the next interface, and the next.
Juniper says they can generally handle 130 single-OID requests per second; assuming 5 relevant tables per interface for the more transient data, that makes 26 interfaces per second, meaning a 48-port switch should finish in 3-4 seconds at most - but you’ll be looking at more like 30-60 seconds in the standard setup.
As long as neither cmk nor JunOS accommodates the other, it’ll stay as it is.
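You can see that difference yourself by comparing a table-by-table walk with a single GET that asks for everything about one interface at once (the ifIndex 526 and the OID selection are just examples; address and community are placeholders):

# column-wise: one counter table after the other (roughly what a bulk walk does)
time snmpbulkwalk -v2c -c public 192.0.2.10 IF-MIB::ifHCInOctets > /dev/null
time snmpbulkwalk -v2c -c public 192.0.2.10 IF-MIB::ifHCOutOctets > /dev/null

# row-wise: all values for one interface in a single request (what JunOS prefers)
snmpget -v2c -c public 192.0.2.10 \
    IF-MIB::ifHCInOctets.526 IF-MIB::ifHCOutOctets.526 \
    IF-MIB::ifOperStatus.526 IF-MIB::ifHighSpeed.526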

supporting rules

Since I had one more timeout on a switch, I added the following rules:

1. Timing settings for SNMP access
  • Condition: Host: Hersteller (“manufacturer”, a custom host tag here) is Juniper
  • Value: Response timeout for a single query: 3.00 sec
2. Legacy SNMP devices using SNMP v2c
  • Condition: Host: Hersteller is Juniper
  • Value: Positive match (Add matching hosts to the set)
3. Bulk walk: Hosts using bulk walk

A modification here to exclude JunOS (or simply Juniper, since it’s all JunOS) from the default bulk walk:

  • Condition: Host: SNMP is not SNMP v1
  • Condition: Host has tag Monitor via SNMP
  • Condition: Host: Hersteller is not Juniper
  • Value: Positive match (Add matching hosts to the set) - hosts with the tag “snmp-v1” must not use bulk walk

Btw, the inline help still states this isn’t the default, which must go way, way back.

Result

The combination works well, and there haven’t been any more issues.

Core produced a notification 170 m SERVICE NOTIFICATION FLAPPINGSTOP (OK) [snmp] Success, execution time 15.7 sec
Flapping 170 m SERVICE FLAPPING ALERT STOPPED Service appears to have stopped flapping (3.9% change < 5.0% threshold)
Service Alert 230 m SERVICE ALERT SOFT (OK) [snmp] Success, execution time 13.9 sec
Service Alert 240 m SERVICE ALERT SOFT (CRITICAL) [snmp] SNMP Error on 192.168.xx.xx: Timeout: No Response from 192.168.xxx.xx (Exit-Code: 1)CRIT, Got no information from host, execution time 16.9 sec

This is after combining all of the above. And yes, the SNMP-based Check_MK service is faster this way (15s agent run time where it was 21s+).

tl;dr:

general rule

  1. The longer the SNMP responses take, the more likely it is that a counter register needs to be polled
  2. the more likely that is, the more likely the SNMP agent response will block
  3. the more that slows down the responses, the more likely the issue will happen again
  4. for crappy OSes limited on resources, you need to hand-tune the number of OIDs per bulk (Cisco SG series)
  5. for crappy enterprise OSes or very large devices you need to build SNMP v3 views that only expose what you’re going to monitor (Huawei etc.); see the sketch after this list
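For point 5, here’s a minimal sketch of such a view on JunOS, attached to a v2c community for brevity (names are placeholders; the system group is included because Checkmk’s scan needs sysDescr/sysObjectID, and as noted above, per-requester views need v3 on top of this):

set snmp view ifmib-only oid 1.3.6.1.2.1.1 include
set snmp view ifmib-only oid 1.3.6.1.2.1.2 include
set snmp view ifmib-only oid 1.3.6.1.2.1.31 include
set snmp community monitoring view ifmib-only
set snmp community monitoring authorization read-only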

junos exception

  1. checkmk likes table1, table2, table3 and processes them server-side
  2. junos likes table1.interface1, table2.interface1, table3.interface1
  3. apply filters so you can still go home early instead of looking at timeouts
  4. test whether your agent runtime goes down or up with bulk* requests

Thanks for this, it’s valuable information; I too have struggled with Juniper and SNMP requests. A great write-up!

Great write-up @darkfader, very much appreciated!
If it works for affected installations (I do not have JunOS devices available), all fine.
Another approach could be disabling bulk walks for those hosts via the ruleset “disable bulk walks on snmpv2c/v3” (positive match) or “enable snmpv2c and bulk walk for hosts” (negative match). It doesn’t have to bring the same time boost, but if the bulk walk is the issue… :wink:
