New service check timeout issues after upgrade to 2.1? SNMP issue?

CMK version: 2.1.0p9.cre
**OS version:** CentOS 7.9

Error message:
Random network devices (SNMP only) report Check_MK Discovery → "Service check timed out".
Some are Cisco, some are Juniper, so it doesn't make sense to me; everything worked fine on the prior 2.0 release.

Output of "cmk --debug -vvn hostname": (if it is a problem with checks or plugins)
cmk --debug -vvn test-router
Checkmk version 2.1.0p9
Try license usage history update.
Trying to acquire lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run
Got lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Got lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Released lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run
Released lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run

  • FETCHING DATA
    Source: SourceType.HOST/FetcherType.SNMP
    [cpu_tracking] Start [7f27f7bd42b0]
    [SNMPFetcher] Fetch with cache settings: SNMPFileCache(test-router, base_path=/omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
    Not using cache (Too old. Age is 26 sec, allowed is 0 sec)
    [SNMPFetcher] Execute data source
    juniper_cpu_util: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.5'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.8'
    juniper_bgp_state: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.5.1.1.2.1.1.1.2'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.5.1.1.2.1.1.1.3'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.5.1.1.2.1.1.1.11'
    juniper_fru: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.15.1.5'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.15.1.6'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.15.1.8'
    juniper_mem: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.5.9'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.11.9'
    Write data to cache file /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Trying to acquire lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Got lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Releasing lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Released lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    [cpu_tracking] Stop [7f27f7bd42b0 - Snapshot(process=posix.times_result(user=0.020000000000000018, system=0.04999999999999999, children_user=0.01, children_system=0.07, elapsed=4.129999999888241))]
    Source: SourceType.HOST/FetcherType.PIGGYBACK
    [cpu_tracking] Start [7f27f7bd4820]
    [PiggybackFetcher] Fetch with cache settings: NoCache(test-router, base_path=/omd/sites/netwk_stage/tmp/check_mk/data_source_cache/piggyback, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=True, use_outdated=False, simulation=False)
    Not using cache (Cache usage disabled)
    [PiggybackFetcher] Execute data source
    No piggyback files for 'test-router'. Skip processing.
    No piggyback files for '192.168.0.2'. Skip processing.
    Not using cache (Cache usage disabled)
    [cpu_tracking] Stop [7f27f7bd4820 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
  • PARSE FETCHER RESULTS
    Source: SourceType.HOST/FetcherType.SNMP
    No persisted sections
    → Add sections: ['juniper_bgp_state', 'juniper_cpu_util', 'juniper_fru', 'juniper_mem']
    Source: SourceType.HOST/FetcherType.PIGGYBACK
    No persisted sections
    → Add sections: []
    Received no piggyback data
    Received no piggyback data
    [cpu_tracking] Start [7f27f7b7d040]
    value store: synchronizing
    Trying to acquire lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    Got lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    value store: loading from disk
    Releasing lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    Released lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    BGP Status Peer 192.168.0.1 Status with peer 192.168.0.1 is established, operational status: running
    BGP Status Peer 192.168.251.2 Status with peer 192.168.251.2 is established, operational status: running
    BGP Status Peer 192.168.251.253 Status with peer 192.168.251.253 is established, operational status: running
    CPU utilization Routing Engine Total CPU: 11.00%
    Fan FRU SRX240 CPU fan 1 Operational status: online
    Fan FRU SRX240 CPU fan 2 Operational status: online
    Fan FRU SRX240 IO fan 1 Operational status: online
    Fan FRU SRX240 IO fan 2 Operational status: online
    Fan FRU SRX240 PowerSupply fan 1 Operational status: online
    Fan FRU SRX240 PowerSupply fan 2 Operational status: online
    Memory Utilization Routing Engine 45% used
    Power Supply FRU PEM 0 Operational status: online
    No piggyback files for 'test-router'. Skip processing.
    No piggyback files for '192.168.0.2'. Skip processing.
    [cpu_tracking] Stop [7f27f7b7d040 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
    [snmp] Success, execution time 4.1 sec | execution_time=4.130 user_time=0.020 system_time=0.050 children_user_time=0.010 children_system_time=0.070 cmk_time_snmp=3.980 cmk_time_agent

You can run "cmk -vvI hostname" to see how long the discovery takes on this host.
I had some devices with the same problem after the 2.1 upgrade, as there are some more SNMP checks now that the discovery needs to test. This led to an increase in the time needed for discovery.
In the end I deactivated some SNMP sections in the Setup to prevent those sections from being used in the discovery.
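A minimal sketch of timing such a discovery run (host and site names are the ones from this thread; `cmk` is only on the PATH inside the OMD site, so switch to the site user first):

```shell
# Sketch: measure how long a full discovery run takes on one host.
# On a live site you would run, as the site user:
#   su - netwk_stage
#   time cmk --debug -vvI test-router
# A portable way to time a long-running command from a plain sh script:
start=$(date +%s)
sleep 1   # stand-in here for: cmk --debug -vvI test-router
end=$(date +%s)
echo "discovery took $((end - start)) seconds"
```

If the elapsed time gets close to the configured service check timeout, that explains the "Service check timed out" results.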

Hi Andreas,

the problem is happening on both Cisco and Juniper devices, which is stumping me.
In the examples below, latency from the monitoring node to the target devices is 38 ms for both…

on Juniper device:
[snmp] Success, execution time 5.2 sec | execution_time=5.170 user_time=0.050 system_time=0.050 children_user_time=0.020 children_system_time=0.110 cmk_time_snmp=4.940 cmk_time_agent=0.000

on Cisco device:
[snmp] Success, execution time 4.0 sec | execution_time=4.000 user_time=0.080 system_time=0.050 children_user_time=0.020 children_system_time=0.090 cmk_time_snmp=3.760 cmk_time_agent=0.000
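For comparing devices side by side, the `cmk_time_snmp` metric can be pulled out of those summary lines with a one-liner (the lines below are abridged copies of the two outputs above):

```shell
# Extract cmk_time_snmp from "cmk -vvn" summary lines so the Juniper and
# Cisco SNMP fetch times can be compared directly.
printf '%s\n' \
  'juniper: [snmp] Success, execution time 5.2 sec | cmk_time_snmp=4.940 cmk_time_agent=0.000' \
  'cisco: [snmp] Success, execution time 4.0 sec | cmk_time_snmp=3.760 cmk_time_agent=0.000' |
awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^cmk_time_snmp=/) { split($i, a, "="); print $1, a[2] } }'
# prints:
#   juniper: 4.940
#   cisco: 3.760
```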

Rediscoveries shouldn't have to happen frequently, and once I've added the service checks that are relevant I don't necessarily need rescans/rediscoveries to run and flag a failure due to timeout… hmm

The normal execution time is not so important; you need to check the time needed for a discovery run.
In your example, is this the discovery time? If yes, then it would be OK and should be no problem.
For the discovery service the normal interval is set to something between 4 and 6 hours, but you can also configure it to run only every 2 days.
To configure this service correctly it is important to know how long such a discovery run takes.
In the CRE you have the problem that you can only configure the timeout for service checks globally.
To avoid touching that global setting, I would recommend disabling the SNMP sections that are not needed on these devices but are still tried by the discovery.
If you measure the discovery time on the command line, you see all the sections it tests. Take a look at the list of tested sections and disable the ones you don't need.
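One way to get that list is to filter the verbose discovery output for the "<section>: Fetching data" lines, as they appear in the log earlier in this thread. A sketch, demonstrated on a captured excerpt (on a live site you would pipe `cmk --debug -vvI test-router` into the same filter):

```shell
# Sketch: list the unique SNMP sections the discovery tries, based on the
# "<section>: Fetching data" log format shown above.
# Live version (as the site user):
#   cmk --debug -vvI test-router 2>&1 | grep ': Fetching data' | awk -F: '{print $1}' | sort -u
printf '%s\n' \
  'juniper_cpu_util: Fetching data (SNMP walk cache is enabled)' \
  'juniper_bgp_state: Fetching data (SNMP walk cache is enabled)' \
  'juniper_cpu_util: Fetching data (SNMP walk cache is enabled)' |
grep ': Fetching data' | awk -F: '{print $1}' | sort -u
# prints:
#   juniper_bgp_state
#   juniper_cpu_util
```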

Hi Andreas,
so here's what I'm trying:
from Setup → Agents → SNMP rules → "Disabled or enabled sections (SNMP)",
I create a rule and proceed to disable everything that is not relevant to my environment,
i.e. all vendor equipment that is not present.
I leave the vendor-specific sections that apply to my gear.

Should that be enough to shrink the Check_MK SNMP discovery time?

Confirming that this ended up working for me. Thanks again!