New service check timeout issues after upgrade to 2.1? SNMP issue?

CMK version: 2.1.0p9.cre
**OS version:** CentOS 7.9

Error message:
Random network devices (SNMP only) report Check_MK Discovery → "Service check timed out".
Some are Cisco, some are Juniper, so it doesn't make sense to me; everything worked fine on the prior 2.0 release.

Output of "cmk --debug -vvn hostname": (if it is a problem with checks or plugins)
cmk --debug -vvn test-router
Checkmk version 2.1.0p9
Try license usage history update.
Trying to acquire lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run
Got lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run
Trying to acquire lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Got lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Next run time has not been reached yet. Abort.
Releasing lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Released lock on /omd/sites/netwk_stage/var/check_mk/license_usage/history.json
Releasing lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run
Released lock on /omd/sites/netwk_stage/var/check_mk/license_usage/next_run

  • FETCHING DATA
    Source: SourceType.HOST/FetcherType.SNMP
    [cpu_tracking] Start [7f27f7bd42b0]
    [SNMPFetcher] Fetch with cache settings: SNMPFileCache(test-router, base_path=/omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=False, use_outdated=False, simulation=False)
    Not using cache (Too old. Age is 26 sec, allowed is 0 sec)
    [SNMPFetcher] Execute data source
    juniper_cpu_util: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.5'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.8'
    juniper_bgp_state: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.5.1.1.2.1.1.1.2'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.5.1.1.2.1.1.1.3'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.5.1.1.2.1.1.1.11'
    juniper_fru: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.15.1.5'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.15.1.6'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.15.1.8'
    juniper_mem: Fetching data (SNMP walk cache is enabled: Use any locally cached information)
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.5.9'
    Running 'snmpwalk -v1 -c dida^uwh5r5 -m "" -M "" -Cc -OQ -OU -On -Ot 192.168.0.2 .1.3.6.1.4.1.2636.3.1.13.1.11.9'
    Write data to cache file /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Trying to acquire lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Got lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Releasing lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    Released lock on /omd/sites/netwk_stage/tmp/check_mk/data_source_cache/snmp/checking/test-router
    [cpu_tracking] Stop [7f27f7bd42b0 - Snapshot(process=posix.times_result(user=0.020000000000000018, system=0.04999999999999999, children_user=0.01, children_system=0.07, elapsed=4.129999999888241))]
    Source: SourceType.HOST/FetcherType.PIGGYBACK
    [cpu_tracking] Start [7f27f7bd4820]
    [PiggybackFetcher] Fetch with cache settings: NoCache(test-router, base_path=/omd/sites/netwk_stage/tmp/check_mk/data_source_cache/piggyback, max_age=MaxAge(checking=0, discovery=120, inventory=120), disabled=True, use_outdated=False, simulation=False)
    Not using cache (Cache usage disabled)
    [PiggybackFetcher] Execute data source
    No piggyback files for 'test-router'. Skip processing.
    No piggyback files for '192.168.0.2'. Skip processing.
    Not using cache (Cache usage disabled)
    [cpu_tracking] Stop [7f27f7bd4820 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
  • PARSE FETCHER RESULTS
    Source: SourceType.HOST/FetcherType.SNMP
    No persisted sections
    → Add sections: ['juniper_bgp_state', 'juniper_cpu_util', 'juniper_fru', 'juniper_mem']
    Source: SourceType.HOST/FetcherType.PIGGYBACK
    No persisted sections
    → Add sections: []
    Received no piggyback data
    Received no piggyback data
    [cpu_tracking] Start [7f27f7b7d040]
    value store: synchronizing
    Trying to acquire lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    Got lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    value store: loading from disk
    Releasing lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    Released lock on /omd/sites/netwk_stage/tmp/check_mk/counters/test-router
    BGP Status Peer 192.168.0.1 Status with peer 192.168.0.1 is established, operational status: running
    BGP Status Peer 192.168.251.2 Status with peer 192.168.251.2 is established, operational status: running
    BGP Status Peer 192.168.251.253 Status with peer 192.168.251.253 is established, operational status: running
    CPU utilization Routing Engine Total CPU: 11.00%
    Fan FRU SRX240 CPU fan 1 Operational status: online
    Fan FRU SRX240 CPU fan 2 Operational status: online
    Fan FRU SRX240 IO fan 1 Operational status: online
    Fan FRU SRX240 IO fan 2 Operational status: online
    Fan FRU SRX240 PowerSupply fan 1 Operational status: online
    Fan FRU SRX240 PowerSupply fan 2 Operational status: online
    Memory Utilization Routing Engine 45% used
    Power Supply FRU PEM 0 Operational status: online
    No piggyback files for 'test-router'. Skip processing.
    No piggyback files for '192.168.0.2'. Skip processing.
    [cpu_tracking] Stop [7f27f7b7d040 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
    [snmp] Success, execution time 4.1 sec | execution_time=4.130 user_time=0.020 system_time=0.050 children_user_time=0.010 children_system_time=0.070 cmk_time_snmp=3.980 cmk_time_agent

You can run "cmk -vvI hostname" to see how long the discovery takes on this host.
I had some devices with the same problem after the 2.1 upgrade, as there are some more SNMP checks now that the discovery needs to test. This led to an increase in the time needed for discovery.
In the end I deactivated some SNMP sections in the Setup to prevent those sections from being used in the discovery.
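A minimal sketch of timing such a discovery run (host and site names are the ones from this thread; `cmk` is only on the PATH inside the OMD site, so switch to the site user first):

```shell
# Sketch: measure how long a full discovery run takes on one host.
# On a live site you would run, as the site user:
#   su - netwk_stage
#   time cmk --debug -vvI test-router
# A portable way to time a long-running command from a plain sh script:
start=$(date +%s)
sleep 1   # stand-in here for: cmk --debug -vvI test-router
end=$(date +%s)
echo "discovery took $((end - start)) seconds"
```

If the elapsed time gets close to the configured service check timeout, that explains the "Service check timed out" results.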

Hi Andreas,

the problem is happening on both Cisco and Juniper devices, which is stumping me.
In the examples below, latency from the monitoring node to the target devices is 38 ms for both…

on Juniper device:
[snmp] Success, execution time 5.2 sec | execution_time=5.170 user_time=0.050 system_time=0.050 children_user_time=0.020 children_system_time=0.110 cmk_time_snmp=4.940 cmk_time_agent=0.000

on Cisco device:
[snmp] Success, execution time 4.0 sec | execution_time=4.000 user_time=0.080 system_time=0.050 children_user_time=0.020 children_system_time=0.090 cmk_time_snmp=3.760 cmk_time_agent=0.000
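For comparing devices side by side, the `cmk_time_snmp` metric can be pulled out of those summary lines with a one-liner (the lines below are abridged copies of the two outputs above):

```shell
# Extract cmk_time_snmp from "cmk -vvn" summary lines so the Juniper and
# Cisco SNMP fetch times can be compared directly.
printf '%s\n' \
  'juniper: [snmp] Success, execution time 5.2 sec | cmk_time_snmp=4.940 cmk_time_agent=0.000' \
  'cisco: [snmp] Success, execution time 4.0 sec | cmk_time_snmp=3.760 cmk_time_agent=0.000' |
awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^cmk_time_snmp=/) { split($i, a, "="); print $1, a[2] } }'
# prints:
#   juniper: 4.940
#   cisco: 3.760
```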

Rediscoveries shouldn't have to happen frequently, and once I've added the service checks that are relevant I don't necessarily need rescans/rediscoveries to run and flag a failure due to timeout… hmm

The normal execution time is not so important; you need to check the time needed for a discovery run.
In your example, is this the discovery time? If yes, then it would be OK and should be no problem.
For the discovery service the normal interval is set to something between 4 and 6 hours, but you can also configure it to run only every 2 days.
To configure this service correctly it is important to know how long such a discovery run takes.
In the CRE you have the problem that you can only configure the timeout for service checks globally.
To avoid touching that global setting, I would recommend disabling the SNMP sections that are not needed on these devices but are still tried by the discovery.
If you measure the discovery time on the command line, you see all the sections it tests. Take a look at the list of tested sections and disable the ones you don't need.
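One way to get that list is to filter the verbose discovery output for the "<section>: Fetching data" lines, as they appear in the log earlier in this thread. A sketch, demonstrated on a captured excerpt (on a live site you would pipe `cmk --debug -vvI test-router` into the same filter):

```shell
# Sketch: list the unique SNMP sections the discovery tries, based on the
# "<section>: Fetching data" log format shown above.
# Live version (as the site user):
#   cmk --debug -vvI test-router 2>&1 | grep ': Fetching data' | awk -F: '{print $1}' | sort -u
printf '%s\n' \
  'juniper_cpu_util: Fetching data (SNMP walk cache is enabled)' \
  'juniper_bgp_state: Fetching data (SNMP walk cache is enabled)' \
  'juniper_cpu_util: Fetching data (SNMP walk cache is enabled)' |
grep ': Fetching data' | awk -F: '{print $1}' | sort -u
# prints:
#   juniper_bgp_state
#   juniper_cpu_util
```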

Hi Andreas,
so here's what I'm trying:
from Setup → Agents → SNMP rules → "Disabled or enabled sections (SNMP)",
I create a rule and proceed to disable everything that is not relevant to my environment,
i.e. all vendor equipment that is not present.
I leave the vendor-specific sections that apply to my gear.

Should that be enough to shrink the Check_MK SNMP discovery time?

Confirming that this ended up working for me. Thanks again!