Check_MK service goes CRIT for short periods of time; it affects only SNMP devices, at random

CMK version: Enterprise Edition 2.3.0p33
OS version: Rocky Linux release 8.9 (Green Obsidian)

Error message: [snmp] Fetcher for host "NS30" timed out after 120 seconds (…)

Output of “cmk --debug -vvn hostname”:

Below is a trace of the command, run at the exact moment the problem appeared. I ran it several times while the problem was visible in the web interface, but the trace shows no error:

OMD[site5]:~$  cmk --debug -vvn ns30
value store: synchronizing
Trying to acquire lock on /omd/sites/site5/tmp/check_mk/counters/ns30
Got lock on /omd/sites/site5/tmp/check_mk/counters/ns30
value store: loading from disk
Releasing lock on /omd/sites/site5/tmp/check_mk/counters/ns30
Released lock on /omd/sites/site5/tmp/check_mk/counters/ns30
Checkmk version 2.3.0p33
Updating IPv4 DNS cache for ns30: ***********
Trying to acquire lock on /omd/sites/site5/var/check_mk/ipaddresses.cache
Got lock on /omd/sites/site5/var/check_mk/ipaddresses.cache
Releasing lock on /omd/sites/site5/var/check_mk/ipaddresses.cache
Released lock on /omd/sites/site5/var/check_mk/ipaddresses.cache

FETCHING DATA
Source: SourceInfo(hostname='ns30', ipaddress=*********, ident='piggyback', fetcher_type=<FetcherType.PIGGYBACK: 4>, source_type=<SourceType.HOST: 1>)
[cpu_tracking] Start [7f7ffecafec0]
Read from cache: NoCache(ns30, path_template=/dev/null, max_age=MaxAge(checking=0.0, discovery=0.0, inventory=0.0), simulation=False, use_only_cache=False, file_cache_mode=1)
No piggyback files for 'ns30'. Skip processing.
No piggyback files for *********. Skip processing.
Get piggybacked data
[cpu_tracking] Stop [7f7ffecafec0 - Snapshot(process=posix.times_result(user=0.0, system=0.0, children_user=0.0, children_system=0.0, elapsed=0.0))]
[cpu_tracking] Start [7f7ffb3c4d10]

PARSE FETCHER RESULTS
HostKey(hostname='ns30', source_type=<SourceType.HOST: 1>)  -> Add sections: 

Received no piggyback data
No piggyback files for 'ns30'. Skip processing.
No piggyback files for **********. Skip processing.
[cpu_tracking] Stop [7f7ffb3c4d10 - Snapshot(process=posix.times_result(user=0.2200000000000002, system=0.07000000000000006, children_user=0.0, children_system=0.0, elapsed=0.2800000002607703))]
[piggyback] Success (but no data found for this host), execution time 0.3 sec | execution_time=0.280 user_time=0.220 system_time=0.070 children_user_time=0.000 children_system_time=0.000 cmk_time_agent=0.000

Additional Information:

The reported Check_MK service error only appears on SNMP-monitored hosts and occurs randomly.

The Checkmk servers have sufficient resources. The number of fetchers and checkers on each server is set to the highest value the configuration allows.

We attempted to resolve the issue by selecting a group of problematic hosts and changing their SNMP monitoring from inline to classic. However, alarms for this service continue to appear on these hosts. The NS30 host shown in this post is one example.

Hi,

Have you checked this blog? :slight_smile:

Normally it's because some devices aren't responding fast enough. :slight_smile:
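
One way to check whether a device is slow to answer is to time an SNMP walk as the site user while the problem is visible. The sketch below is only an illustration under assumptions: `time_cmd` is a hypothetical helper name, and the `snmpwalk` call in the comment assumes net-snmp is installed, SNMP v2c is in use, and the community string is in `$COMMUNITY`. Adjust it to your device.

```shell
# Hypothetical helper: run a command, measure wall-clock seconds,
# and report whether it stayed within a limit.
# Usage: time_cmd LIMIT_SECONDS COMMAND [ARGS...]
time_cmd() {
    limit=$1
    shift
    start=$(date +%s)
    rc=0
    # Discard the command's output; only duration and exit code matter here.
    "$@" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "SLOW ${elapsed}s rc=${rc}"
        return 1
    fi
    echo "OK ${elapsed}s rc=${rc}"
}

# Example call (assumed v2c and community string; not run here):
# time_cmd 120 snmpwalk -v2c -c "$COMMUNITY" ns30 .1.3.6.1.2.1.1
```

If the walk itself takes close to the 120 seconds from the error message, the device or the network path is the more likely bottleneck, rather than the Checkmk server.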

BR

Berni

Hello @bernhard.dolezal

Thanks for your reply. We will check that blog.

Regards