(Service Check Timed Out) in Raw Edition 1.6.0p11 (CentOS 8)

Hi

I have a problem monitoring my network switches via SNMP:
I receive a lot of CRITICAL alerts with (Service Check Timed Out).

For example:

Host switch-bureau04(switch-bureau04)
Service Check_MK
Event OK → CRITICAL
Address 192.168.254.248
Date / Time Wed May 20 14:55:31 CEST 2020
Plugin Output (Service Check Timed Out)
Additional Output
Host Metrics rta=3.383ms;15000.000;30000.000;0; pl=0%;80;100;; rtmax=6.584ms;;;; rtmin=2.215ms;;;;
Service Metrics

And a few seconds later, the recovery:

Host switch-bureau04 (switch-bureau04)
Service Check_MK
Event CRITICAL → OK
Address 192.168.254.248
Date / Time Wed May 20 14:55:39 CEST 2020
Plugin Output OK - [snmp] Success, execution time 4.5 sec
Additional Output
Host Metrics
Service Metrics execution_time=4.482 user_time=0.020 system_time=0.020 children_user_time=0.050 children_system_time=0.070 cmk_time_snmp=4.322 cmk_time_agent=0.000

Which parameters should I change, and in which rule?
I’ve tried changing the “service_check_timeout” value to 120 (instead of 60) in tuning.cfg, but nothing changed.
I also set the SNMP check interval to 5 minutes.
Thanks for your help :slight_smile:

Regards.

Sebastien

It is not a problem with the overall timeout. To me it looks more like an SNMP timing problem.
In WATO you can find the rule “Timing settings for SNMP access”. You should play around a bit with its parameters. I think the timeout for a single query can help. Set it to something like 5 seconds and test.

It doesn’t work.

My settings for the folder “Network”, which contains all switches, from very small (5 ports) to a big stack:

I get the timeout on many switches, small and big.

For the big stack over one week:

Last alert:

Host switch-stack2-2eme(switch-stack2-2eme)
Service Check_MK
Event CRITICAL → OK
Address 192.168.254.242
Date / Time Wed May 20 20:49:56 CEST 2020
Plugin Output OK - [snmp] Success, execution time 31.8 sec
Additional Output
Host Metrics rta=3.950ms;15000.000;30000.000;0; pl=0%;80;100;; rtmax=9.051ms;;;; rtmin=1.404ms;;;;
Service Metrics execution_time=31.751 user_time=0.070 system_time=0.050 children_user_time=0.080 children_system_time=0.130 cmk_time_snmp=31.417 cmk_time_agent=0.000

That is bad, cheap hardware :smiley: if a stack of 3 switches needs 100 seconds to return the whole interface table.
Small switches should not have this problem.
The only important diagram is “Datasource: Time usage by phase”. It shows that your system is only waiting for the SNMP response.
Is this device configured to use bulkwalk?
If all of that is already in place, you don’t have many options left to solve this problem.
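The same conclusion can be read off the Service Metrics line of the last recovery, without the diagram. A minimal Python sketch, assuming the metrics are plain `name=value` pairs separated by spaces as in the notification above (the helper name `parse_metrics` is only illustrative, not part of Checkmk):

```python
def parse_metrics(line):
    """Parse whitespace-separated 'name=value' pairs into a dict of floats."""
    metrics = {}
    for token in line.split():
        name, _, value = token.partition("=")
        metrics[name] = float(value)
    return metrics


# Service Metrics line copied from the switch-stack2-2eme recovery above.
line = (
    "execution_time=31.751 user_time=0.070 system_time=0.050 "
    "children_user_time=0.080 children_system_time=0.130 "
    "cmk_time_snmp=31.417 cmk_time_agent=0.000"
)

metrics = parse_metrics(line)
# Share of the total run time spent waiting for SNMP responses.
snmp_share = metrics["cmk_time_snmp"] / metrics["execution_time"]
print(f"SNMP wait: {snmp_share:.1%} of execution time")
```

For this line, almost the entire execution time is SNMP wait, so tuning the monitoring server itself is unlikely to help.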

I think your only chance to reduce the false alarms is to increase check_interval and the SNMP timeouts. If you have a 60s check_interval and have to wait 70s for the response, the next check is already due before the current one has finished.
Do all your switches have this problem, or only the larger ones?
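The arithmetic behind this can be written down as a tiny Python sketch. It is only a simplified model for reasoning about the numbers in this thread, not how Checkmk schedules checks internally. The function name and the three verdicts are made up for illustration:

```python
def classify_run(duration_s, check_interval_s, timeout_s):
    """Classify one check run against its timeout and check interval.

    Simplified model: a run longer than the timeout is killed ("timeout"),
    a run longer than the interval is still going when the next check is
    due ("overrun"), anything else is fine ("ok").
    """
    if duration_s > timeout_s:
        return "timeout"
    if duration_s > check_interval_s:
        return "overrun"
    return "ok"


# 70s response with a 60s interval, assuming a 120s timeout.
print(classify_run(70, check_interval_s=60, timeout_s=120))
# 31.8s run from the stack, with a 5 minute interval and a 120s timeout.
print(classify_run(31.8, check_interval_s=300, timeout_s=120))
```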

Hmm, sometimes you don’t have the choice… I have these SNMP problems with the management boards of not-so-cheap servers too. For a long time I tried to tweak time values, bulk mode and different SNMP protocol versions, and nothing really helped. In the end I set the number of check attempts so that the first failure does not cause a notification; with 99% certainty the service switches back to OK on the next invocation.

Regards,
sultansofswing.
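To illustrate why raising the number of check attempts hides single timeouts, here is a minimal Python sketch of a soft/hard state model. It is a simplification for illustration, not Checkmk's or the monitoring core's real logic. The function name and the sample results are made up. In WATO the matching setting should be the rule “Maximum number of check attempts for service”, but please verify the name in your version.

```python
def notifications(results, max_attempts):
    """Return (index, event) pairs for a simplified soft/hard state model.

    results: list of booleans, True meaning the check was OK.
    A PROBLEM is reported only after `max_attempts` consecutive failures.
    A RECOVERY is reported only if a PROBLEM was reported before.
    """
    failures = 0
    alerted = False
    events = []
    for i, ok in enumerate(results):
        if ok:
            if alerted:
                events.append((i, "RECOVERY"))
            failures = 0
            alerted = False
        else:
            failures += 1
            if failures >= max_attempts and not alerted:
                events.append((i, "PROBLEM"))
                alerted = True
    return events


# Made-up history: one isolated timeout, then two timeouts in a row.
results = [True, False, True, True, False, False, True]
print(notifications(results, max_attempts=1))
print(notifications(results, max_attempts=2))
```

With one attempt, the isolated timeout already triggers a PROBLEM and a RECOVERY. With two attempts, only the double failure is reported.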

My SNMP check interval is 5 minutes and the timeout is 120s.
It’s very strange, because I have problems on all switches, both smaller ones like the HP-1810-8G and larger ones like the HPE Office Connect 1950 12 XGT 4SFP+ or stacks of HPE Office Connect 1950 48 ports.

I’ll try to increase the check interval…

If it also happens on small devices, then it looks more like a general problem.
The only way to inspect how your devices respond is on the command line, with a call like “cmk --debug -vv hostname”.
If you see that all devices are slow on the interface table, then you have a general problem with all of them. If you have other brands available to compare, you will see very different behavior.

Maybe it is also a good idea to test some of the devices with a pure snmpwalk, to see whether this gives you more details on your problem or on what makes it so slow.
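If you want to compare devices side by side, a small Python wrapper can time such a walk and count the returned lines. This is a rough sketch: `time_command` is a made-up helper, and the community string `public` in the commented example is an assumption, so use your own. The OID 1.3.6.1.2.1.2.2 is the IF-MIB interface table (ifTable).

```python
import subprocess
import time


def time_command(cmd, timeout_s=300):
    """Run cmd and return (elapsed seconds, number of stdout lines)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    elapsed = time.monotonic() - start
    return elapsed, len(proc.stdout.splitlines())


# Hypothetical usage against the big stack (needs net-snmp installed;
# the community string "public" is an assumption):
#
#   elapsed, rows = time_command(
#       ["snmpbulkwalk", "-v2c", "-c", "public",
#        "192.168.254.242", "1.3.6.1.2.1.2.2"]
#   )
#   print(f"{rows} lines in {elapsed:.1f}s")
```

Running the same walk once with snmpwalk and once with snmpbulkwalk should also show whether bulk requests make a difference on a given device.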