(Service Check Timed Out) in RAW edition 1.6.0p11 (CentOS 8)

cotterse · May 20, 2020, 1:13pm

Hi

I have a problem monitoring my network switches via SNMP:
I receive a lot of CRITICAL alerts with (Service Check Timed Out).

For example :

Host	switch-bureau04(switch-bureau04)
Service	Check_MK
Event	OK → CRITICAL
Address	192.168.254.248
Date / Time	Wed May 20 14:55:31 CEST 2020
Plugin Output	(Service Check Timed Out)
Additional Output
Host Metrics	rta=3.383ms;15000.000;30000.000;0; pl=0%;80;100;; rtmax=6.584ms;;;; rtmin=2.215ms;;;;
Service Metrics

And a few seconds later, the recovery:

Host	switch-bureau04 (switch-bureau04)
Service	Check_MK
Event	CRITICAL → OK
Address	192.168.254.248
Date / Time	Wed May 20 14:55:39 CEST 2020
Plugin Output	OK - [snmp] Success, execution time 4.5 sec
Additional Output
Host Metrics
Service Metrics	execution_time=4.482 user_time=0.020 system_time=0.020 children_user_time=0.050 children_system_time=0.070 cmk_time_snmp=4.322 cmk_time_agent=0.000

Which parameter should I change, and in which rule?
I’ve tried changing the “service_check_timeout” value to 120 (instead of 60) in tuning.cfg, but nothing changed.
I also changed the SNMP check interval to 5 minutes.
Thanks for your help

Regards.

Sebastien

andreas-doehler · May 20, 2020, 6:39pm

It is not a problem with the overall timeout. To me it looks more like an SNMP timing problem.
Inside WATO you can find a rule called “Timing settings for SNMP access”. You should play around a little with its parameters. I think the timeout for a single query can help. Set it to something like 5 seconds and test.

cotterse · May 20, 2020, 7:11pm

It doesn’t work.

My settings for the folder “Network”, which contains all switches, from very small ones (5 ports) to big stacks:

I receive the timeout for many switches, small and big.

cotterse · May 20, 2020, 7:12pm

For the big stack over 1 week:

cotterse · May 20, 2020, 7:15pm

Last alert :

Host	switch-stack2-2eme(switch-stack2-2eme)
Service	Check_MK
Event	CRITICAL → OK
Address	192.168.254.242
Date / Time	Wed May 20 20:49:56 CEST 2020
Plugin Output	OK - [snmp] Success, execution time 31.8 sec
Additional Output
Host Metrics	rta=3.950ms;15000.000;30000.000;0; pl=0%;80;100;; rtmax=9.051ms;;;; rtmin=1.404ms;;;;
Service Metrics	execution_time=31.751 user_time=0.070 system_time=0.050 children_user_time=0.080 children_system_time=0.130 cmk_time_snmp=31.417 cmk_time_agent=0.000

andreas-doehler · May 20, 2020, 7:37pm

That is bad, cheap hardware if a stack of 3 switches needs 100 seconds to return the whole interface table.
Small switches should not have this problem.
The only important diagram is “Datasource: Time usage by phase”. There you can see that your system is only waiting for the SNMP response.
Is this device configured to use bulkwalk?
If all of this is already done, you don’t have many options left to solve this problem.
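
To put a number on the "only waiting for the SNMP response" point, here is a minimal sketch that splits the Service Metrics line from the last alert above into its parts and compares the SNMP wait time with the CPU time actually used. The function names are made up for illustration, and the parsing assumes plain `name=value` pairs as shown in that alert (no warn/crit fields).

```python
# Service Metrics line copied from the "Last alert" post above.
SERVICE_METRICS = (
    "execution_time=31.751 user_time=0.070 system_time=0.050 "
    "children_user_time=0.080 children_system_time=0.130 "
    "cmk_time_snmp=31.417 cmk_time_agent=0.000"
)

CPU_KEYS = ("user_time", "system_time", "children_user_time", "children_system_time")


def parse_metrics(text):
    """Turn whitespace-separated 'name=value' pairs into a dict of floats."""
    metrics = {}
    for token in text.split():
        name, _, value = token.partition("=")
        metrics[name] = float(value)
    return metrics


def snmp_share(metrics):
    """Fraction of the total execution time spent waiting for SNMP."""
    return metrics["cmk_time_snmp"] / metrics["execution_time"]


def cpu_seconds(metrics):
    """Sum of the CPU time fields (check process plus child processes)."""
    return sum(metrics[key] for key in CPU_KEYS)


metrics = parse_metrics(SERVICE_METRICS)
print(f"SNMP wait: {snmp_share(metrics):.1%} of execution time")
print(f"CPU time:  {cpu_seconds(metrics):.3f} s")
```

For that alert this comes out at roughly 99% SNMP wait and about a third of a second of CPU time, so the monitoring server itself is not the bottleneck.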

marass · May 20, 2020, 7:57pm

I think your only chance is to increase check_interval and snmp_timeouts to reduce your false alarms. If you have a 60s check_interval and you have to wait 70s for a response, the next check is already scheduled before the current one has finished.
Do all your switches have this problem or only the larger ones?
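
The arithmetic behind this can be sketched in a few lines. This is a deliberately simplified model, not how the monitoring core really schedules checks: it only assumes that a run longer than the service check timeout is reported as timed out, and that a run longer than the check interval is still busy when the next run is due. `classify_run` is a hypothetical helper name.

```python
def classify_run(execution_time, check_timeout, check_interval):
    """Classify one check run in a simplified model (all values in seconds)."""
    if execution_time > check_timeout:
        return "timed out"
    if execution_time > check_interval:
        return "overruns interval"
    return "ok"


# The example from the post: 70 s response with a 60 s interval
# (timeout of 120 s taken from the settings mentioned earlier in the thread).
print(classify_run(70, check_timeout=120, check_interval=60))

# A 100 s walk against the default 60 s service_check_timeout.
print(classify_run(100, check_timeout=60, check_interval=60))

# The 31.8 s run from the last alert with a 120 s timeout and a 5 minute interval.
print(classify_run(31.8, check_timeout=120, check_interval=300))
```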

sultansofswing · May 20, 2020, 10:09pm

Hmm, sometimes you don’t have the choice… I have these SNMP problems with management boards of not-so-cheap servers too. For a long time, I tried to tweak time values, bulk mode, different SNMP protocol versions – nothing really helped. Finally, I have set the check count so the first failure will not cause notifications – it will switch to OK with 99% certainty on next invocation.

Regards,
sultansofswing.
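
To illustrate why raising the check count suppresses these alerts, here is a rough sketch of Nagios-style soft/hard state handling (what Nagios calls max_check_attempts). It is a simplified model under my own assumptions: it only counts consecutive failures and ignores recovery notifications and repeated notifications. The function name is hypothetical.

```python
def problem_notifications(results, max_attempts):
    """Return the indexes of the checks at which a problem notification is sent.

    results      -- sequence of booleans, True meaning the check was OK
    max_attempts -- consecutive failures needed before the problem counts
    """
    notified = []
    failures = 0
    for index, ok in enumerate(results):
        if ok:
            failures = 0  # any OK result resets the counter
            continue
        failures += 1
        if failures == max_attempts:
            notified.append(index)  # problem becomes "hard" here
    return notified


# One isolated timeout, later two timeouts in a row.
history = [True, False, True, True, False, False, True]

print(problem_notifications(history, max_attempts=1))
print(problem_notifications(history, max_attempts=2))
```

With one attempt, both the isolated timeout and the double timeout notify; with two attempts, only the double timeout does, which matches the behaviour described above for checks that recover on the next invocation.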

cotterse · May 22, 2020, 7:33am

My SNMP check interval is 5 minutes and the timeout is 120s.
It’s very strange because I have problems on all switches (smaller ones like the HP-1810-8G and larger ones like the HPE Office Connect 1950 12 XGT 4SFP+ or stacks of HPE Office Connect 1950 48 ports).

I’ll try to increase the check interval…

andreas-doehler · May 22, 2020, 7:56am

If it also happens on small devices then it looks more like a general problem.
You can only inspect the response of your devices on the command line with a call like “cmk --debug -vv hostname”.
If you see that all devices are slow at the interface table then you have a general problem with all the devices. If you have other brands available to compare you will see very different behavior.

marass · May 22, 2020, 8:17am

Maybe it is also a good idea to test some of the devices with a plain snmpwalk, to see whether this gives you new or more detailed information about your problem or about what makes it so slow.
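
A minimal sketch of such a test, assuming the Net-SNMP command line tools (`snmpwalk`, `snmpbulkwalk`) are installed and the device speaks SNMP v2c. The community string `public` is a placeholder, the OID is the standard ifTable (1.3.6.1.2.1.2.2), and the helper names are made up for illustration. It times a plain walk against a bulk walk of the interface table.

```python
import subprocess
import sys
import time


def time_command(cmd, timeout=300):
    """Run cmd and return (elapsed seconds, number of output lines, return code)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    elapsed = time.monotonic() - start
    return elapsed, len(proc.stdout.splitlines()), proc.returncode


def walk_commands(host, community="public", oid="1.3.6.1.2.1.2.2"):
    """Build the two Net-SNMP commands to compare (5 s per query, 1 retry)."""
    common = ["-v2c", "-c", community, "-t", "5", "-r", "1", host, oid]
    return {
        "snmpwalk": ["snmpwalk"] + common,
        "snmpbulkwalk": ["snmpbulkwalk"] + common,
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    for name, cmd in walk_commands(sys.argv[1]).items():
        try:
            elapsed, lines, rc = time_command(cmd)
        except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
            print(f"{name}: could not complete ({exc})")
            continue
        print(f"{name}: {elapsed:.1f} s, {lines} lines, exit code {rc}")
```

Run it with the switch address as the only argument. If the bulk walk is much faster than the plain walk, bulkwalk is worth enabling for that host; if both are slow, the device itself is slow to answer.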
