SNMP Polling - Check_MK Discovery (Service Check Timed Out)

We set up Checkmk for a proof of concept in our office. It worked fine with a handful of nodes. We then added ~50 devices using SNMP polling, and we are getting more and more timeouts shown as "Check_MK Discovery (Service Check Timed Out)". We added more RAM and CPU to the Ubuntu server and it did not help. It is using ~30% of the overall resources at peak. We are unable to see why the SNMP polling is not working for all devices. Please advise on how to troubleshoot this.

TY,

Anthony

CMK version:
2.3.0p20

OS version:
Ubuntu 24.04

Error message:
Check_MK Discovery (Service Check Timed Out) - We do SNMP Polling only

Hi @arabbito,
you can try to increase the timeout for SNMP requests.
Does it work if you monitor just 5 devices? Also have a look at the server performance and kernel statistics inside CMK, and read up a little on CMK performance. It sounds like you are using a virtual CMK server; IMHO this is not the best idea… maybe for testing or small environments…

Ah… sorry… you wrote that it worked fine with only a few clients.
Sounds like performance issues. SNMP is a little b**** :wink: Try different SNMP settings and have a good look at the performance statistics from CMK itself.

Thank you for this info. We raised the SNMP timeout to the maximum of 60 seconds, but that did not resolve it. This server is running on SSDs with plenty of power behind it. It is not using much of that power, and yet the polling is very slow and we don’t understand why. How do we check the performance of this?

I am not sure, but I think you have to monitor the monitoring server itself to get the performance info inside the site…
The performance data then shows up as a “service” with a lot of info. You can also add plugins to the dashboard for a quick overview.

In global settings you can tweak and tune CMK, but please read the documentation about that.

Where are these settings for the Fetcher and such? We don’t see them in our portal. We’ve tweaked everything we can find and it is still the slowest thing running here. We don’t understand. This was supposed to be a fast tool. We are testing the RAW edition to see how well it works. If all is good after testing, we will get the paid version.

NM. Seems like the RAW edition does not have these available. We will look for another tool to use outside of CheckMK. TY for the help.

We run CMK on a retired Hyper-V host server :wink: on hard disks, not SSDs… next year we get new servers and can take the next generation of old Hyper-V hosts for CMK. We have over 900 hosts (on more than 130 bad WANs) and more than 13k (!) services, roughly 50% of them SNMP: printers, switches, access points.
Yes, we use the Micro Core (CEE edition on Debian Linux) because of performance, and for the Windows and Linux agents you can use the bakery.

50 hosts really should not be a problem, not even for the RAW edition on good hardware…
Maybe try longer timeouts AND longer intervals between checks.
Are the 50 hosts on the LAN or reached over WAN connections?

Did you test each device as you added it? What I am asking is: are you sure you can poll them? That is something we had to learn… SNMP is a mean b**** :wink: Just a little firmware difference within one printer model, and one unit works and the next does not. Most of our problems were with SNMP v3. SNMP v2 works in most cases, but even there you need to know a few things.

If a poll is unsuccessful, it is repeated 3 times with a 1-minute pause, unless configured otherwise. That can also cause problems.

Some points that should be checked if there are problems with SNMP devices:

  • on the command line, as the site user, run “cmk --debug -vvI snmpdevicename”
  • check at the end how long the command needed
  • if it stops with an error at a specific section, disable this section for this device (this happens often)
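
To find which of the ~50 devices are the slow ones, the steps above can be repeated per host and timed. The following is only a rough sketch: the `cmk --debug -vvI` call is the one quoted above, while the helper names (`time_command`, `discovery_argv`, `time_hosts`) and the 300-second cut-off are made up for illustration. It assumes it is run as the site user so that `cmk` is on the PATH.

```python
import subprocess
import time


def time_command(argv, timeout_s=300):
    """Run argv and return (elapsed_seconds, exit_code, last_output_line).

    exit_code is None if the command could not be started or was cut off.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
        code = proc.returncode
        lines = (proc.stdout + proc.stderr).strip().splitlines()
        last = lines[-1] if lines else ""
    except subprocess.TimeoutExpired:
        code, last = None, f"gave up after {timeout_s}s"
    except FileNotFoundError:
        code, last = None, f"command not found: {argv[0]}"
    return time.monotonic() - start, code, last


def discovery_argv(host):
    # The debug discovery command from the post above, for one host.
    return ["cmk", "--debug", "-vvI", host]


def time_hosts(hosts, build_argv=discovery_argv, timeout_s=300):
    """Time one command per host and return the results, slowest first."""
    results = [
        (host, *time_command(build_argv(host), timeout_s)) for host in hosts
    ]
    return sorted(results, key=lambda r: r[1], reverse=True)


def report(results):
    # One line per host: elapsed time, exit code, host name, last output line.
    for host, elapsed, code, last in results:
        print(f"{elapsed:7.1f}s  exit={code}  {host}  {last}")
```

Calling `report(time_hosts(["switch01", "printer07"]))` (both host names are placeholders) prints the slowest hosts first. The last output line is often enough to see which section a host got stuck on.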

What is usually not the problem:

  • hardware resources
  • general SNMP timeout settings - setting them wrongly leads to more problems.
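
A bit of arithmetic shows why a very long SNMP timeout can make things worse. This is a simplified model and not how Checkmk schedules its requests internally. It assumes each request is tried once plus a number of retries, that every attempt waits the full timeout, and that a device leaves a few requests unanswered. The retry count of 2 and the 3 unanswered requests are made-up example values; only the 60-second timeout comes from this thread.

```python
def worst_case_wait(timeout_s, retries, dead_requests):
    """Seconds spent waiting if `dead_requests` requests never get an answer.

    Simplified model: every request is attempted (retries + 1) times and
    each attempt blocks for the full timeout before giving up.
    """
    attempts = retries + 1
    return timeout_s * attempts * dead_requests


# Compare a short, a moderate and the 60-second timeout from this thread.
for timeout_s in (1, 5, 60):
    waited = worst_case_wait(timeout_s, retries=2, dead_requests=3)
    print(f"timeout {timeout_s:>2}s -> up to {waited}s of waiting on one host")
```

Under these assumptions a 60-second timeout adds up to 540 seconds of pure waiting on a single host, which is more than enough for a discovery check to time out. A shorter timeout fails faster and shows the broken section sooner.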