CMK 2.1: service discovery fails for slow devices

bitwiz · August 7, 2022, 5:19pm

CMK version: CEE 2.1.0p9
OS version: Ubuntu 20.04.4 LTS x64

Error message: Error running automation call try-inventory: Your request timed out after 110 seconds. This issue may be related to a local configuration problem or a request which works with a too large number of objects. But if you think this issue is a bug, please send a crash report.

Has there been a change between 2.0.x and 2.1.x that causes service discovery (Host → service configuration) to time out after 110 seconds?

We’ve got some devices polled via SNMP that take a long time to be polled - a Checkmk polling cycle takes about 150 seconds, and when doing a service discovery we’d talk about 170 seconds.

This was not an issue for years, up until upgrading to 2.1.0p5 (and all subsequent stable patch releases) - discovery would take however long it needed to (we’ve set the upper limit for service checks to 3 minutes for those hosts), but eventually succeed.

Now, with 2.1.0, it is impossible for us to use the Web GUI to make changes to services discovered as we don’t even get any chance to click “tabula rasa”, or using cached results.

At the moment I modify ~/var/check_mk/autochecks/MYHOST.mk manually.

Is there some setting that “110 seconds” correlates to and which can be modified? Is it even an intended behavior change in branch 2.1 that force-aborts the GUI service discovery after X seconds, or a bug?

andreas-doehler · August 7, 2022, 6:47pm

How long the real discovery takes should be measured on the command line with “cmk --debug -vvI hostname”.
As there are more SNMP checks available with every version it is possible that your device has a problem with some checks.
One point i would also have a look at is the resource usage on your monitoring system after the upgrade. The 2.1 release needs more hardware resources as the 2.0. But not so much.

bitwiz · August 8, 2022, 11:17am

The discovery (I used your command, prepended with “time”) does not take longer with 2.1

SUCCESS - Found no new services

real 2m32.276s
user 0m5.959s
sys 0m0.967s

The issue is the hard cutoff after 110 seconds, apparently newly introduced in 2.1 for the Web GUI discovery? I did not find a corresponding incompatible werk, therefore suppose that this cutoff is not supposed to happen?

grep -R 110 ~/etc/check_mk/conf.d/wato/ does not display any results either (so there is no rule where we’d set any check parameter to 110 seconds deliberately?)

andreas-doehler · August 8, 2022, 3:50pm

2 minutes and 32 seconds are very long for a discovery. The 110 seconds are a hard coded timeout inside the web server component. There is no config option and this was already there before 2.1.
It is possible that your host needed not so much time with the 2.0. As there are some fewer checks to discover.
Also here i would look what takes so long (what check) and if not needed i would disable some SNMP sections for this device.

bitwiz · August 8, 2022, 8:10pm

There is no way to see historical execution time for the Check_MK Discovery Check, is there?

If I use the Check_MK service execution time as approximation it’s obvious that nothing changed before or after the introduction of 2.1:

Also, here’s a screenshot of a subset of “Service check timeout (Microcore)” rules we’ve had (and remained unchanged) for years:

As you can see we set the timeout for the Check_MK and the Discovery service to over 3 minutes, or nearly 4 minutes for the horribly, horribly, horribly slow Juniper EX2200 switches.

This never was an issue for the Check_MK Discovery service status or the Web GUI discovery - in fact, with 2.1, the Check_MK Discovery check still completes successfully just fine thanks to this extension of the default check timeout.

It’s only the Web GUI that aborts prematurely, after exactly 110 seconds, and only since updating to 2.1, for the very same devices that most certainly took way, way longer than 110 seconds for service discovery even with 2.0 (or 1.6, for that matter).

So either the 110 seconds timeout was never enforced in previous versions, whether deliberate or not, or I’m out of ideas.

In any case, I don’t understand what would be the point of such a hardcoded timeout for the automation command - if my device takes 5 minutes to complete a Discover scan, why should I not be able to watch the “Updating” message in my browser for 5 minutes? Precisely the fact that this cutoff only happens if the administrator deliberately and manually wants to kick off a discovery scan (as opposed to the Check_MK Discovery service that runs automatically every $INTERVAL) makes me scratch my head as to the point of it? Is there a point to it?

Also, is there any other way to do what we are now prevented to do thanks to this unwelcome threshold? I’m not aware of any other way to see new, vanished, existing services on the same page, with the button to modify check parameters for any of them before activating changes.
Well, other than cmk -I for the hosts, or auto-modifying the set of services via the automatic service discovery check, and afterwards scrolling through the autochecks file?

By the way, in this case we’re talking about Juniper switches that take more or less linearily more time per active (UP) interface, about 6 seconds per interface, to walk. As these are 48 port switches, a discovery of the standard IF-MIB would take nearly 5 minutes, and that does not include DOM or similar OIDs… Thankfully our switches are not fully utilized…

Even if we could get those to report on cached results of their Broadcom NICs instead of querying them live at the time of polling, we’ve got other devices such as GPON OLT headends with > 100 devices (ONTs) connected, all of them with 4 interfaces or more… All reported via 1 headend - those won’t finish in 110 seconds either, just by virtue of the massive amount of individual SNMP entries in their trees.

EDIT:

With regards to Juniper switches in particular, this was already discussed on checkmk-en in 2015:

and

Looks like a “row-based” querying would solve or alleviate slow SNMP polling for Juniper devices at least, yet I don’t see any way how the current implementation of Checkmk Check API would allow for that. (and indeed it has not been implemented since 2015). Yet, even if Juniper was handled somehow, we’ve got the other devices as mentioned.

bitwiz · August 11, 2022, 3:39pm

To save others the research:

lib/check_mk/gui/http.py:

    @property
    def request_timeout(self) -> int:
        """The system web servers configured request timeout.

        This is the time before the request terminates from the view of the client."""
        # TODO: Found no way to get this information from WSGI environment. Hard code
        #       the timeout for the moment.
        return 110

In my case I’ve now changed it to 230 (in order to test whether this is indeed the right place)

NOTE: This needs to happen in concert with a corresponding change to the PHP-FPM site config:

etc/apache/conf.d/02_fcgid.conf

IPCCommTimeout 120
GracefulShutdownTimeout 120

… which I’ve changed to 240.

Don’t forget to “omd restart apache”, afterwards my switch discovery worked fine again.

I’ve got no idea why it was working fine for years despite the obvious intention (and FPM settings?) suggesting otherwise, but it most certainly did work, without any modification to the builtin Checkmk lib/ code

Loonen · November 2, 2022, 8:48am

Thanks for pointing out the solution, we also have the same problem @bitwiz described after an upgrade to 2.1.x
We have the issue with some ILO’s where the discovery scan takes about 3 minutes to complete, especially in the night when there are cloud backups running.

urmel · November 25, 2022, 12:05pm

Thanks for pointing out the solution, we also have the same problem @bitwiz described after an upgrade to 2.1.x

We had no problems with version 2.1.0. Only today after I updated to 2.1.0p16 there are these problems with Huawei switches.

urmel · November 25, 2022, 3:09pm

I changed the timeout to 300 in this files

/opt/omd/sites/mysite/etc/apache/conf.d/02_fcgid.conf
/opt/omd/versions/2.1.0p16.cre/lib/python3/cmk/gui/http.py

After that slow switches etc. bring this messages:

Failed to fetch state [502]: Proxy Error. Retrying in 1 second.

In case this error persists for more than some seconds, please verify that all processes of the site are running.

jcheckmk · January 3, 2023, 4:53pm

Hello,

Followed @bitwiz instructions however running into the same issue as @urmel.
Is there a fix for this or a fix around as we have a stack of juniper switches 9 devices and snmp keeps timing out per checkmk. SNMP walk seems ok so we know it appears to be a checkmk limitation. If anyone can assist this would be great!

m.ludeke · January 4, 2023, 12:55pm

The following default web timeouts must be observed for Checkmk:

Client ==== 300 ===> Main-Apache ==== 120 ===> Site-Apache === 110 ===> WSGI/Python

Here, the 110 seconds for the execution of Python code have simply been hardcoded. The Site Apache instance is addressed via ReverseProxy, the default timeout was set to 120 seconds in apache config so that WSGI errors still have ~ 10 seconds to send their response to the client.

In addition to the changes in http.py, the timeout of the ReverseProxy settings must be set too.

In “/omd/apache/SITE.conf”

ProxyPass http://127.0.0.1:5001/SITE retry=0 disablereuse=On timeout=310

For “old” migrated sites, the config is located under:

/omd/sites/SITE/etc/apache/proxy-port.conf

After that, restart main and site apache.

jcheckmk · January 5, 2023, 11:12am

Thank you so much for the explanation and walk through! Works like a charm!

MarsellusWallace · May 26, 2023, 7:49am

FYI: Discovery timeout has been fixed in Discovery: Improve performance and prevent apache timeout