Local cached checks are not updated since upgrade to 2.0 (outages are NOT recognized)

Since the update to 2.0 (raw), all cached local checks are displayed as stale in the UI. Data is sent by the agent as usual.

A manual “Reschedule CheckMK” runs a check and brings the service up to date, but after that it goes stale again.
I’m not sure whether this affects all local checks, but it affects at least a large number of them.

One thing to note (and perhaps the trigger for the problem): most (if not all) of these checks are cached for 45 seconds to achieve an asynchronous check with an update on each check interval.

Regards Michael

Is this a Linux/Unix system where the local checks are from?
It would also be important to know exactly which version of 2.0 you use. p10 brought a bigger change inside the Linux agent regarding cached/asynchronous checks.

Hi Andreas,

sorry for not mentioning that:
yes, currently these are all Linux systems. The CheckMK version is the current 2.0.0p11.

Regards Michael

Hi
one update:

Checks which are not OK seem to be updated. But for checks which are OK, it seems that the state of the “first green” is kept.

For the moment I assume that cached local checks are only updated on a state change and are therefore shown as stale.

Another update:

I have to revert my previous statement.

We had an outage last night. The check is displayed as green and stale.
After a manual “Reschedule Check” the state changed to warn because the file was too old (it would have changed to critical 6 hours ago).

So currently we have an issue with the monitoring itself and not only with the display.

Regards Michael

Hi

I checked the local plugin and I think I found (perhaps) a solution (to be verified).

I changed line 319 in /opt/omd/sites//lib/python3/cmk/base/plugins/agent_based/local.py

from

    if local_result.cached.elapsed_lifetime_percent > 100:

to

    if local_result.cached.elapsed_lifetime_percent > 200:

and all services seem to be up to date.

For a real fix this should perhaps not use 200% as a threshold but a combination of cache time, scheduled check interval and a small time buffer,

e.g. max(cache_time, scheduled check interval) * 1.1
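The suggested threshold can be written out as a small sketch. This is only an illustration of the formula above, not code from local.py; the function names and the 1.1 buffer parameter are hypothetical.

```python
def stale_threshold_seconds(cache_time: float, check_interval: float,
                            buffer: float = 1.1) -> float:
    """Age (in seconds) beyond which a cached result would count as expired.

    Hypothetical helper: takes the larger of cache time and check interval
    and adds a 10% buffer, as proposed in the post above.
    """
    return max(cache_time, check_interval) * buffer


def is_expired(age: float, cache_time: float, check_interval: float) -> bool:
    """True if a result of the given age exceeds the proposed threshold."""
    return age > stale_threshold_seconds(cache_time, check_interval)


# 45 s cache time, 60 s check interval: the threshold is about 66 s,
# so a 60 s old result would still be accepted, a 70 s old one would not.
print(stale_threshold_seconds(45, 60))
print(is_expired(60, 45, 60))
print(is_expired(70, 45, 60))
```

With this rule the threshold follows the check interval whenever the cache time is shorter, instead of being a fixed multiple of the cache time.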

Hi,

created a PR https://github.com/tribe29/checkmk/pull/399

Regards Michael

hi @micha! I believe one of my colleagues has answered your PR, but for the benefit of others reading this: The creation of the caches is triggered either every 60 seconds, independently of polling the data (for systemd based setups), or right after the agent data has been polled (for xinetd based setups).
This means that if your cache interval is smaller than the check interval, you will never get unexpired data in the xinetd case, and in the systemd case only if you’re lucky.

Hi Moritz,

that’s correct and is exactly the reason why we chose a cache interval of 45 seconds.

Just to be sure: we had no problem on the agent side. Caching works and caches are updated every minute without any fix.

The problem occurred on the checkmk-server-side, where the results of the cached checks were interpreted as stale and therefore ignored (because their age, roughly one check interval, is about 133% of the cache time).
So this fix is also only on the checkmk-server-side. With it, the check treats a result as stale only once its age exceeds twice the cache time.
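The 133% figure can be reproduced with a short calculation. This is a sketch under the assumption that the elapsed lifetime is simply the age of the data divided by the cache interval; the function below is hypothetical and not taken from local.py.

```python
def elapsed_lifetime_percent(age: float, cache_interval: float) -> float:
    """Age of the cached data as a percentage of the cache interval.

    Assumed formula for illustration only.
    """
    return 100.0 * age / cache_interval


# Data created right after the last poll is one check interval (60 s) old
# at the next poll, while the cache interval is 45 s.
percent = elapsed_lifetime_percent(age=60, cache_interval=45)

print(round(percent))   # 133
print(percent > 100)    # True: rejected with the original threshold
print(percent > 200)    # False: accepted with the patched threshold
```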

Without this fix cached checks < 1 minute are completely broken and useless.

Regards Michael

Well, yes, they are. Unfortunately there’s only one number you can configure. This number tells the agent how often to create the data (which you want to set to “every time”), and it tells the server how long the data is valid. In your setup the server will most likely never see valid data, so the service goes stale. I believe that part to be correct behavior.

What you want to achieve, really, is to create the data often, but then have it valid for just a bit more than a check interval.

I’d say we should change the sed -e "/^<<</! s/^/$CACHE_INFO /" "$CACHEFILE" line in the agents’ run_cached, such that the “$CACHE_INFO” prefix is only written if it is not already present.

This would allow your local check to write it itself, allowing for a longer validity interval, while still being executed more often.
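To illustrate the idea, here is a minimal sketch of the proposed behaviour in Python rather than in the agent's shell code. It assumes that the prefix has the form cached(<creation time>,<max age>); the function name and the sample lines are hypothetical.

```python
def add_cache_info(lines, cache_info):
    """Prefix each data line with cache_info unless it already has a prefix.

    Section headers (lines starting with "<<<") are left untouched, as in
    the existing sed expression. Lines that already start with "cached("
    keep the prefix written by the local check itself (assumed format).
    """
    result = []
    for line in lines:
        if line.startswith("<<<") or line.startswith("cached("):
            result.append(line)
        else:
            result.append(f"{cache_info} {line}")
    return result


lines = [
    "<<<local>>>",
    "0 plain_check - written without a prefix",
    "cached(1600000000,90) 0 own_prefix_check - prefix written by the check",
]
for line in add_cache_info(lines, "cached(1600000000,45)"):
    print(line)
```

In this sketch the second line receives the agent's 45 second prefix, while the third keeps its own 90 second validity.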

(I think in the long run the caching mechanism should be redesigned to allow a decoupling of the actual caching and the execution method. We can see clearly in this case that caching is not even desired, we just want to have it executed asynchronously.)

Hi,

this variant worked perfectly with checkmk 1.
Defining the validity interval on the “client side” would be a possibility, but it would require changing the checks, maintaining different checks for cached and non-cached use, and configuring the validity somewhere.

The biggest drawback lies in the fact that the client has no knowledge of the check period. It can only assume that the default of 1 minute (or another custom-defined one) is used.

The checkmk server is the only one that knows the check period and should take care of that. If the cache time is less than the check period (and the last check attempt was successful), the result should in my eyes definitely not be considered stale (as there was no more recent attempt to check it).

For a long-term solution I agree that there should be a way to just execute checks asynchronously (perhaps all local checks?). Currently a cache time less than the check interval is the only way to achieve that.

Generally a threshold of exactly 100% is perhaps too strict, as it could be affected by minor timing issues. That’s the reason why I suggested max(cache_time, check interval) plus 10% as the threshold.

btw: we also have the default configuration “consider as stale if no data in the last 1.5 check attempts”, which does not take effect here (this should be enough for a 45 second cache and a check every 60 seconds)

Regards Michael

Hi

Why would someone set a check interval smaller than the agent interval? Because long-running checks can have a negative effect on the total agent runtime due to the sequential execution of the checks. Therefore, such checks should be executed asynchronously (official recommendation from tribe29). To ensure that the check data is updated on every agent run, the check interval must be smaller than the agent interval.

The problem is that cmk does not take into account that in such cases the check data can’t be updated before the next agent run. So if the check interval on the client side is smaller than the agent interval on the CMK side, the agent interval should be used to validate the age of the check data.

And yes, the execution and caching mechanism for plugins, local and mrpe checks should be redesigned to allow more checks to be executed within an agent call. The Windows agent has already implemented many good ideas for check handling.

Regards, Lars