Local cached checks are not updated since upgrade to 2.0 (outages are NOT recognized)

micha · September 27, 2021, 6:17pm

Since the update to 2.0 (raw), all cached local checks are displayed as stale in the UI. The agent sends data as usual.

A manual “Reschedule CheckMK” runs a check and brings the service up to date, but afterwards it goes stale again.
I’m not sure whether this affects all local checks, but it affects at least a large number of them.

One thing to note (and perhaps the trigger for the problem): most (if not all) of these checks are cached for 45 seconds to achieve an asynchronous check with an update on each check interval.

Regards Michael

andreas-doehler · September 27, 2021, 8:34pm

Is this a Linux/Unix system where the local checks are from?
It would also be important to know exactly which version of 2.0 you use; p10 brought a bigger change to the Linux agent regarding cached/asynchronous checks.

micha · September 28, 2021, 6:00am

Hi Andreas,

sorry for not mentioning that:
yes, currently these are all Linux systems. The CheckMK version is the current 2.0.0p11.

Regards Michael

micha · September 28, 2021, 7:24am

Hi
one update:

Checks which are not OK seem to be updated. But for checks which are OK, it seems that the state of the “first green” is kept.

For the moment I assume that cached local checks are only updated on a state change and are therefore shown as stale.

micha · September 28, 2021, 8:27am

Another update:

I have to revert my previous statement.

We had an outage last night. The check was displayed as green and stale.
After a manual “Reschedule Check” the state changed to warn because the file was too old (it would have changed to critical 6 hours ago).

So this is currently a monitoring issue, not just a display issue.

Regards Michael

micha · September 28, 2021, 8:51am

Hi

I checked the local plugin and I think I found (perhaps) a solution (to be verified).

I changed line 319 in /opt/omd/sites//lib/python3/cmk/base/plugins/agent_based/local.py

from

    if local_result.cached.elapsed_lifetime_percent > 100:

to

    if local_result.cached.elapsed_lifetime_percent > 200:

and all services seem to be up to date.

A real fix should perhaps not use 200% as the threshold but a combination of cache time, scheduled check interval and a small time buffer,

e.g. max(cache_time, scheduled check interval) * 1.1
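
A minimal sketch of that threshold idea follows. The function and parameter names are hypothetical and this is not the code in local.py; it also assumes the server-side check can see the service's check interval, which this thread does not confirm.

```python
def is_stale(age_s: float, cache_interval_s: float, check_interval_s: float,
             buffer_factor: float = 1.1) -> bool:
    """Return True if cached data should be treated as stale.

    Hypothetical rule: data only counts as stale once it is older than the
    larger of cache interval and check interval, plus a small buffer.
    """
    threshold_s = max(cache_interval_s, check_interval_s) * buffer_factor
    return age_s > threshold_s


# 45 s cache, polled every 60 s: the data is about 60 s old when it arrives.
print(is_stale(age_s=60, cache_interval_s=45, check_interval_s=60))   # False
# Data that has missed two polls is still flagged.
print(is_stale(age_s=130, cache_interval_s=45, check_interval_s=60))  # True
```

With a plain "older than the cache interval" rule, the first case would already count as stale, because 60 s is more than 45 s.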

micha · October 4, 2021, 7:58am

Hi,

I created a PR: https://github.com/tribe29/checkmk/pull/399

Regards Michael

moritz · October 6, 2021, 9:11pm

Hi @micha! I believe one of my colleagues has answered your PR, but for the benefit of others reading this: the creation of the caches is triggered either every 60 seconds, independently of polling the data (for systemd-based setups), or right after the agent data has been polled (for xinetd-based setups).
This means that if your cache interval is smaller than the check interval, you will never get unexpired data in the xinetd case, and in the systemd case only if you’re lucky.

micha · October 7, 2021, 4:22am

Hi Moritz,

that’s correct and is exactly the reason why we chose a cache interval of 45 seconds.

Just to be sure: we had no problem on the agent side. Caching works and caches are updated every minute without any fix.

The problem occurred on the checkmk server side, where the results of the cached checks were interpreted as stale and therefore ignored (because their age exceeded the cache interval, at about 133%).
So this fix is also only on the checkmk server side. With it, the check treats a result as stale only once its age exceeds twice the cache time.
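
The figure of about 133% can be reproduced with a small model. This is a sketch under assumptions: it takes the elapsed lifetime to be the data age divided by the cache interval, and it models the xinetd-style case described above, where the cache is rebuilt right after each poll. The helper ages_seen_by_server is hypothetical.

```python
def elapsed_lifetime_percent(age_s: float, cache_interval_s: float) -> float:
    """Share of the cache lifetime that has already passed, in percent."""
    return 100.0 * age_s / cache_interval_s


def ages_seen_by_server(check_interval_s: int, polls: int) -> list:
    """Simplified xinetd-style model: the cache is rebuilt right after each
    poll, so every later poll delivers data produced at the previous poll."""
    last_refresh = 0  # cache is rebuilt right after the poll at t=0
    ages = []
    for n in range(1, polls + 1):
        poll_time = n * check_interval_s
        ages.append(poll_time - last_refresh)
        last_refresh = poll_time  # rebuild happens right after this poll
    return ages


for age in ages_seen_by_server(check_interval_s=60, polls=3):
    pct = elapsed_lifetime_percent(age, cache_interval_s=45)
    print(f"age={age}s lifetime={pct:.0f}% over_100={pct > 100} over_200={pct > 200}")
```

In this model every poll sees data that is 60 s old, which is over the 100% limit but under the 200% limit used in the patch.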

Without this fix, checks cached for less than 1 minute are completely broken and useless.

Regards Michael

moritz · October 7, 2021, 6:31am

Well, yes, they are. Unfortunately there’s only one number you can configure. This number tells the agent how often to create the data (which you want to set to “every time”), and it tells the server how long the data is valid. In your setup the server will most likely never see valid data, so the service goes stale. I believe that part to be correct behavior.

What you want to achieve, really, is to create the data often, but then have it valid for just a bit more than a check interval.

I’d say we should change the sed -e "/^<<</! s/^/$CACHE_INFO /" "$CACHEFILE" line in the agent’s run_cached so that the “$CACHE_INFO” prefix is only written if it is not already present.

This would allow your local check to write it itself, allowing for a longer validity interval, while still being executed more often.
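
The proposed behaviour can be sketched as follows. This is an illustration in Python, not the agent's shell code: add_cache_info and the regular expression are hypothetical, and the cached(GENERATED_AT,VALID_FOR) prefix format is taken from the wrapper example later in this thread.

```python
import re

# Matches a line that already starts with a prefix like "cached(1633500000,90) ".
CACHED_PREFIX = re.compile(r"^cached\(\d+,\d+\) ")


def add_cache_info(lines, cache_info):
    """Prefix each data line with cache_info unless it is a section header
    ("<<<...") or already carries its own cached(...) prefix."""
    result = []
    for line in lines:
        if line.startswith("<<<") or CACHED_PREFIX.match(line):
            result.append(line)  # leave headers and self-prefixed lines alone
        else:
            result.append(f"{cache_info} {line}")
    return result


output = [
    "<<<local>>>",
    "0 backup_age - file is recent",
    "cached(1633500000,90) 0 long_job - finished",
]
for line in add_cache_info(output, "cached(1633500000,45)"):
    print(line)
```

Only the second line gets the agent's 45-second prefix here; the third keeps the 90-second validity that the check wrote itself.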

(I think in the long run the caching mechanism should be redesigned to decouple the actual caching from the execution method. We can clearly see that in this case caching is not even desired; we just want the check to be executed asynchronously.)

micha · October 7, 2021, 7:34am

Hi,

this variant worked perfectly with checkmk 1.
Defining the validity interval on the “client side” would be possible, but it would require changing the checks, maintaining different checks for cached and non-cached use, and configuring the validity somewhere.

The biggest drawback is that the client has no knowledge of the check period. It can only assume that the default of 1 minute (or another customer-defined one) is used.

The checkmk server is the only one that knows the check period and should take care of that. If the cache time is less than the check period (and the last check attempt was successful), the data should in my eyes definitely not be considered stale (as there was no more recent attempt to check it).

For a long-term solution I agree that there should be a way to just execute checks asynchronously (perhaps all local checks?). Currently a cache time less than the check interval is the only way to achieve that.

Generally, a threshold of exactly 100% is perhaps too strict, as it could be affected by minor timing issues. That’s the reason why I suggested max(cache_time, check interval) plus 10% as the threshold.

btw: we also have the default configuration “consider as stale if no data in the last 1.5 check attempts”, which does not take effect here (it should be enough for a 45-second cache and a check every 60 seconds).

Regards Michael

LaSoe · October 12, 2021, 9:44am

Hi

Why would someone set a check interval smaller than the agent interval? Because long-running checks can have a negative effect on the total agent runtime due to the sequential execution of the checks. Therefore, such checks should be executed async (official recommendation from tribe29). To ensure that the check data is updated on every agent run, the check interval must be smaller than the agent interval.

The problem is that cmk does not take into account that in such cases the check data can’t be updated before the next agent run. So if the check interval on the client side is smaller than the agent interval on the CMK side, the agent interval should be used to validate the age of the check data.

And yes, the execution and caching mechanism for plugins, local and mrpe checks should be redesigned to allow more checks to be executed within an agent call. The Windows agent has already implemented many good ideas for check handling.

Regards, Lars

micha · November 10, 2021, 1:58pm

Hi,

as there is an outstanding update:
What is the preferred short-term solution (except manually repatching every time)?

Regards

Michael

micha · December 15, 2021, 10:52am

Hi

@moritz could you please give an answer about a valid short term solution?

Regards Michael

moritz · December 15, 2021, 12:43pm

I suggest you try wrapping the local checks so that they can be treated as a plugin, and add the caching info yourself. Putting the wrapper in plugins/60 should then work (untested):

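    #!/bin/sh
    # my_local_check is a placeholder for the actual local check command.
    GENERATED_AT=$(date +%s)   # time the result was produced (epoch seconds)
    VALID_FOR=90               # assumed validity in seconds, chosen above the check interval
    cat << HERE
    <<<local>>>
    cached(${GENERATED_AT},${VALID_FOR}) $(my_local_check)
    HERE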
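
One detail of the wrapper above: the cached(...) prefix is written once, so if the wrapped check prints several lines, only the first one carries it. The sketch below shows the same idea with a prefix on every line. It is written in Python for illustration only; render_local_section and its parameters are hypothetical, and the validity value is an assumption to be chosen per setup.

```python
import time
from typing import Optional


def render_local_section(check_output: str, valid_for_s: int,
                         generated_at: Optional[int] = None) -> str:
    """Build a <<<local>>> section in which every non-empty output line of
    the check carries its own cached(generated_at,valid_for) prefix."""
    if generated_at is None:
        generated_at = int(time.time())  # default: result produced just now
    prefix = f"cached({generated_at},{valid_for_s})"
    lines = ["<<<local>>>"]
    for line in check_output.splitlines():
        if line.strip():  # skip blank lines
            lines.append(f"{prefix} {line}")
    return "\n".join(lines)


print(render_local_section("0 svc_a - ok\n1 svc_b - warning",
                           valid_for_s=90, generated_at=1633500000))
```

The validity is still a fixed number on the client side, so it has to be kept above whatever check interval is configured on the server.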

micha · January 7, 2022, 1:20pm

Hi @moritz

thanks for your suggestion.

This still does not look like the right solution to me. I still think that a fix on the server side would be the better way (especially for the short term).

Your solution would need to

* distinguish between
** checks which are cached for less than the check interval
** checks which are cached for longer than the check interval
** (and the check interval is not really known on the client side)
* define a new directory for the checks with a cache time less than the check period (to avoid processing by the checkmk caching)
* create a plugin which searches all files in this directory and wraps them as you suggested

So we

* have different ways depending on the cache time vs. the check interval
* need to move checks to other locations if we change the check interval on the server side (so effectively we may not change the check interval)
* effectively need to tell the client the check interval for a clean solution
* would confuse our users, as we have many 45s checks and also checks cached for longer

What are the cons of the solution I suggested?

* There is an existing caching infrastructure which is working well
* Currently there is just a “misinterpretation” of cached data which was not updated because there was no trigger for it
* Effectively it’s just a refinement of the definition of stale from “older than cachetime” to “older than cachetime AND last check/check interval”

This would

* need no change on the client side
* keep a clean and simple way to cache checks
* allow changing the check interval as needed
* be much easier to implement and fit
* be consistent with the checkmk 1.x behaviour

I see only pros for this solution and none for a definition on the client side.
What am I missing?

Currently it seems much easier to patch checkmk after each update…

Regards
Michael

PS: Putting your solution in plugins/60 would also fail if we changed the check interval (e.g. to 2 minutes), leading to no alerting at all.

system · January 7, 2023, 1:20pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed. Contact an admin if you think this should be re-opened.