Local check data lifetime

Hi,

Is it somehow possible to enforce something like “data needs to be newer than 5 minutes”?

I have been adding a lot of local checks lately (don’t get mad, but some of the check plugins are a pile of poop), and because this pile of checks keeps getting more complex, the agent run gets slower and slower.

So I started to run every script via systemd timers. Each script writes valid local check output to a file, and the agent just reads those files from /var/lib/check_mk_agent/spool. This works like a charm, and with some Puppet and Ruby it is even starting to be fun.
Then there was this one issue where the check was working, but an “exit 0” at the top of the script prevented it from regenerating the local check output, which led to a small but annoying service outage.

I know I can work around this with another script that checks the file age and replaces the content, but I would prefer not to frankenstein more than necessary.

I would recommend starting the local checks asynchronously from the agent, as intended: https://checkmk.com/cms_localchecks.html#Caching%20outputs

Spool file names can be prefixed with a number that denotes the file’s maximum age in seconds.

I just ran a test.
This works if the check exits before the defined time, but if the check runs longer than the defined time, check_mk kills the check, uses the last known value and restarts it.

#!/bin/bash
sleep 6000 # this gets added after the first successful runs.
echo "0 testservice - Everything fine"

Exactly. That’s how it works. If you expect your agent plugin to run for 5 minutes, it would be appropriate to put it into a 6-minute-directory (360), but not into 240.

Furthermore: if the plugin gets killed or exits in time but with a bad return code, then its output is discarded. That’s different from the behaviour of “non-detached” plugins. If they, for instance, exit 1, the rc is ignored and the output is sent to the server. If the very same plugin is run from a subdirectory, the rc is honored and its output is discarded.

That was my question: Is there a way to alert if the data is too old?

I expect a run time of 60 sec, so I put it into the 90 folder. But what happens if the script does not work as expected? Currently there is no indication for me when I do it the way I did it.

Could you explain a bit more or link to the documentation? I don’t understand what you are trying to tell me.

Let me explain what happens if you put a regular agent plugin into a subdirectory, say 90. Let’s assume the agent is called every 60 seconds and the plugin runs for 50 seconds:

  • 00s: Agent gets called. It doesn’t see any cachefile below /var/lib/check_mk_agent/cache/ so it starts the agent plugin in background (nohup … &) and redirects its output to a cachefile below /var/lib/check_mk_agent/cache/. The agent returns all other data except that from the plugin.

  • 50s: Plugin is done (with rc=0). Its output is written to the cachefile below /var/lib/check_mk_agent/cache/.

  • 60s: Agent gets called. It sees the cachefile, which is only 10 seconds old, and thus returns it with <<<...:cached(50,90)>>> in the section header (the 50 being the timestamp of the file and 90 being the max. age). This “decoration” allows the server to check how old the data is and if it’s still valid. The agent doesn’t call the plugin again because 10<90.

  • 120s: Agent gets called. It sees the cachefile, which is now 70 seconds old (120-50), and returns it. Again, it doesn’t call the plugin because 70<90 (i.e. the cachefile is not yet outdated).

  • 180s: Agent gets called. The cachefile is now 180-50=130 seconds old. I consider this at least surprising, but indeed, that cachefile is still returned. Now the plugin is called again because 130 (cachefile age) is greater than 90 (directory name).

  • 230s: Same as at 50s: the plugin is done and its output is written to the cachefile.

As you can see, with a 60-second agent interval, putting a plugin into the 90 directory results in plugin calls every 180 seconds. In the meantime, the agent returns the cached file and the server can see from the :cached(timestamp-of-cachefile,90) part in the section header how old the data is and how long it can be considered valid. If the data is too old, the server will show it dithered and tell you that it is outdated.
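To make the timeline easier to follow, here is a small bash sketch that replays it. It is only a model of the behaviour described above, not code taken from the agent; the function name simulate_agent and its four parameters (max. age, plugin runtime, agent interval, end time) are made up for illustration.

```bash
#!/bin/bash
# Model of the cache decision described above. This is NOT agent code;
# simulate_agent and its parameters are hypothetical names for illustration.

simulate_agent() {
    local max_age=$1 runtime=$2 interval=$3 end=$4
    local mtime=-1      # timestamp of the cachefile; -1 means "no cachefile yet"
    local done_at=-1    # time at which a running plugin finishes; -1 means "not running"
    local now age

    for ((now = 0; now <= end; now += interval)); do
        # A plugin started earlier has finished: its output becomes the cachefile.
        if ((done_at >= 0 && done_at <= now)); then
            mtime=$done_at
            done_at=-1
        fi

        # Whatever cachefile exists is returned, even if it is older than max_age.
        if ((mtime < 0)); then
            echo "${now}s: no cachefile, nothing returned for this plugin"
        else
            age=$((now - mtime))
            echo "${now}s: return cachefile, age ${age}s, header cached(${mtime},${max_age})"
        fi

        # Start the plugin if no cachefile exists or it is outdated,
        # and no earlier run is still in progress.
        if ((done_at < 0)) && ((mtime < 0 || now - mtime > max_age)); then
            done_at=$((now + runtime))
            echo "${now}s: start plugin in background"
        fi
    done
}

# Numbers from the walkthrough: 90 directory, 50 s runtime, agent every 60 s.
simulate_agent 90 50 60 240
```

Running it with other values (say, a runtime longer than the agent interval) shows how the gap between plugin starts changes.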

As for your 2nd question: if it turns out that the plugin that has run in the background exits with a bad return code (exit 1 instead of exit 0), then the background job simply discards the cachefile that might have been written by the plugin so far and there is no data to return when the agent runs the next time. The agent will then re-start the plugin.

So. This was for “regular” agent plugins. I haven’t looked too deeply into the behaviour of local checks. Unfortunately, they don’t seem to be decorated with that :cached(x,y) thingy when they are returned and that might exactly be your issue.

What you could do is: put your local check plugin somewhere else (outside the checkmk directories), have it called by cron and write its output to a spool file whose name is prefixed with a number, like so:

/var/lib/check_mk_agent/spool/90-my-spoolfile

The file must then contain the section header <<<local>>>. This file will only be returned if it is younger than 90 seconds.
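A minimal sketch of such a wrapper, to be called from cron or a systemd timer, could look like this. The names my_check and write_spoolfile are made up, and my_check is only a placeholder for the real check. The script writes to a hidden temporary file first and renames it, so the agent never reads a half-written file; that relies on the agent skipping hidden files in the spool directory, which is an assumption I have not verified.

```bash
#!/bin/bash
# Hypothetical wrapper; my_check and write_spoolfile are illustrative names.
# The default path and the file name 90-my-spoolfile come from the post above.
SPOOL_DIR="${SPOOL_DIR:-/var/lib/check_mk_agent/spool}"

my_check() {
    # Placeholder for the real check logic; must print local check lines.
    echo "0 testservice - Everything fine"
}

write_spoolfile() {
    local spool_dir=$1 tmp
    # Temporary file in the same directory, so the final mv is a plain rename.
    # The leading dot assumes the agent ignores hidden files (not verified).
    tmp=$(mktemp "$spool_dir/.my-spoolfile.XXXXXX") || return 1
    if { echo '<<<local>>>'; my_check; } > "$tmp"; then
        mv "$tmp" "$spool_dir/90-my-spoolfile"
    else
        # Check failed: keep the old spool file and let it age out.
        rm -f "$tmp"
        return 1
    fi
}

if [ -d "$SPOOL_DIR" ]; then
    write_spoolfile "$SPOOL_DIR"
else
    echo "spool directory $SPOOL_DIR not found" >&2
fi
```

If the script stops producing output for any reason (including a stray “exit 0” at the top), the spool file is no longer refreshed and drops out of the agent output once it is older than 90 seconds.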

That’s what I was looking for. I must have overlooked it. Thanks a lot.