Monitor number of stale services

jplitza · May 9, 2022, 7:13am

Hi everyone,

I’d like a check that monitors the number of stale services in CRE, since I find them the most reliable indicator of performance problems. Is there a “native” way to do this, or do I have to write a custom check that queries them via the REST API? The OMD performance check only seems to report the total number of services and the number of service checks per second (and it doesn’t allow setting lower levels on the latter).

openmindz · May 10, 2022, 5:40pm

Hi @jplitza

I only have an idea that might help: you can query Livestatus information with lq. The Livestatus tables hosts and services have a staleness column. So if you query, for example, your hosts table for the columns name and staleness:

lq "GET hosts\nColumns: name staleness"

You get a list similar to this one (hostnames intentionally modified):

HOST1;0.95
HOST2;0.95
HOST3;0.95
HOST4;0.683333
HOST5;0.95
HOST6;0.0666667

The value on the right keeps rising each time you run the query until it reaches 1 (or slightly above 1), and then drops again. I believe it expresses the time since the last check as a fraction of the check interval. If you filter for staleness > 1 and count the matches, you might be able to check the “number of stale services” this way.

Obviously, one still has to write a check for this. I hope I haven’t made a mistake in my thinking and am not leading you down a completely wrong path…
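The counting idea can be sketched as a small filter, assuming the name;staleness output format shown in the listing above. The helper name count_stale is hypothetical, and the threshold is passed as the first argument:

```shell
# Hypothetical helper: count lines of "name;staleness" output whose
# staleness exceeds the threshold given as the first argument.
count_stale() {
    awk -F';' -v limit="$1" '($2 + 0) > (limit + 0) { n++ } END { print n + 0 }'
}

# Intended use on a monitoring site (not executed here):
#   lq "GET hosts\nColumns: name staleness" | count_stale 1
```

Counting on the client side like this is only for illustration; letting Livestatus do the counting is likely cheaper on large sites.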

Perhaps someone from the forum can confirm or refute this, and/or has a “different/better idea”. In the meantime, here is the official livestatus documentation, with lots of helpful hints and examples:

5.3.2: Retrieving status data via Livestatus

HTH,
Thomas

andreas-doehler · May 10, 2022, 8:31pm

Your assumption is correct.

I would only change the comparison value to something like 2 or 3. In Raw Edition setups it is very common that, after a core restart, you have to wait 4 or 5 check cycles until all services have been checked for the first time.

I think the approach you described is correct, apart from the small modification to the staleness threshold that I mentioned.
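To see which services exceed such a threshold, the query could be built with a Filter header instead of counting. This is a minimal sketch: the helper name stale_services_query is hypothetical, the default of 2 follows the suggestion above, and the literal \n separators follow the lq call style used earlier in this thread:

```shell
# Hypothetical helper: print a Livestatus query that lists services whose
# staleness is at or above the given threshold (default 2).
# The output contains literal "\n" sequences, as expected by lq "<query>".
stale_services_query() {
    printf 'GET services\\nFilter: staleness >= %s\\nColumns: host_name description staleness' "${1:-2}"
}

# Intended use on a monitoring site (not executed here):
#   lq "$(stale_services_query 3)"
```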

openmindz · May 10, 2022, 9:46pm

Thanks Andreas, much appreciated!

Thomas

jplitza · May 11, 2022, 6:44am

Thanks for the research! Since the existing OMD checks already do livestatus queries (I guess), this shouldn’t be too hard to implement.

jplitza · May 11, 2022, 6:47am

The long wait after a core restart is mostly caused by the default setting max_service_check_spread=5 in ~/etc/nagios/nagios.d/tuning.cfg. Setting it to 1 greatly reduces this staleness period.
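To check which value a site currently uses, something like the following could work. It is a sketch that assumes the setting is written as a plain key=value line; get_check_spread is a hypothetical helper name:

```shell
# Hypothetical helper: print the value of max_service_check_spread from a
# Nagios config file; prints nothing if the setting is absent.
get_check_spread() {
    sed -n 's/^[[:space:]]*max_service_check_spread[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Intended use as the site user (not executed here):
#   get_check_spread ~/etc/nagios/nagios.d/tuning.cfg
```

After changing the value, the core most likely needs a restart before the new setting takes effect.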

jplitza · May 11, 2022, 1:58pm

I ended up with this small script in local/lib/nagios/plugins/check_cmk_stale_services, which I configured as a Nagios plugin for the monitoring server itself:

#!/bin/sh

set -eu

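# Count active checks (check_type = 0) with staleness >= 1.5.
# Note: in Livestatus, host_state = 1 selects hosts that are DOWN (0 is UP);
# use host_state = 0 if only services on reachable hosts should be counted.
NUM_STALE_SERVICES="$(lq 'GET services\nStats: staleness >= 1.5\nFilter: host_state = 1\nFilter: check_type = 0')"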
WARN="${1:-10}"
CRIT="${2:-100}"

echo "${NUM_STALE_SERVICES} stale active services (warn/crit at ${WARN}/${CRIT}) | stale_services=${NUM_STALE_SERVICES};${WARN};${CRIT};0"

if [ "$NUM_STALE_SERVICES" -gt "$CRIT" ]; then
    exit 2
elif [ "$NUM_STALE_SERVICES" -gt "$WARN" ]; then
    exit 1
else
    exit 0
fi

Obviously this only works in single-site installations, but that’s fine for now. I like that I can easily check the number of stale active checks, without the number being inflated by passive checks that weren’t updated by the active Check_MK check.
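One possible hardening, sketched under the assumption that lq could at some point return empty or non-numeric output: in that case the numeric comparisons in the script fail inside the if conditions and the plugin falls through to exit 0. The hypothetical helper stale_state below maps such input to the Nagios UNKNOWN state (3) instead:

```shell
# Hypothetical helper: map a count and thresholds to a Nagios plugin state.
# Prints 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).
stale_state() {
    count="$1"
    warn="$2"
    crit="$3"
    case "$count" in
        ''|*[!0-9]*) echo 3; return ;;   # empty or non-numeric: UNKNOWN
    esac
    if [ "$count" -gt "$crit" ]; then
        echo 2
    elif [ "$count" -gt "$warn" ]; then
        echo 1
    else
        echo 0
    fi
}
```

In the plugin, the final if block could then be replaced by something like exit "$(stale_state "$NUM_STALE_SERVICES" "$WARN" "$CRIT")".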

openmindz · May 11, 2022, 4:12pm

Cool, thanks for sharing!
