Monitor number of stale services

Hi everyone,

I’d like a check that monitors the number of stale services in CRE, since I find them the most reliable indicator of performance problems. Is there a “native” way to do this, or do I have to write a custom check that queries them via the REST API? The OMD performance check only seems to report the total number of services and the number of service checks per second (and it doesn’t allow setting lower levels on the latter).

Hi @jplitza

I only have an idea that might help: you can query Livestatus information with lq. The Livestatus tables hosts and services both have a staleness column. So if you query, for example, your hosts table for the columns name and staleness:

lq "GET hosts\nColumns: name staleness"

You get a list, similar to this one (hostnames intentionally modified):

HOST1;0.95
HOST2;0.95
HOST3;0.95
HOST4;0.683333
HOST5;0.95
HOST6;0.0666667

The value on the right keeps rising each time you run the query, until it reaches 1 (or slightly above 1), and then drops again. I believe it represents the progress through a check cycle/interval. If you filter for staleness > 1 and count the matches, you should be able to check the number of stale services this way.
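The counting step could be sketched roughly as follows. This is only an illustration: count_stale is a hypothetical helper name (not part of Checkmk or OMD), and it assumes the semicolon-separated "name;staleness" output shown above.

```shell
#!/bin/sh
# count_stale is a hypothetical helper name; it is not part of Checkmk or OMD.
# It reads "name;staleness" lines on stdin and prints how many of them
# have a staleness above the given threshold.
count_stale() {
    threshold="${1:-1}"   # assumed default: more than one check interval
    awk -F';' -v t="$threshold" '
        $2 + 0 > t + 0 { n++ }   # force a numeric comparison
        END { print n + 0 }      # print 0 when nothing matched
    '
}
```

On a site it could be fed from the query above, e.g. lq "GET hosts\nColumns: name staleness" | count_stale 1.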

Obviously, one still has to write a check for this. I hope I haven’t made a mistake in my thinking and am not leading you down a completely wrong path… :slight_smile:

Perhaps someone from the forum can confirm or refute this, or has a different or better idea. In the meantime, here is the official Livestatus documentation, with lots of helpful hints and examples:

5.3.2: Retrieving status data via Livestatus

HTH,
Thomas

Your assumption is correct.

I would only change the comparison value to something like 2 or 3. In Raw Edition setups it is very common that after a core restart you have to wait 4 or 5 check cycles until all services have been checked for the first time.

So I think the approach you described is correct, apart from the small change to the staleness threshold mentioned above.
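As a rough sketch of such a query with an adjustable threshold: stale_services_query is a hypothetical helper that only prints the Livestatus query text, and the default of 2 is merely an assumption taken from the range suggested here.

```shell
#!/bin/sh
# stale_services_query is a hypothetical helper; it only prints the query text.
# The Stats header makes Livestatus return a count instead of one row per service.
stale_services_query() {
    threshold="${1:-2}"   # assumed default, following the suggestion of 2 or 3
    printf 'GET services\nStats: staleness >= %s\n' "$threshold"
}
```

Assuming lq reads the query from standard input when called without arguments, this could be run as stale_services_query 3 | lq.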

Thanks Andreas, much appreciated!

Thomas

Thanks for the research! Since the existing OMD checks already do livestatus queries (I guess), this shouldn’t be too hard to implement.

The long wait after a core restart is mostly caused by the default setting max_service_check_spread=5 in ~/etc/nagios/nagios.d/tuning.cfg. Setting it to 1 greatly shortens this staleness period.
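To see which value a site currently uses, something like the following sketch might help. get_setting is a hypothetical helper name, and it assumes the file uses plain key=value lines as Nagios configuration files normally do.

```shell
#!/bin/sh
# get_setting is a hypothetical helper: get_setting FILE KEY
# Prints the last value assigned to KEY in a key=value style config file,
# or nothing if the key is not set there.
get_setting() {
    awk -F'=' -v key="$2" '
        /^[ \t]*#/ { next }            # skip comment lines
        {
            name = $1; value = $2
            gsub(/[ \t]/, "", name)    # tolerate spaces around the "="
            gsub(/[ \t]/, "", value)
            if (name == key) last = value
        }
        END { if (last != "") print last }
    ' "$1"
}
```

For example: get_setting ~/etc/nagios/nagios.d/tuning.cfg max_service_check_spread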

I ended up with this small script in local/lib/nagios/plugins/check_cmk_stale_services, which I configured as a Nagios plugin for the monitoring server itself:

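#!/bin/sh
# Nagios plugin: count stale active service checks via Livestatus.
# Usage: check_cmk_stale_services [WARN] [CRIT]

set -eu

# Stats counts services with staleness >= 1.5. The Filter lines restrict the rows
# to host_state = 1 (0 = UP, 1 = DOWN, 2 = UNREACHABLE) and check_type = 0 (active).
NUM_STALE_SERVICES="$(lq 'GET services\nStats: staleness >= 1.5\nFilter: host_state = 1\nFilter: check_type = 0')" || {
    echo "UNKNOWN - Livestatus query failed"
    exit 3
}
WARN="${1:-10}"
CRIT="${2:-100}"

# Report UNKNOWN (3) if Livestatus did not return a plain number.
case "$NUM_STALE_SERVICES" in
    ''|*[!0-9]*) echo "UNKNOWN - unexpected Livestatus response: ${NUM_STALE_SERVICES}"; exit 3 ;;
esac

echo "${NUM_STALE_SERVICES} stale active services (warn/crit at ${WARN}/${CRIT}) | stale_services=${NUM_STALE_SERVICES};${WARN};${CRIT};0"

if [ "$NUM_STALE_SERVICES" -gt "$CRIT" ]; then
    exit 2
elif [ "$NUM_STALE_SERVICES" -gt "$WARN" ]; then
    exit 1
else
    exit 0
fi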

Obviously this only works in single-site installations, but that’s fine for now. I like that I can easily check the number of stale active checks, without the number being inflated by passive checks that weren’t updated by the active Check_MK check.

Cool, thanks for sharing! :slight_smile: