Is there a minimum service check interval for graphs?

CMK version: Checkmk Raw Edition 2.4.0p17
OS version: Alma 10

Is there a minimum “interval for service checks” setting needed for Checkmk to generate graphs correctly?

When service checks have an interval of <= 2 hours, graphs are generated.

When service checks have an interval of >= 4 hours, graphs are empty.

This seems to happen with every type of service/check.

Below are examples of this behavior with a “Check hosts with PING (ICMP Echo Request)” check. The two services shown were created just to test this issue, and both ping 127.0.0.1; the only difference is the check interval.

2-hour check interval service (graph is ok):

4-hour check interval service (graph is empty):

While the graphs are broken, according to the service stats/summary the check is working correctly even with a 4-hour check interval:
OK - 127.0.0.1 rta 0.010ms lost 0%

Yes, your service needs a check interval that is smaller than the heartbeat of the RRD files that store the data.

The problem here is that this heartbeat setting cannot be changed easily, as there is no setting available inside the GUI.

The system-internal heartbeat is 8460 seconds.

If you want to store performance data, I would recommend a check interval much smaller than that.
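The rule above can be sketched in a few lines. This is a simplified model of the heartbeat behavior, not Checkmk or rrdtool code: when the gap between two updates exceeds the heartbeat, the interval is stored as UNKNOWN, which is what shows up as an empty graph.

```python
# Minimal sketch (not Checkmk/rrdtool code) of the heartbeat rule:
# a gap between updates larger than the heartbeat produces UNKNOWN data.

HEARTBEAT = 8460  # seconds, the Checkmk default mentioned above

def interval_is_known(check_interval: int, heartbeat: int = HEARTBEAT) -> bool:
    """Data arriving every `check_interval` seconds stays valid only
    if that gap does not exceed the RRD heartbeat."""
    return check_interval <= heartbeat

print(interval_is_known(2 * 3600))  # 2-hour interval -> True, graph ok
print(interval_is_known(4 * 3600))  # 4-hour interval -> False, graph empty
```

This matches the observation in the question: 7200 s fits under the 8460 s heartbeat, 14400 s does not.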

How can I change it via the CLI?


Some checks cost money (see the Azure/AWS checks, for example), and for others it may simply not make sense to run them too frequently.

Thanks

As for changing the heartbeat, quoting myself from here:

For a bit more background about this behaviour, read this: Graphs display data for stale periods

From that thread:
CRE users can change the RRD heartbeat value of 8460 seconds inside the config file ~/etc/pnp4nagios/process_perfdata.cfg; CEE users would need to change it inside a Python file of Checkmk: ~/lib/check_mk/base/cee/rrd.py (but this would be overwritten by updates).
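For the CRE case this is a one-line edit. A sketch of the relevant setting, assuming the pnp4nagios `RRD_HEARTBEAT` option (the value 90000 is just an example; pick anything larger than your longest check interval):

```ini
# ~/etc/pnp4nagios/process_perfdata.cfg
# Raise the RRD heartbeat (default 8460 s) above the longest check interval,
# e.g. 25 hours:
RRD_HEARTBEAT = 90000
```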

Thanks @Jay2k1

In the meantime I asked ChatGPT, and I’m trying this change to the ~/lib/python3/cmk/rrd/rrd.py file:

# Dynamic heartbeat based on step
if step > 8460:
    rrd_heartbeat = int(step * 1.2)
else:
    rrd_heartbeat = 8460

This should make the heartbeat dynamic and tied to the check interval whenever the interval is > 8460 seconds.
I’m not yet sure whether it makes sense or will work.

Also, as far as I know, the heartbeat would only be set dynamically when the graph is initially created.
If a service’s check interval is later changed to a longer one, the graph issue would still be present for that service.

So maybe a static long heartbeat (24+ hours) is better,
with the advantage that it should be configurable via the process_perfdata.cfg file rather than through a code hack (which would have to be reapplied after every Checkmk update).

Attention: step is not the check interval. Step is the time between data points, in seconds, at which an RRD file expects data. You need to define an RRD creation rule inside Checkmk (Enterprise only) to get a different step value for a specific service.

For non-Enterprise installations you need to modify the process_perfdata.cfg file.

You’re right.

Once again, AI just made up an answer that doesn’t work: after that change, the behavior seems the same as before.

I’ll now try setting RRD_HEARTBEAT = 90000 (25 hours) in ~/etc/pnp4nagios/process_perfdata.cfg

This should work for all newly created RRDs.
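Existing RRDs keep the heartbeat they were created with; per the rrdtool man pages they can be adjusted afterwards with `rrdtool tune <file> --heartbeat <ds-name>:<seconds>`. A hedged sketch that only builds the command lines rather than executing them; the file path and data-source names below are assumptions, so check yours with `rrdtool info <file>.rrd` first:

```python
# Sketch: build `rrdtool tune` commands to raise the heartbeat of
# existing RRD files. Paths and data-source names are assumptions;
# inspect real files with `rrdtool info <file>.rrd` first.

def tune_command(rrd_path: str, ds_names: list[str], heartbeat: int) -> list[str]:
    """Return the argv for `rrdtool tune`, one --heartbeat flag per DS."""
    cmd = ["rrdtool", "tune", rrd_path]
    for ds in ds_names:
        cmd += ["--heartbeat", f"{ds}:{heartbeat}"]
    return cmd

# Example (hypothetical file and DS name):
print(tune_command("var/pnp4nagios/perfdata/localhost/PING_rta.rrd", ["1"], 90000))
```

The resulting list can be passed to `subprocess.run` once the path and DS names are verified.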

Since the online docs do not mention this at all, and the inline help seems to speak Klingon in that regard: what is the meaning of those percentage values?

Metrics and graphing - evaluating metrics in Checkmk quickly and easily

The percentage is not so complicated to describe.

If you keep the default of 50%, it means that at the next level of aggregation, at least 50% of the aggregated data points must contain a value.

Example: the first aggregation uses 5 steps (normally 5 × 60-second intervals) to build one aggregated value. With 50%, we need at least 3 of these 5 values to be known to get a valid aggregated value.

This article, RRDtool - rrdcreate, and the following paragraph give a little more insight into how rrdtool works.

This is from the documentation about the 50% value, known there as xff:

xff The xfiles factor defines what part of a consolidation interval may be made up from *UNKNOWN* data while the consolidated value is still regarded as known. It is given as the ratio of allowed *UNKNOWN* PDPs to the number of PDPs in the interval. Thus, it ranges from 0 to 1 (exclusive).