For hosts checked 5-minutely, some graphs show 1 minute spikes of activity and 4 minutes of silence

We have some network devices checked with SNMP every five minutes (because some of them struggled with replying minutely, I’m told).

The graphs below are from the same device. The CPU util graph is what I would expect (low resolution data), but the bandwidth chart shows spikes when the checks are happening.

The spikes seem to account for the whole five minutes of traffic, I assume because this is backed by a counter. The average figures seem correct, but obviously the maximums are five-fold wrong.

Is there a solution to this?

You need to adjust the RRD definition for these services. The RRD expects a new metric value every 60 s (the step). If no values arrive for several steps, these spikes occur in the graphs.
Adjust the step to match your 5-minute check interval.

See Performance data and graphing - Evaluating measured values in Checkmk quickly and easily for more details.

After creating a new configuration rule for the RRD definition, you have to convert the existing RRDs with cmk --convert-rrds -v.

Objection: the step only defines the minimum time between two values for the RRD. The maximum time between two values is the heartbeat.
The problem in the screenshot is something else.

It looks more like the service is checked every minute against the same counter value, which gives a traffic result of 0. After the counter is updated you get one real result, followed by four results of 0.

What version of CMK is used here?

Yes, but usually RRD sets NA values if it does not receive 50% of the steps for the next phase. The second phase tries to calculate an average, maximum and minimum over 5 values. With a check interval of 5 minutes you only get one value within the 5 steps needed for phase 2.

But this is not the problem here. If it were a consolidation problem, we would see only “unknown” values from the 5-minute resolution upward. The screenshot shows the one-minute resolution, and there is only one value every 5 minutes.

It is also stated in the post that the average values over time are correct.

It sounds more like some type of caching problem for the SNMP data.
@bmst it would be good to see the detail page of one of the interface services and also for the corresponding Check_MK service.
The CPU graph is OK because it does not calculate a difference between two values. It only outputs the value that was returned.

Without this information about the configuration of the Check_MK service it is not possible to say where the problem comes from.

So we have a “Check intervals for SNMP checks” rule set to 5 minutes.

Interface service detail looks like

check_mk service

I assume you’re looking at the check periods there? The snmp service is 5 minutely, I don’t know what influence the retry interval has though.

It does sound to me like the RRD interval should match the normal check interval, though I’m not super familiar with RRD. Why would I want minutely resolution on data I only expect every 5 minutes, after all?

Problem found. Your Check_MK service needs the same check interval as the one you configured for the SNMP services.
What you can also see here is that Checkmk is treating the data as cached data. That is a big problem, because you have a counter check working with cached data. The data is only refreshed every 5 minutes, and only then does the counter check see a difference. A minute later the check works with the same cached data, so the difference is 0 and the check result is also 0.
If you really want a 5-minute check interval on your network device, it is better to use the rule “Normal check interval for service checks” rather than the special SNMP rule. The “Normal check interval for service checks” rule also covers the Check_MK service.
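
To make the effect described above concrete, here is a minimal sketch in Python. It is not Checkmk code: the function name simulate_rates and all of its parameters are made up for illustration, and it assumes a perfectly steady traffic rate and a rate calculated as the counter difference divided by the time since the previous check.

```python
def simulate_rates(true_rate, refresh_every, check_every, duration):
    """Rates a counter check reports when it may be reading cached data.

    true_rate     -- real traffic in bytes per second (assumed constant)
    refresh_every -- seconds between real fetches of the counter
    check_every   -- seconds between runs of the check
    duration      -- total simulated time in seconds
    """
    rates = []
    cached_counter = 0      # counter value the check currently sees
    previous_counter = 0    # counter value seen by the previous check run
    for t in range(check_every, duration + 1, check_every):
        if t % refresh_every == 0:
            # Only now is the counter really fetched from the device.
            cached_counter = true_rate * t
        # The check divides by its own interval, not by the refresh interval.
        rates.append((cached_counter - previous_counter) / check_every)
        previous_counter = cached_counter
    return rates


# Check runs every minute, but the data is refreshed only every 5 minutes.
mismatched = simulate_rates(true_rate=100, refresh_every=300, check_every=60, duration=600)
# Check and data refresh both happen every 5 minutes.
matched = simulate_rates(true_rate=100, refresh_every=300, check_every=300, duration=600)

print(mismatched)  # four zeros, then one value five times too high, repeated
print(matched)     # the real rate every time
```

With mismatched intervals the average still comes out at the real rate, but the maximum is five times too high, which matches what the original graphs show.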

I’m trialing this on a handful of hosts, looks promising. It occurs to me that the down-sampling of the RRD for longer timescales has a threshold (50%?) of known data points that are required, unless I’m remembering wrong. Is the RRD file still minutely? Is my data (now only 20% present) going to disappear when it gets old?

This is not an RRD problem. Your check interval is configured incorrectly.
Please remove the “Check intervals for SNMP checks” rule. Create a “Normal check interval for service checks” rule with 5 minutes for this host. The Check_MK service should then have the same check interval as your SNMP check.
Also important: the “Cached agent data” line should be empty.

I understand that, I am now seeing the graphs I wanted. Thanks for that :slight_smile:

However, I am worried that the RRD step size in tandem with the “Percentage of points below which an interval is unknown” will result in my data being discarded instead of down sampled/aggregated as it ages. Do I need to also make rules for the RRD config in sync with the check interval?

Four out of five 60-second periods will have no data now, won’t they? That’s considerably less than 50% and will become “unknown” when it reaches the next age threshold.
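
The arithmetic behind this worry can be sketched in a few lines of Python. This is only an illustration, not RRDtool or Checkmk code: the function name consolidated_is_known is made up, the 50% threshold is the figure quoted in this thread, and the exact behaviour at the boundary is an assumption. It also assumes that every step without a fresh check result really counts as unknown, which depends on the heartbeat mentioned earlier in the thread.

```python
def consolidated_is_known(known_points, total_points, max_unknown_fraction=0.5):
    """Decide whether a consolidated value would still count as known.

    known_points         -- steps in the interval that carry a value
    total_points         -- steps consolidated into one value
    max_unknown_fraction -- share of unknown steps tolerated (assumed 50%)
    """
    unknown_fraction = (total_points - known_points) / total_points
    return unknown_fraction <= max_unknown_fraction


# One known step out of five: 80% unknown, above the assumed 50% limit.
print(consolidated_is_known(known_points=1, total_points=5))
# All five steps known: nothing unknown.
print(consolidated_is_known(known_points=5, total_points=5))
```

Whether the four steps between two checks are actually treated as unknown is exactly the open question here, so this only shows what would happen if they were.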