Get_rate problem with cluster check

Hi,
I created an extension for monitoring the rate of ISC BIND requests and Postfix mails. It stores the request counters and uses get_rate to calculate the rate (req/sec or mail/min). I have a problem when the check is a clustered check: the counter is not always updated from the parent host, and the result is simply zero.
I can demonstrate the problem with the built-in interface check, using two screenshots.

This is an interface from production, where Interface is a normal check on host vidle1:

This is a test check defined as a clustered check on host vidle (a cluster on parent host vidle1) in the testing monitoring site:

The problem with network counter processing is visible there. The check details and the perfometer show 0. On the graph, on the other hand, the effect is less visible; RRD averaging probably compensates for the “impulses” a bit.

I don’t know whether this problem affects only CRE.

The problem is that the counter updates are not synchronous with the cluster check. Two consecutive cluster checks can process the same counter value, which results in a zero rate. When the counter is finally updated, the whole increase is divided by the last check interval only, which is just 1/N of the real time between the counter updates, so the resulting rate is N times greater.
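To illustrate the effect, here is a minimal sketch in plain Python. It does not use the Checkmk API; the function name simulate and all the numbers (a check every 60 s, a counter that is refreshed only every 120 s, a true rate of 10 req/sec) are my own assumptions for the demonstration.

```python
def simulate(check_interval=60, refresh_interval=120, true_rate=10, steps=6):
    """Compute rates the naive way: counter delta divided by check-time delta.

    All parameters are hypothetical. The counter seen by the check is only
    refreshed every `refresh_interval` seconds, while the check itself runs
    every `check_interval` seconds.
    """
    rates = []
    previous = None  # (check_time, counter_value) from the previous check run
    for i in range(steps):
        check_time = i * check_interval
        # Time at which the counter was last refreshed (stale between refreshes).
        sample_time = (check_time // refresh_interval) * refresh_interval
        counter = sample_time * true_rate
        if previous is not None:
            last_time, last_counter = previous
            rates.append((counter - last_counter) / (check_time - last_time))
        previous = (check_time, counter)
    return rates


print(simulate())  # alternates between 0.0 and 20.0, never the true 10.0
```

With N = 2 (the counter is refreshed on every second check), the result alternates between 0 and twice the real rate, which is the pattern visible in the check details.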

I solved the problem for cmk-isc-bind-stat and cmk-mailperf by adding a timestamp to their sections and calculating the rate with this timestamp. If the check runs and the timestamp has not advanced, then get_rate throws an exception, and the data in the check's description and info stay unchanged (PENDING state). I hope this is the best I could do.
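The idea can be sketched roughly like this. It is plain Python that only imitates the behaviour described above, not the real get_rate; the names RateNotReady and rate_from_section and the dict used as a store are hypothetical.

```python
class RateNotReady(Exception):
    """Hypothetical stand-in for the exception raised when no rate can be computed."""


def rate_from_section(store, key, sample_time, value):
    """Compute a rate from the timestamp carried in the section itself.

    `store` is a plain dict here (an assumption for this sketch) that keeps
    the last (sample_time, value) pair between check runs.
    """
    last = store.get(key)
    if last is not None and sample_time <= last[0]:
        # Same section data as last time: refuse to compute instead of returning 0.
        raise RateNotReady("timestamp has not advanced")
    store[key] = (sample_time, value)
    if last is None:
        raise RateNotReady("first sample, no previous value")
    last_time, last_value = last
    return (value - last_value) / (sample_time - last_time)


store = {}
for sample_time, value in [(0, 0), (0, 0), (120, 1200)]:
    try:
        print(rate_from_section(store, "requests", sample_time, value))
    except RateNotReady as exc:
        print("no result:", exc)
```

Because the elapsed time comes from the section and not from the moment the cluster check runs, a repeated section gives no result at all, and the next fresh section gives the correct rate over the real interval.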
