How to monitor network interface bandwidth properly

CMK version: 2.0.0p28 CME
OS version: RHEL 8.2

I have a problem monitoring network interface bandwidth: CRITICAL alerts are raised on 10 Gbps interfaces with input and output values above their capacity.

Take the alert below as an example – it shows 44.5 GB/s although the interface speed is 10 Gbps.

Eth1/3 Admvo_281000028274, Ethernet1/30], (op. state: up, admin state: up), MAC: 3C:26:E4:2A:B1:08, Speed: 10 GBit/s, In average 15min: 44.5 GB/s (warn/crit at 875 MB/s/1.12 GB/s) (3562.53%) **CRIT**, Out average 15min: 3.29 GB/s (warn/crit at 875 MB/s/1.12 GB/s) (263.35%):
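For context on how such impossible values can appear at all: any SNMP monitor derives the rate from the difference between two octet counter samples. The sketch below is my own illustration (the function name and structure are not Checkmk code) of the wrap-safe delta calculation, and of why 32-bit counters are hopeless at 10 Gbit/s:

```python
def rate_from_counters(prev, curr, elapsed_s, counter_bits=64):
    """Bytes/s from two SNMP octet counter samples (illustrative helper).

    ifInOctets is a 32-bit counter, ifHCInOctets a 64-bit one. If the
    counter wrapped between polls, curr < prev; taking the delta modulo
    the counter size keeps the result correct across a single wrap.
    """
    modulus = 2 ** counter_bits
    delta = (curr - prev) % modulus  # wrap-safe difference
    return delta / elapsed_s

# At 10 Gbit/s (1.25e9 bytes/s) a 32-bit octet counter wraps about every
# 2**32 / 1.25e9 ≈ 3.4 s, i.e. many times per polling interval. Multiple
# wraps per interval cannot be corrected after the fact, so only the
# 64-bit HC counters give usable numbers at this speed.
wrap_seconds = 2 ** 32 / 1.25e9
```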

Obviously this is a problem because of the alerts Check_MK throws. See the values and alerts reported for the last 8 days.

The solution suggested by support, according to a page they shared, is: “In your case the first obvious thing is the increased polling interval.” Surprisingly, they did not suggest updating to the latest release as they usually do… which gives me little hope that they will solve this issue.

The “Average values for used bandwidth” setting is 15 minutes. Is that too long? Is there a recommended value?
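A back-of-the-envelope check (my own arithmetic, assuming a simple 15-minute average with 1-minute samples for illustration, using the numbers from the alert above) suggests the averaging window itself is not the root cause: for the average to read 44.5 GB/s, at least one raw sample must have been wildly impossible.

```python
# 15 min simple average with 1 min samples -> 15 slots.
window = 15
line_rate = 1.25e9        # 10 Gbit/s expressed in bytes/s
reported_avg = 44.5e9     # the averaged value from the CRIT alert

# Worst case: 14 honest samples at full line rate, one outlier.
outlier = reported_avg * window - line_rate * (window - 1)
# outlier ≈ 650 GB/s -- physically impossible on a 10 Gbit link, which
# points at a counter artifact, not at the length of the average.
```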

Does anyone know of another tool/plugin to check interface bandwidth properly?

Best regards

https://kb.checkmk.com/display/KB/Displayed+interface+bandwidth+bigger+than+interface+speed

I have more than 200k network ports across all the CMK systems supported by me or my colleagues, and no such problems.
What is the check interval configured on this device?
Are any of the following rules defined in your system?

  • Fetch intervals for SNMP sections
  • Normal check interval for service checks

Either rule would affect the ports on this network device.

Hi @JDamian

Interesting, so you’re basically running into the issue I witnessed, too:
https://forum.checkmk.com/t/re-used-bandwidth-on-specific-interface-averaged-over-5-mins/40442

Despite the fact that we have fairly large setups too, apparently
“bandwidth calculation” over “averaged time periods” may be an issue
after all, if one reads the KB article you posted:

Displayed interface bandwidth bigger than interface speed

For me, this is a “tacit” admission that there is indeed something wrong, and it’s
not “our fault”.

Thomas

Hi andreas

Are any of the following rules defined in your system?

  • Fetch intervals for SNMP sections —> No such rule is defined
  • Normal check interval for service checks —> Yes, the value configured is 10 minutes

Hello all,

We are researching this, and your help could be invaluable. I know @openmindz is already in contact with our team, thank you!

If you’re also seeing this behavior in your system and would like to help us research, please write me a message.

Hi Sara / all

We are seeing this exact issue on an old CRE 1.6 (Nagios core) server on CentOS 7 (we have enterprise licensing and a project is in progress to upgrade it to CEE 2.0). The switch being monitored is a Cisco Nexus 9K (N9K-C93180YC-FX) and it’s a 1 Gbit port. The check interval is 1 min. No other ports are showing this issue.


We thought it might be a pnp4nagios issue, so we deleted the RRD for the interface and let it regenerate, but the same issue came back.

We have an even older monitoring server running Checkmk 1.2.x (Icinga2) monitoring the same class of switch, with no sign of this issue.

I don’t know if this info is of any use to your investigation, given the age of our applications and the lack of a micro core, but we stumbled upon this thread during our own research and thought it was interesting. I will update once we have upgraded the DR for the server to CEE 2.0/CMC and see if anything changes. If there is any more info I can provide that may be of use, let me know.

cheers
Garth


Thank you @GarthH for sharing!

It could certainly be useful for the investigation. And all updates would be welcome too!

Hi @Sara

We just noticed something interesting, our devices with this issue have to be excluded from using Inline-SNMP. Could that be related?

cheers
Garth


I don’t think that this is related, but please go ahead and prove me wrong :wink:

In this thread post I provided a modified interfaces.py that can be put into the ~/local structure and
shadows the factory version of the interfaces check.
It is simply extended to log the raw rx/tx octets it got from the device and some related, calculated variables.

https://forum.checkmk.com/t/weird-spikes-on-interfaces/39480/14

Perhaps you could use it on a test instance of your Checkmk for this device?
If you have questions, please do not hesitate to contact me by private message.

My interest in this topic is to bust the myth that Checkmk does the calculation wrong.
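For readers who cannot open the linked thread: the idea behind that kind of debug logging can be sketched roughly as follows (hypothetical code, not the actual interfaces.py from the thread):

```python
import time

# Last sample per interface index, kept in memory between calls.
_last = {}

def log_raw_counters(logfile, ifindex, rx_octets, tx_octets):
    """Append the raw octet counters and their deltas to a debug log.

    With the raw values on record, a later spike can be traced back to
    either a bogus counter jump (a device bug) or a wrong rate
    calculation on the monitoring side.
    """
    now = time.time()
    prev = _last.get(ifindex)
    _last[ifindex] = (now, rx_octets, tx_octets)
    with open(logfile, "a") as fh:
        if prev is None:
            fh.write(f"{now:.0f} if{ifindex} rx={rx_octets} tx={tx_octets} first sample\n")
        else:
            t0, rx0, tx0 = prev
            fh.write(
                f"{now:.0f} if{ifindex} rx={rx_octets} tx={tx_octets} "
                f"d_rx={rx_octets - rx0} d_tx={tx_octets - tx0} dt={now - t0:.1f}s\n"
            )
```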

Thanks @jodok.glabasna. My colleague told me this morning that our issue is the known Cisco bug: Bug Search Tool

Just posting here for future reference


Thank you GarthH,

JFYI, not only Cisco but also some Juniper devices seem to have such a bug,
as mentioned in the other thread: https://forum.checkmk.com/t/weird-spikes-on-interfaces/39480/22
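Given that these are acknowledged device-side bugs, one defensive option on the monitoring side (my own sketch, not a built-in Checkmk feature) is to discard computed rates that exceed the physical line speed before they pollute any average:

```python
def plausible_rate(rate_bytes_s, speed_bits_s, tolerance=1.05):
    """Return the rate if it is physically possible, else None.

    A computed rate more than ~5% above the interface's line speed
    cannot be real traffic; treat it as a counter artifact and skip the
    sample instead of feeding it into the averaging.
    """
    if rate_bytes_s * 8 > speed_bits_s * tolerance:
        return None
    return rate_bytes_s
```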
