CheckMK reports input errors on switch interface but switch port statistics say otherwise

andy68man · December 14, 2024, 8:57pm

CMK version: 2.3.0p22

OS version: Container

Error message:
[FibreUplink1], (up), MAC: 74:8E:F8:D7:D0:40, Speed: 10 GBit/s, In: 127 kB/s (0.01%), Out: 27.6 kB/s (<0.01%), Errors in: 0.035% (warn/crit at 0.01%/0.1%)
[FibreUplink2], (up), MAC: 74:8E:F8:D7:D0:40, Speed: 10 GBit/s, In: 67.3 kB/s (0.01%), Out: 40.7 kB/s (0.01%), Errors in: 0.051% (warn/crit at 0.01%/0.1%)(!)
Service Flapping

I have a Brocade 6450 and a Brocade 7250 connected via two fibre SFP+ interfaces, and the link has been nice and stable for over a year. Recently, without any changes on my side, CheckMK started reporting input errors on the two SFP+ interfaces on the 6450. There are no errors on the 7250 on the other side.

It will error and then clear after a minute or two, repeatedly.

However, if I log into the switches and check the port statistics on both ports, they report that the number of errors and bad packets is 0.

I’ve tried changing the fiber connections, transceivers etc, restarting the switch, power cycling the switch, downgrading CheckMK a couple of versions and the errors are still appearing. But as I said, the actual switch port stats on the switch are saying they are fine.

What further tests can I do to see if these are genuine errors, and what fixes can I put in place other than just tweaking the sensitivity of the rule?

Port 1/2/3:FibreUplink1 Realtime Information
Status: Up MAC Address: 74-8e-f8-d7-d0-73
Actual Speed/Mode: 10G-full Monitor: None
Mirror: None Lock Adddress: Disable
QOS: 0 Flow Control: Enabled
Tag: Enabled Gig Port Default: Default(Neg-Full-Auto)
Trunk: 2 State: Forward
Connector: Fiber VLAN: 0
Route Only: Disabled
Port Statistic
InOctets: 16425668627 OutOctets: 4139005245
InPkts: 14877999 OutPkts: 9347772
InBroadcastPkts: 209694 OutBroadcastPkts: 31030
InMulticastPkts: 38114 OutMulticastPkts: 9528
InUnicastPkts: 14630191 OutUnicastPkts: 9307214
InBadPkts: 0 InFragments: 0
InDiscards: 0 OutErrors: 0
CRC: 0 Collisions: 0
InErrors: 0 LateCollisions: 0
InGiantPkts: 234 InShortPkts: 0
InJabber: 0 InFlowCtrlPkts: 0
OutFlowCtrlPkts: 0

Port 1/2/4:FibreUplink2 Realtime Information
Status: Up MAC Address: 74-8e-f8-d7-d0-74
Actual Speed/Mode: 10G-full Monitor: None
Mirror: None Lock Adddress: Disable
QOS: 0 Flow Control: Enabled
Tag: Enabled Gig Port Default: Default(Neg-Full-Auto)
Trunk: 2 State: Forward
Connector: Fiber VLAN: 1
Route Only: Disabled
Port Statistic
InOctets: 12765982314 OutOctets: 4981123625
InPkts: 12013380 OutPkts: 10458687
InBroadcastPkts: 31294 OutBroadcastPkts: 365060
InMulticastPkts: 29914 OutMulticastPkts: 44374
InUnicastPkts: 11952172 OutUnicastPkts: 10049253
InBadPkts: 0 InFragments: 0
InDiscards: 0 OutErrors: 0
CRC: 0 Collisions: 0
InErrors: 0 LateCollisions: 0
InGiantPkts: 192 InShortPkts: 0
InJabber: 0 InFlowCtrlPkts: 0
OutFlowCtrlPkts: 0

robin.gierse · December 27, 2024, 9:12am

Usually I try to understand the root cause, so my question would be: if you cannot pin it to a Checkmk update, is it possible there was a firmware update on the switch? That would at least point you in a direction. Also, did you check the SNMP walk output of that device? I am positive that the SNMP stack actually reports the errors, so it would come down to understanding why the switch UI differs from the SNMP output.
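For anyone who wants to check the walk output by hand, here is a minimal Python sketch that pulls the standard IF-MIB error and discard counters out of numeric `snmpwalk -On` style text. The column numbers are the standard `ifTable` ones; the function name, the input format and the sample values are assumptions for illustration, not something taken from this device.

```python
import re

# IF-MIB ifTable columns (standard OIDs under 1.3.6.1.2.1.2.2.1).
IF_TABLE = "1.3.6.1.2.1.2.2.1"
ERROR_COLUMNS = {
    "13": "ifInDiscards",
    "14": "ifInErrors",
    "19": "ifOutDiscards",
    "20": "ifOutErrors",
}

# Assumed input format: net-snmp numeric output such as
# ".1.3.6.1.2.1.2.2.1.14.67 = Counter32: 234"
LINE_RE = re.compile(r"^\.?(?P<oid>[\d.]+)\s*=\s*Counter32:\s*(?P<value>\d+)\s*$")


def error_counters(walk_text):
    """Return {(counter_name, if_index): value} for the error columns found."""
    found = {}
    prefix = IF_TABLE + "."
    for line in walk_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or not match["oid"].startswith(prefix):
            continue
        # What remains after the table prefix is "<column>.<ifIndex>".
        column, _, if_index = match["oid"][len(prefix):].partition(".")
        if column in ERROR_COLUMNS and if_index.isdigit():
            found[(ERROR_COLUMNS[column], int(if_index))] = int(match["value"])
    return found


# Hypothetical sample lines; interface indexes and values are made up.
sample = """\
.1.3.6.1.2.1.2.2.1.10.67 = Counter32: 12345
.1.3.6.1.2.1.2.2.1.14.67 = Counter32: 234
.1.3.6.1.2.1.2.2.1.14.68 = Counter32: 192
.1.3.6.1.2.1.2.2.1.20.67 = Counter32: 0
"""
for (name, if_index), value in sorted(error_counters(sample).items()):
    print(f"{name}.{if_index} = {value}")
```

Comparing two such snapshots taken a few minutes apart shows whether `ifInErrors` is really increasing, independent of what the switch UI displays.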

andy68man · December 27, 2024, 3:34pm

Hi Robin,
Thanks for responding! There have been no changes on the switch: it’s on the last available firmware and I haven’t changed the config, which is why I’m a little puzzled. The errors only seem to happen once every few days. Do you know the easiest way to find the SNMP OID from the alert?
Thanks,
Andy

andreas-doehler · December 27, 2024, 3:42pm

What you see in that output is not the complete list of possible errors.
We used an older check from @thl-cmk on some “problem” devices to see more error information.

It fetches the following error types.

Alignment errors
Deferred transmissions
Excessive collisions
FCS errors
Frame too long
Internal MAC receive errors
Internal MAC transmit errors
Late collisions
Multiple collision frames
Carrier sense errors
Single collision frames
Symbol errors
SQE test errors

That is a few more than the counters shown in your command-line output.
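The names in the list above match the columns of `dot3StatsTable` in the standard EtherLike-MIB. Assuming that is where the check reads them from (the thread does not confirm this), a small Python sketch can build the OIDs to query for one interface. The function name and the interface index used below are hypothetical.

```python
# EtherLike-MIB dot3StatsEntry. Assumption: the check reads these columns;
# the names match the list above, but the thread does not say so.
DOT3_STATS_ENTRY = "1.3.6.1.2.1.10.7.2.1"

DOT3_ERROR_COLUMNS = {
    "Alignment errors": 2,
    "FCS errors": 3,
    "Single collision frames": 4,
    "Multiple collision frames": 5,
    "SQE test errors": 6,
    "Deferred transmissions": 7,
    "Late collisions": 8,
    "Excessive collisions": 9,
    "Internal MAC transmit errors": 10,
    "Carrier sense errors": 11,
    "Frame too long": 13,
    "Internal MAC receive errors": 16,
    "Symbol errors": 18,
}


def oids_for_interface(if_index):
    """Map each error label to its full OID for the given interface index."""
    return {
        label: f"{DOT3_STATS_ENTRY}.{column}.{if_index}"
        for label, column in DOT3_ERROR_COLUMNS.items()
    }


# Hypothetical interface index 67.
for label, oid in oids_for_interface(67).items():
    print(f"{label:30} {oid}")
```

The "Frame too long" column is the one that would count frames larger than the maximum permitted frame size.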

thl-cmk · December 27, 2024, 4:16pm

That’s not quite true: on the first interface InGiantPkts: 234 and on the second InGiantPkts: 192.

andreas-doehler · December 27, 2024, 6:59pm

It did not sound like an error to me,
but the documentation makes it clear:

in-giant-pkts
A count of frames received on a particular interface that
exceed the maximum permitted frame size.

andy68man · December 27, 2024, 8:00pm

Wow I think you might be on to something guys, thanks so much!

I had enabled jumbo frames on the upstream switch where my servers are connected, but it turns out I hadn’t done it on the downstream 6450. I’ve enabled it now, so hopefully that’s fixed it. Maybe SNMP was reporting the InGiantPkts as errors?

I’ll see what happens this week, normally takes a few days to occur.

andreas-doehler · December 27, 2024, 9:01pm

Yes - the normal SNMP error counter is a combination of most of the specific error counters.
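To see how a handful of oversized frames can cross the 0.01% / 0.1% levels from the service output, here is a rough Python sketch of an error percentage computed from two counter samples. The formula (errors divided by errors plus good packets in the interval) is an assumption for illustration and not necessarily what Checkmk calculates internally. Counter wrap is ignored.

```python
def error_percent(errors_before, errors_after, packets_before, packets_after):
    """Percentage of errored frames between two counter samples.

    Assumption: errors are compared against errors + good packets seen
    in the same interval. Counter wrap-around is not handled.
    """
    errors = errors_after - errors_before
    packets = packets_after - packets_before
    total = errors + packets
    if total <= 0:
        return 0.0
    return 100.0 * errors / total


def state(percent, warn=0.01, crit=0.1):
    """Apply warn/crit levels given in percent (defaults from the service output)."""
    if percent >= crit:
        return "CRIT"
    if percent >= warn:
        return "WARN"
    return "OK"


# Made-up interval: 5 errored frames next to 9995 good packets.
pct = error_percent(0, 5, 0, 9995)
print(f"{pct:.3f}% -> {state(pct)}")
```

With so little traffic on a 10 GBit/s uplink, a short burst of giant frames in one check interval is enough to push the ratio over the warning level, which fits the service flapping described above.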

robin.gierse · December 30, 2024, 6:36am

@andy68man please do not forget to mark the chosen answer as the solution, as soon as you can verify it, so others can quickly find it. Thanks!

andy68man · January 1, 2025, 2:40pm

Thanks @robin.gierse @andreas-doehler and @thl-cmk for helping me figure this out!

Just to recap for anyone else who finds this thread: SNMP groups all the “error” type metrics together, and I didn’t realise that InGiantPkts were errors. I had already enabled jumbo frames on the upstream switch for my servers, and recently moved my server from VMware to ProxMox. For some reason, ProxMox started sending jumbo frames down the uplink, and this was what was causing the issues. Since I enabled jumbo frames on the 6450 I haven’t had an error. Thanks everyone!
