CheckMK reports input errors on switch interface but switch port statistics say otherwise

CMK version: 2.3.0p22

OS version: Container

Error message:
[FibreUplink1], (up), MAC: 74:8E:F8:D7:D0:40, Speed: 10 GBit/s, In: 127 kB/s (0.01%), Out: 27.6 kB/s (<0.01%), Errors in: 0.035% (warn/crit at 0.01%/0.1%)
[FibreUplink2], (up), MAC: 74:8E:F8:D7:D0:40, Speed: 10 GBit/s, In: 67.3 kB/s (0.01%), Out: 40.7 kB/s (0.01%), Errors in: 0.051% (warn/crit at 0.01%/0.1%)(!)
Service Flapping

I have a Brocade 6450 and a Brocade 7250 connected via two fibre SFP+ interfaces, and the link has been nice and stable for over a year. Recently, without any changes, CheckMK has started reporting input errors on the two SFP+ interfaces on the 6450. There are no errors on the 7250 on the other side.

It will error and then clear after a minute or two, repeatedly.

However, if I log into the switches and check the port statistics on both ports, they report that the error and bad-packet counts are 0.

I’ve tried changing the fiber connections, transceivers etc, restarting the switch, power cycling the switch, downgrading CheckMK a couple of versions and the errors are still appearing. But as I said, the actual switch port stats on the switch are saying they are fine.

What further tests can I do to see if these are genuine errors, and what fixes can I put in place other than just tweaking the sensitivity of the rule?

Port 1/2/3:FibreUplink1 Realtime Information
Status: Up MAC Address: 74-8e-f8-d7-d0-73
Actual Speed/Mode: 10G-full Monitor: None
Mirror: None Lock Adddress: Disable
QOS: 0 Flow Control: Enabled
Tag: Enabled Gig Port Default: Default(Neg-Full-Auto)
Trunk: 2 State: Forward
Connector: Fiber VLAN: 0
Route Only: Disabled
Port Statistic
InOctets: 16425668627 OutOctets: 4139005245
InPkts: 14877999 OutPkts: 9347772
InBroadcastPkts: 209694 OutBroadcastPkts: 31030
InMulticastPkts: 38114 OutMulticastPkts: 9528
InUnicastPkts: 14630191 OutUnicastPkts: 9307214
InBadPkts: 0 InFragments: 0
InDiscards: 0 OutErrors: 0
CRC: 0 Collisions: 0
InErrors: 0 LateCollisions: 0
InGiantPkts: 234 InShortPkts: 0
InJabber: 0 InFlowCtrlPkts: 0
OutFlowCtrlPkts: 0

Port 1/2/4:FibreUplink2 Realtime Information
Status: Up MAC Address: 74-8e-f8-d7-d0-74
Actual Speed/Mode: 10G-full Monitor: None
Mirror: None Lock Adddress: Disable
QOS: 0 Flow Control: Enabled
Tag: Enabled Gig Port Default: Default(Neg-Full-Auto)
Trunk: 2 State: Forward
Connector: Fiber VLAN: 1
Route Only: Disabled
Port Statistic
InOctets: 12765982314 OutOctets: 4981123625
InPkts: 12013380 OutPkts: 10458687
InBroadcastPkts: 31294 OutBroadcastPkts: 365060
InMulticastPkts: 29914 OutMulticastPkts: 44374
InUnicastPkts: 11952172 OutUnicastPkts: 10049253
InBadPkts: 0 InFragments: 0
InDiscards: 0 OutErrors: 0
CRC: 0 Collisions: 0
InErrors: 0 LateCollisions: 0
InGiantPkts: 192 InShortPkts: 0
InJabber: 0 InFlowCtrlPkts: 0
OutFlowCtrlPkts: 0

Usually I try to understand the root cause, so my question would be: if you cannot pin it to a Checkmk update, is it possible there was a firmware update on the switch side of things? That would at least point you in a direction. Also, did you check the SNMP walk output of that device? I am positive that the SNMP stack actually reports the errors, so it would come down to understanding why the UI differs from the SNMP output.

Hi Robin,
Thanks for responding! There have been no changes on the switch: it’s on the latest available firmware and I haven’t changed the config, which is why I’m a little puzzled. The errors only seem to be happening once every few days. Do you know the easiest way to find the SNMP OID from the alert?
Thanks,
Andy

What you see in that output is not the complete list of possible errors.
We have used an older check from @thl-cmk on some “problem” devices to see more error information.

It fetches the following error types.

Alignment errors
Deferred transmissions
Excessive collisions
FCS errors
Frame too long
Internal MAC receive errors
Internal MAC transmit errors
Late collisions
Multiple collision frames
Carrier sense errors
Single collision frames
Symbol errors
SQE test errors

That is a few more than the information shown in your command line output.

That’s not quite true: the first interface shows InGiantPkts: 234 and the second InGiantPkts: 192.
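
For anyone who wants to spot this kind of thing faster, here is a rough Python sketch that pulls the counters out of the `Port Statistic` block and lists every non-zero one that is not a plain traffic counter. The function names and the `TRAFFIC_COUNTERS` set are my own assumptions for illustration, not part of any Brocade or Checkmk tooling, and the regex only assumes the `Name: value` layout shown in the output above.

```python
import re

# Counters that only measure traffic volume. Treating everything else as
# "worth a look" is an assumption of this sketch, not a vendor definition.
TRAFFIC_COUNTERS = {
    "InOctets", "OutOctets", "InPkts", "OutPkts",
    "InBroadcastPkts", "OutBroadcastPkts",
    "InMulticastPkts", "OutMulticastPkts",
    "InUnicastPkts", "OutUnicastPkts",
}


def parse_port_statistics(text):
    """Return {counter_name: value} for everything after 'Port Statistic'."""
    _, _, stats = text.partition("Port Statistic")
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", stats)}


def suspicious_counters(stats):
    """Return the non-zero counters that are not plain traffic counters."""
    return {
        name: value
        for name, value in stats.items()
        if value != 0 and name not in TRAFFIC_COUNTERS
    }


# Statistics block of port 1/2/3, copied from the first post.
SAMPLE = """\
Port Statistic
InOctets: 16425668627 OutOctets: 4139005245
InPkts: 14877999 OutPkts: 9347772
InBroadcastPkts: 209694 OutBroadcastPkts: 31030
InMulticastPkts: 38114 OutMulticastPkts: 9528
InUnicastPkts: 14630191 OutUnicastPkts: 9307214
InBadPkts: 0 InFragments: 0
InDiscards: 0 OutErrors: 0
CRC: 0 Collisions: 0
InErrors: 0 LateCollisions: 0
InGiantPkts: 234 InShortPkts: 0
InJabber: 0 InFlowCtrlPkts: 0
OutFlowCtrlPkts: 0
"""

if __name__ == "__main__":
    print(suspicious_counters(parse_port_statistics(SAMPLE)))
    # {'InGiantPkts': 234}
```

Run against the first port’s output, the only counter it flags is `InGiantPkts`, even though `InErrors` and `InBadPkts` are both 0.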

:rofl: It did not sound like an error,
but the documentation makes it clear.

in-giant-pkts
A count of frames received on a particular interface that
exceed the maximum permitted frame size.
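
To make that definition concrete, here is a minimal sketch of the arithmetic. It assumes standard Ethernet framing (14-byte header, 4-byte FCS, optional 4-byte 802.1Q tag); the maximum frame size actually configured on the 6450 is not stated in this thread, so the 1522-byte limit below is only an assumed default, and the helper names are made up for illustration.

```python
ETH_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType
FCS_BYTES = 4          # frame check sequence
VLAN_TAG_BYTES = 4     # 802.1Q tag, present on tagged ports


def frame_size(mtu, tagged=False):
    """On-the-wire frame size for a full-sized packet with the given MTU."""
    return mtu + ETH_HEADER_BYTES + FCS_BYTES + (VLAN_TAG_BYTES if tagged else 0)


def is_giant(frame_len, max_frame_len):
    """A frame counts as a giant if it exceeds the permitted frame size."""
    return frame_len > max_frame_len


if __name__ == "__main__":
    limit = frame_size(1500, tagged=True)   # assumed limit without jumbo support
    jumbo = frame_size(9000, tagged=True)   # full-sized frame from a 9000-byte MTU host
    print(limit, jumbo, is_giant(jumbo, limit))
    # 1522 9022 True
```

So a host sending with a 9000-byte MTU produces frames far above what a port without jumbo support accepts, and each of those would be counted as a giant.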

Wow I think you might be on to something guys, thanks so much!

I had enabled jumbo frames on the upstream switch where my servers are connected, but it turns out I hadn’t done it on the downstream 6450. I’ve enabled it now, so hopefully that’s fixed it. Maybe SNMP was reporting the InGiantPkts as errors?

I’ll see what happens this week, normally takes a few days to occur.

Yes - the normal SNMP error counter is a combination of most of the specific error counters.
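
This also fits an alert that comes and goes. Below is a hedged sketch of how a percentage like “Errors in: 0.035%” could come out of two samples of the combined counter. It assumes the rate is errors divided by all packets seen in the check interval (errors included); the exact formula Checkmk uses is not shown in this thread. The function names and the sample deltas are hypothetical, and only the 0.01%/0.1% thresholds are taken from the alert in the first post.

```python
def error_percentage(errors_delta, packets_delta):
    """Share of errored packets in one check interval, in percent.

    Both arguments are counter differences between two consecutive samples.
    """
    total = errors_delta + packets_delta
    if total == 0:
        return 0.0
    return 100.0 * errors_delta / total


def state(percentage, warn=0.01, crit=0.1):
    """Map a percentage to a monitoring state using warn/crit levels."""
    if percentage >= crit:
        return "CRIT"
    if percentage >= warn:
        return "WARN"
    return "OK"


if __name__ == "__main__":
    # Hypothetical interval: 5 giant frames among 9995 good packets.
    pct = error_percentage(5, 9995)
    print(f"{pct:.3f}% -> {state(pct)}")
    # 0.050% -> WARN

    # Hypothetical quiet interval with no giants at all.
    print(state(error_percentage(0, 12000)))
    # OK
```

Under that assumption, a handful of giant frames in one interval is enough to cross the 0.01% warning level, and the service goes back to OK as soon as an interval passes without any.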

@andy68man please do not forget to mark the chosen answer as the solution, as soon as you can verify it, so others can quickly find it. Thanks!

Thanks @robin.gierse @andreas-doehler and @thl-cmk for helping me figure this out!

Just to recap for anyone else who finds this thread: SNMP groups all the “error” type metrics together, and I didn’t realise that InGiantPkts counted as errors. I had already enabled jumbo frames on the upstream switch for my servers, and recently moved my server from VMware to ProxMox. For some reason, ProxMox started sending jumbo frames down the uplink, and this was what was causing the issues. Since I enabled jumbo frames on the 6450 I haven’t had an error. Thanks everyone!
