This isn’t really about an issue in CheckMK, but an issue in two of my iDRACs (both iDRAC 9 Enterprise from PowerEdge R760 systems, no firmware updates available): the disks reported by the iDRAC 9s vanish every couple of check cycles and reappear a handful of cycles later. The timing of when they disappear & reappear isn’t deterministic.
Debugging this with tcpdump & comparing the trace of a “disks present” session with that of a “disks not present” session shows that the bulk request for “diskPhysicalName” (OID 1.3.6.1.4.1.674.10892.5.5.1.20.130.4.1.2) receives proper responses in some cases and completely unrelated responses in others.
I’ve then run a simple shell loop of snmpbulkget -v2c -c$SCOMMUNITY $HOST_IP 1.3.6.1.4.1.674.10892.5.5.1.20.130.4.1.2 | ts every 20 seconds, and here’s what I get when the disks are present:
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.2.1 = STRING: "NVMe 0"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.2.2 = STRING: "NVMe 1"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.3.1 = STRING: "SK hynix"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.3.2 = STRING: "SK hynix"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.4.1 = INTEGER: 3
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.4.2 = INTEGER: 3
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.6.1 = STRING: "Dell NVMe ISE PE9010 GEN4 RI M.2 480GB"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.6.2 = STRING: "Dell NVMe ISE PE9010 GEN4 RI M.2 480GB"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.7.1 = STRING: "AME2N0040I0703218"
Oct 04 13:02:19 SNMPv2-SMI::enterprises.674.10892.5.5.1.20.130.4.1.7.2 = STRING: "AME2N0040I0703217"
and at times the responses suddenly shift to the following for a minute or two:
Oct 04 13:02:39 SNMPv2-MIB::snmpSetSerialNo.0 = INTEGER: 667648279
Oct 04 13:02:39 SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 1F 88 04 4A 53 48 56 54 43 34
Oct 04 13:02:39 SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 15
Oct 04 13:02:39 SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 526 seconds
Oct 04 13:02:39 SNMP-FRAMEWORK-MIB::snmpEngineMaxMessageSize.0 = INTEGER: 1500
Oct 04 13:02:39 SNMP-MPD-MIB::snmpUnknownSecurityModels.0 = Counter32: 0
Oct 04 13:02:39 SNMP-MPD-MIB::snmpInvalidMsgs.0 = Counter32: 0
Oct 04 13:02:39 SNMP-MPD-MIB::snmpUnknownPDUHandlers.0 = Counter32: 0
Oct 04 13:02:39 SNMP-USER-BASED-SM-MIB::usmStatsUnsupportedSecLevels.0 = Counter32: 0
Oct 04 13:02:39 SNMP-USER-BASED-SM-MIB::usmStatsNotInTimeWindows.0 = Counter32: 0
Those are completely unrelated OIDs!? What the heck is going on here? We don’t have this type of issue with any of our other iDRACs, neither ours nor our customers’. I’ve already tried: dropping down to SNMPv1; updating the firmware (none available); rebooting the iDRACs. All to no avail.
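In case anyone wants to watch for the same thing on their own iDRACs, here is a minimal sketch of how the loop could flag the bad cycles automatically. A GETBULK request returns whatever OIDs follow the requested one lexicographically, so the check simply counts how many returned OIDs still lie under the disk table. The function names count_in_subtree and monitor are made up for illustration, and the sketch assumes numeric OID output via net-snmp's -On option.

```shell
#!/bin/sh
# Sketch only: count_in_subtree and monitor are hypothetical helper names.
# Assumes net-snmp's snmpbulkget and numeric OID output (-On), one
# "OID = TYPE: value" entry per line.

# Prefix of the physical disk table taken from the request above.
TABLE=1.3.6.1.4.1.674.10892.5.5.1.20.130.4

# Read snmpbulkget output on stdin and print how many OIDs lie under $1.
count_in_subtree() {
    awk -v p="$1" '
        { oid = $1; sub(/^\./, "", oid) }          # drop the leading dot
        oid == p || index(oid, p ".") == 1 { n++ } # exact match or child OID
        END { print n + 0 }
    '
}

# Poll every 20 seconds and log how many responses are disk table entries.
# Uses the same SCOMMUNITY and HOST_IP variables as the loop above.
monitor() {
    while :; do
        out=$(snmpbulkget -v2c -c"$SCOMMUNITY" -On "$HOST_IP" "$TABLE.1.2")
        n=$(printf '%s\n' "$out" | count_in_subtree "$TABLE")
        echo "$(date '+%b %d %H:%M:%S') disk-table-entries=$n"
        sleep 20
    done
}

# Start polling by calling: monitor
```

A cycle that logs disk-table-entries=0 would correspond to the second kind of response shown above, where none of the returned OIDs belong to the disk table.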
Has anyone encountered this, too? And if so, are there workarounds or even fixes?