HPE Integrated Lights-Out (iLO) 5 3.00 breaks storage monitoring on SNMP Management Board

CMK version:
2.2.0p14
OS version:
RHEL8

Error message:
None, the services for all storage devices disappear.

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
I guess the important part is:
Management Interface: HW Phydrv 0/1 Bay: 4, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/2 Bay: 3, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/3 Bay: 2, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 240056795136MB, Condition: other
Management Interface: HW Phydrv 0/4 Bay: 1, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 240056795136MB, Condition: other
Management Interface: HW Phydrv 0/5 Bay: 5, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/6 Bay: 6, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/7 Bay: 7, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: Logical Device 1 Status: other(?), Logical volume size: 224 GiB
Management Interface: Logical Device 2 Status: other(?), Logical volume size: 6.99 TiB

We flashed the iLO firmware to Integrated Lights-Out 5 3.00 (Dec 14 2023), from Integrated Lights-Out 5 2.97.

I’m guessing something has changed in the SNMP output in version 3.00. I’m just looking for the exact OIDs to check…

Running an snmpbulkwalk on the OID shows:

.1.3.6.1.4.1.232.3.2.5.1.1.37.0.1 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.2 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.3 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.4 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.5 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.6 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.7 = 1

Which maps to:

_MAP_CONDITION = {
    "0": "n/a",
    "1": "other",
    "2": "ok",
    "3": "degraded",
    "4": "failed",
}

So all the drives are in an “other” state…
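
For anyone who wants to decode a walk like the one above without reading the table by hand, here is a minimal Python sketch. It assumes the walk output uses the plain `OID = value` form shown above and that the last two OID components are the controller and drive index. The function name `decode_walk` and the constant `CONDITION_COLUMN` are made up for illustration and are not part of Checkmk.

```python
# Sketch only: decode "OID = value" lines from the walk above.
# Assumption: the last two OID components are controller index and drive index.
MAP_CONDITION = {
    "0": "n/a",
    "1": "other",
    "2": "ok",
    "3": "degraded",
    "4": "failed",
}

# Column prefix taken from the walk output in this thread.
CONDITION_COLUMN = ".1.3.6.1.4.1.232.3.2.5.1.1.37"


def decode_walk(lines):
    """Return {(controller, drive): condition_name} for lines in that column."""
    result = {}
    for line in lines:
        oid, sep, value = line.partition(" = ")
        if not sep or not oid.startswith(CONDITION_COLUMN + "."):
            continue  # not an "OID = value" line for this column
        index = oid[len(CONDITION_COLUMN) + 1:].split(".")
        if len(index) != 2:
            continue  # unexpected index layout, skip rather than guess
        value = value.strip()
        result[tuple(index)] = MAP_CONDITION.get(value, "unknown(%s)" % value)
    return result


if __name__ == "__main__":
    walk = [
        ".1.3.6.1.4.1.232.3.2.5.1.1.37.0.1 = 1",
        ".1.3.6.1.4.1.232.3.2.5.1.1.37.0.2 = 1",
    ]
    for key, condition in sorted(decode_walk(walk).items()):
        print(key, condition)
```

Values outside the map are reported as `unknown(...)` instead of being dropped, so a changed firmware behaviour would at least be visible.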


But the iLO web interface says they’re all fine…

HP have acknowledged this in this notice: Advisory: HPE Integrated Lights-Out 5 (iLO 5) - Some MIB Entries in Table cpqDaCntrlTable, cpqDaLogDrvTable and cpqDaPhyDrvTable Return Status “other(1)” Instead of “ok(2)” Upon SNMP Query with iLO5 Firmware v3.00 (or Later)

Hi Stephen,

I think this was fixed here:

It is not only affecting SNMP but also all other types of management data retrieval:
Redfish, IPMI and so on.
Also, 3.01 is not a fixed version; it has the same problems.
A downgrade to 2.97 got the system into a working state again.
A side effect of this problem is that your iLO can become unresponsive and can then only be restarted over SSH.

That seems to remedy a different issue…

I upgraded the servers to 2.2.0p21 and reran discovery with these results:

It looks to have gotten worse! Two random logical devices called \x00 and \x01 now exist in the monitoring data, and the drive status doesn’t seem to have changed from “Condition: other”.

Yes, the fix is only for invisible characters in the name string, not for the problem with all storage data.

Yes, we cannot downgrade to 2.97 as our security team wanted 3.00 for security fixes…

Having security patches is more important than monitoring I fear.

Thanks for the heads-up on the unresponsiveness issue!

You can say to your security guys - should the server work or not? :smiley:

The bigger problem is that the issue happens randomly.
I had 3 servers of the same model with the same configuration, and only one ended up with a broken iLO after the upgrade.

The release notes show no security-related items for 3.00 and 3.01.

I have it broken on 21 separate HP servers so far…

Luckily it’s not the production environment!

I guess the security guys didn’t do their homework and just wanted the latest version of the iLO firmware… They’re also pretty hard-headed and won’t back down on keeping it on the latest version :confounded:

I know the problem and feel for you.

Ah, I haven’t seen that yet.

Same issue here with ILO 3.0.1 and CMK 2.2.0p21. I wanted to test the monitoring of the “new” Broadcom MegaRaid controllers via SNMP, which was not possible in earlier ILO versions. Before ILO 3.0, I hadn’t seen the logical device service, which is now visible, but only in the “UNKN” state.

Same problem, ILO5 version 3.0.1 Blade BL20 Gen 10 and CMK 2.2.0 p12

Same problem, ILO5 version 3.0.1 HPE DL380Gen10+ and CMK 2.2.0 p18

With ILO5 firmware versions prior to 3.0.0, MR416 array controllers & attached logical and physical drives cannot be monitored through SNMP.

With firmware versions 3.0.0 and later, the info is picked up by SNMP but all statuses are set to “other” instead of “OK”.

According to HPE this behavior is to be expected??
https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00137679en_us

The quality of their firmware is questionable if it breaks monitoring of vital hardware…

Yeah, this is a real pain in the *** that HPE have caused here…

They say it’s now expected behaviour, and I don’t think enough customers are kicking and screaming loud enough to cause them to back down on this.

So we’re stuck with two possible courses of action:

  1. Sit and wait to see if HPE back down and reverse this behaviour in future firmware.
  2. Modify CheckMK monitoring code to show OK if SNMP status is in an unknown state when using HP firmware that exhibits this behaviour

The second option has lots of potential pitfalls though. What status will HPE iLO return if there is a hardware failure? What if HP do change things in a future firmware? Etc.
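
To make the second option concrete, here is a rough Python sketch of what such an override could look like. It is not the actual Checkmk plugin code: the function `condition_to_state`, the `treat_other_as_ok` switch, the state chosen for each condition and the firmware version check are all hypothetical. It uses the usual monitoring state numbers (0 = OK, 1 = WARN, 2 = CRIT, 3 = UNKNOWN).

```python
# Hypothetical sketch, not the real Checkmk check.
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3

# Assumption: which monitoring state each condition maps to is a local choice.
STATE_BY_CONDITION = {
    "ok": OK,
    "degraded": WARN,
    "failed": CRIT,
    "other": UNKNOWN,
    "n/a": UNKNOWN,
}


def firmware_is_affected(version):
    """Assumption: iLO 5 firmware 3.00 and later reports 'other' for healthy drives."""
    major = int(version.split(".")[0])
    return major >= 3


def condition_to_state(condition, ilo_version, treat_other_as_ok=False):
    """Map a decoded condition to a state, optionally forcing 'other' to OK."""
    if (
        condition == "other"
        and treat_other_as_ok
        and firmware_is_affected(ilo_version)
    ):
        return OK
    # Anything not in the table stays UNKNOWN rather than silently OK.
    return STATE_BY_CONDITION.get(condition, UNKNOWN)


if __name__ == "__main__":
    print(condition_to_state("other", "3.00", treat_other_as_ok=True))
    print(condition_to_state("other", "2.97", treat_other_as_ok=True))
    print(condition_to_state("failed", "3.01", treat_other_as_ok=True))
```

The override is off by default and only applies to the affected firmware, so "degraded" and "failed" still alert. It cannot help if the iLO also reports "other" for a drive that has really failed.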

Not sure there’s an easy or right course of action here :confused:

I think everyone running into this problem should open a support case with HPE and complain about that problem to raise awareness for the bulls… they are doing there.


Such a useless company… :confounded:

Small indie company :rofl: