HPE Integrated Lights-Out (ILO) 5 3.0 breaks storage monitoring on SNMP Management Board

solarisfire · February 9, 2024, 4:08pm

CMK version:
2.2.0p14
OS version:
RHEL8

Error message:
None, the services for all storage devices disappear.

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)
I guess the important part is:
Management Interface: HW Phydrv 0/1 Bay: 4, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/2 Bay: 3, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/3 Bay: 2, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 240056795136MB, Condition: other
Management Interface: HW Phydrv 0/4 Bay: 1, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 240056795136MB, Condition: other
Management Interface: HW Phydrv 0/5 Bay: 5, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/6 Bay: 6, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: HW Phydrv 0/7 Bay: 7, Bus number: -1, Status: ok, Smart status: ok, Ref hours: 0, Size: 1920383057920MB, Condition: other
Management Interface: Logical Device 1 Status: other(?), Logical volume size: 224 GiB
Management Interface: Logical Device 2 Status: other(?), Logical volume size: 6.99 TiB

We flashed the iLO firmware to Integrated Lights-Out 5 3.00 (Dec 14 2023), from Integrated Lights-Out 5 2.97.

I’m guessing something has changed in the SNMP output in version 3.00, I’m just looking for the exact OIDs to check…

solarisfire · February 9, 2024, 4:18pm

Running an snmpbulkwalk on the OID shows:

.1.3.6.1.4.1.232.3.2.5.1.1.37.0.1 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.2 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.3 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.4 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.5 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.6 = 1
.1.3.6.1.4.1.232.3.2.5.1.1.37.0.7 = 1

Which maps to:

_MAP_CONDITION = {
    "0": "n/a",
    "1": "other",
    "2": "ok",
    "3": "degraded",
    "4": "failed",
}

So all the drives are in an “other” state…
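
To make the mapping concrete, here is a minimal Python sketch that decodes walk lines like the ones above. Only the `_MAP_CONDITION` table comes from this thread. The helper name `decode_condition_walk` and the parsing of the `OID = value` text format are assumptions for illustration, not the actual Checkmk plugin code. It also assumes the last two OID components are the controller and drive index, which matches item names such as "HW Phydrv 0/1".

```python
# Illustrative only: decode_condition_walk is a hypothetical helper,
# not part of Checkmk. The map is the one quoted in this thread.
_MAP_CONDITION = {
    "0": "n/a",
    "1": "other",
    "2": "ok",
    "3": "degraded",
    "4": "failed",
}

# OID prefix seen in the snmpbulkwalk output above.
CONDITION_OID_PREFIX = ".1.3.6.1.4.1.232.3.2.5.1.1.37."


def decode_condition_walk(lines):
    """Return {"controller/drive": condition} for each 'OID = value' line."""
    result = {}
    for line in lines:
        oid, sep, value = line.partition("=")
        oid, value = oid.strip(), value.strip()
        if not sep or not oid.startswith(CONDITION_OID_PREFIX):
            continue  # skip anything that is not a condition entry
        index = oid[len(CONDITION_OID_PREFIX):]  # e.g. "0.1"
        # Assumption: index is "<controller>.<drive>".
        controller, _, drive = index.partition(".")
        result[f"{controller}/{drive}"] = _MAP_CONDITION.get(
            value, f"unknown({value})"
        )
    return result


walk = [
    ".1.3.6.1.4.1.232.3.2.5.1.1.37.0.1 = 1",
    ".1.3.6.1.4.1.232.3.2.5.1.1.37.0.2 = 1",
]
print(decode_condition_walk(walk))
```

Every value in the walk is 1, so every drive decodes to "other", which is what the check output shows as "Condition: other".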

solarisfire · February 9, 2024, 4:19pm

But the iLO web interface says they’re all fine…

solarisfire · February 9, 2024, 4:32pm

HP have acknowledged this in this notice: Advisory: HPE Integrated Lights-Out 5 (iLO 5) - Some MIB Entries in Table cpqDaCntrlTable, cpqDaLogDrvTable and cpqDaPhyDrvTable Return Status “other(1)” Instead of “ok(2)” Upon SNMP Query with iLO5 Firmware v3.00 (or Later)

aeckstein · February 9, 2024, 5:27pm

Hi Stephen,

I think this was fixed here:

andreas-doehler · February 9, 2024, 7:58pm

It affects not only SNMP but all other types of management data retrieval as well:
Redfish, IPMI and so on.
Also 3.01 is not a fixed version - same problems there.
Downgrade to 2.97 got the system in a working state again.
A side effect of this problem is that your iLO can become unresponsive and can then only be restarted over SSH.

solarisfire · February 9, 2024, 7:58pm

That seems to remedy a different issue…

I upgraded the servers to 2.2.0p21 and reran discovery with these results:

It looks to have gotten worse! Two random logical devices called \x00 and \x01 now exist in the monitoring data, and the drive status doesn’t seem to have changed from “Condition: other”.

andreas-doehler · February 9, 2024, 7:59pm

Yes, the fix is only for invisible characters in the name string, not for the problem with all storage data.

solarisfire · February 9, 2024, 8:00pm

Yes, we cannot downgrade to 2.97 as our security team wanted 3.00 for security fixes…

Having security patches is more important than monitoring I fear.

Thanks for the heads up on the unresponsiveness issue!

andreas-doehler · February 9, 2024, 8:07pm

You can say to your security guys - should the server work or not?

The bigger problem is that the issue happens randomly.
I had 3 servers, same model and same configuration, and only one of them had a broken iLO after the upgrade.

The release notes show no security-related items for 3.00 and 3.01.

solarisfire · February 9, 2024, 8:14pm

I have it broken on 21 separate HP servers so far…

Luckily it’s not the production environment!

I guess the security guys didn’t do their homework and just wanted the latest version of the iLO firmware… They’re also pretty hard-headed and won’t back down on keeping it on the latest version

andreas-doehler · February 9, 2024, 8:18pm

I know the problem and feel for you.

aeckstein · February 11, 2024, 2:13pm

Ah, I haven’t seen that yet.

degan · February 12, 2024, 10:42am

Same issue here with ILO 3.0.1 and CMK 2.2.0p21. I wanted to test the monitoring of the “new” Broadcom MegaRaid controllers via SNMP, which was not possible in earlier ILO versions. Before ILO 3.0, I hadn’t seen the logical device service, which is now visible, but only in the “UNKN” state.

toso72 · February 15, 2024, 10:55am

Same problem, ILO5 version 3.0.1 Blade BL20 Gen 10 and CMK 2.2.0 p12

BartD · February 23, 2024, 8:19am

Same problem, ILO5 version 3.0.1 HPE DL380Gen10+ and CMK 2.2.0 p18

With ILO5 firmware versions prior to 3.0.0, MR416 array controllers & attached logical and physical drives cannot be monitored through SNMP.

With firmware versions 3.0.0 and later, the info is picked up by SNMP but all statuses are set to “other” instead of “OK”.

According to HPE this behavior is to be expected??
https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00137679en_us

The quality of their firmware is questionable if it breaks monitoring of vital hardware…

solarisfire · February 23, 2024, 10:47am

Yeah, this is a real pain in the *** that HPE have caused here…

They say it’s now expected behaviour, and I don’t think enough customers are kicking and screaming loud enough to cause them to back down on this.

So we’re stuck with two possible courses of action:

1. Sit and wait to see if HPE back down and reverse this behaviour in future firmware.
2. Modify CheckMK monitoring code to show OK if SNMP status is in an unknown state when using HP firmware that exhibits this behaviour.

The second option has lots of potential pitfalls though. What status will HPE iLO return if there is a hardware failure? What if HP do change things in a future firmware? Etc.
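
For illustration, the second option could look something like the Python sketch below. This is not the Checkmk plugin code: `condition_to_state` and the `tolerate_other` switch are hypothetical names, and falling back to the "Status" and "Smart status" fields from the check output is an assumption. Nothing in this thread confirms that those fields still change on a real failure with the affected firmware, which is exactly the pitfall mentioned above.

```python
# Hypothetical sketch of the workaround, not the real Checkmk plugin.
# Monitoring states follow the usual 0/1/2/3 convention.
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3


def condition_to_state(condition, status, smart_status, tolerate_other=False):
    """Map a drive condition to a monitoring state.

    With tolerate_other=True, "other" is treated as OK, but only if the
    separately reported status and SMART status are both "ok".
    Assumption: those two fields remain trustworthy on affected firmware.
    """
    if condition == "ok":
        return OK
    if condition == "degraded":
        return WARN
    if condition == "failed":
        return CRIT
    if (
        condition == "other"
        and tolerate_other
        and status == "ok"
        and smart_status == "ok"
    ):
        return OK
    # Anything else (including "other" without the opt-in) stays UNKNOWN.
    return UNKNOWN


# Values taken from the check output earlier in the thread.
print(condition_to_state("other", "ok", "ok"))
print(condition_to_state("other", "ok", "ok", tolerate_other=True))
```

Keeping the tolerance behind an explicit opt-in means hosts on unaffected firmware are still treated strictly, and the workaround can be switched off again if HPE changes the behaviour.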

Not sure there’s an easy or right course of action here

aeckstein · February 23, 2024, 11:48am

I think everyone running into this problem should open a support case with HPE and complain about that problem to raise awareness for the bulls… they are doing there.

solarisfire · February 23, 2024, 1:31pm

Such a useless company…

Glowsome · February 24, 2024, 12:03am

Small indie company