IBM Chassis Power Module Check (blade_powerfan) Controller Status False Critical?

CMK version: 2.4.0p21.cme
OS version: RHEL 9

Error message:

Admittedly, I don’t remember whether these checks were present before migrating from 2.3.0p38.cme to 2.4.0p21.cme, so I can’t tell whether they were OK before or simply not detected. After the upgrade I ran a bulk discovery to handle multiple plugin updates.

The issue is that the blade_powerfan check returns Critical for the Power Module Cooling Devices in an older IBM chassis.

Output of “cmk --debug -vvn hostname”: (If it is a problem with checks or plugins)

Power Module Cooling Device 1 Speed: 59.00%, RPM: 5973.0, Controller state: not OK(!!)
Power Module Cooling Device 2 Speed: 59.00%, RPM: 5888.0, Controller state: not OK(!!)
Power Module Cooling Device 3 Speed: 59.00%, RPM: 5888.0, Controller state: not OK(!!)
Power Module Cooling Device 4 Speed: 60.00%, RPM: 5994.0, Controller state: not OK(!!)

Digging around, I located firmware that ships with a MIB file, and found what I believe is the MIB object behind the ‘Controller state’.

       fanPackControllerState    OBJECT-TYPE
                  SYNTAX  INTEGER {
                     operational(0),
                     flashing(1),
                     notPresent(2),
                     communicationError(3),
                     unknown(255)
                  }
                  ACCESS  read-only
                  STATUS  mandatory
                  DESCRIPTION
                  "The health state for the controller for the fan pack.
                  0 = operational, 1 = flashing in progress, 2 = not present, 3 = communication error,
                  255 = unknown"
                  ::= { fanPackEntry 7}

I can query it via SNMPv3 and get what looks like an ‘operational’ (0) response for all four devices.

.1.3.6.1.4.1.2.3.51.2.2.6.1.1.7.1 0
.1.3.6.1.4.1.2.3.51.2.2.6.1.1.7.2 0
.1.3.6.1.4.1.2.3.51.2.2.6.1.1.7.3 0
.1.3.6.1.4.1.2.3.51.2.2.6.1.1.7.4 0
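To make the comparison explicit, here is a small sketch that decodes those raw values against the enumeration from the MIB. The mapping is copied from the fanPackControllerState definition above. The `walk` dictionary and the `decode_controller_state` helper are hypothetical names for illustration, and I am assuming the last OID component corresponds to the cooling device number.

```python
# Enumeration copied from the fanPackControllerState definition in the MIB.
FAN_PACK_CONTROLLER_STATE = {
    0: "operational",
    1: "flashing",
    2: "notPresent",
    3: "communicationError",
    255: "unknown",
}

# Raw values from the SNMP query, keyed by the last OID component.
# Assumption: that component is the cooling device number.
walk = {1: 0, 2: 0, 3: 0, 4: 0}


def decode_controller_state(raw: int) -> str:
    """Return the MIB name for a raw value, or a marker for undefined values."""
    return FAN_PACK_CONTROLLER_STATE.get(raw, f"undefined({raw})")


for index, raw in walk.items():
    print(f"Power Module Cooling Device {index}: {decode_controller_state(raw)}")
```

Read this way, all four devices report the state the MIB calls operational, which does not match the "not OK" in the check output.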

Since this is an inventoried check, I can’t override the status, so in the short term I have disabled blade_powerfan for several chassis instead.

Is this a check maintained by Checkmk or something inherited from Nagios? Any thoughts on a way to correct it, or should I leave it disabled?

Sincerely,
Scotsie

The check plugin assumes the state as OK when there is a 1 at these OIDs:

@r.sander thanks for taking a look and the code line.

Looking at the legacy check, it appears to have behaved the same way. I’m not familiar with other products that use this check, but does this look like something that should be updated in the check based on the MIB description, or are there vendor variations that might explain it?

I could change the behavior locally and just remember to do it each update but I know I’ll likely forget every time.

Sincerely,

Scotsie

Are there different versions of the MIB for different versions of this device’s firmware?

Sometimes manufacturers put completely different things on the same OID.

But in this case I tend to blame the implementor (@moritz ) of this check plugin for not looking at the MIB. It should be 0 instead of 1 for OK. :smirking_face:

I’ve looked back through several releases of firmware for the IBM Chassis and it’s consistent but, as you say, I don’t have other devices for comparison to see if the vendor fiddled.

As a short term fix, I updated the values in the cmk code locally, cleared the pycache and did an omd restart.

    # fanPackControllerState: 0 = operational per the MIB, so treat "0" as OK
    yield Result(
        state=State.OK if fan.ctrlstate == "0" else State.CRIT,
        notice=f"Controller state: {'' if fan.ctrlstate == '0' else 'not '}OK",
    )
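For anyone who wants to sanity-check that logic outside a site, below is a minimal, self-contained sketch of the same comparison. It is not the plugin code: `State` here is a local stand-in for the class the plugin imports, and `Fan` and `controller_result` are hypothetical names I made up for the example. I am also assuming `ctrlstate` arrives as a string, as the comparison above suggests.

```python
from enum import Enum
from typing import NamedTuple


class State(Enum):
    """Local stand-in for the monitoring states used by the plugin."""
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


class Fan(NamedTuple):
    """Hypothetical container holding only the field used here."""
    ctrlstate: str  # raw fanPackControllerState value as a string


def controller_result(fan: Fan) -> tuple[State, str]:
    """Treat "0" (operational per the MIB) as OK and everything else as CRIT."""
    is_ok = fan.ctrlstate == "0"
    state = State.OK if is_ok else State.CRIT
    notice = f"Controller state: {'' if is_ok else 'not '}OK"
    return state, notice


print(controller_result(Fan(ctrlstate="0")))
print(controller_result(Fan(ctrlstate="3")))
```

This keeps the original two-way split (OK or CRIT). Whether values such as 1 (flashing) deserve something softer than CRIT is a separate question I have not tried to answer.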

I hope these older IBMs get phased out in the near future so it’s not the end of the world but I did want to bring it to someone’s attention.

@r.sander thank you, as always, for your time and attention.

Sincerely,

Scotsie
