Aruba CX Temperature hits Warning and Critical by default

Hi,

After updating to CheckMK 2.4.0p9 (from 2.4.0p6), Aruba CX switches offer new temperature services to be monitored (Werk #17934).

Almost all switches immediately hit the warning or even critical threshold after activation.
Looking at the thresholds shown on the service, these are actually the min/max values that the switch reports for that sensor. The lower threshold is therefore reached most of the time; the reading only stays below it if the current temperature has not yet been recorded in the min field.
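
To illustrate why this trips almost immediately, here is a minimal sketch. It assumes the check uses the sensor's recorded minimum as the warning level and the recorded maximum as the critical level, which is only my reading of the service output and not confirmed from the plugin source. The function name and the sample values are made up.

```python
def state_from_levels(current: float, warn: float, crit: float) -> str:
    """Plain upper-level check: CRIT at or above crit, WARN at or above warn."""
    if current >= crit:
        return "CRIT"
    if current >= warn:
        return "WARN"
    return "OK"


# Hypothetical sensor history: lowest value seen so far and highest value seen so far.
recorded_min, recorded_max = 41.0, 58.0

for reading in (40.5, 47.0, 58.0):
    # Assumption: min is used as warn, max is used as crit.
    print(reading, state_from_levels(reading, warn=recorded_min, crit=recorded_max))
```

Any reading between the recorded min and max is WARN by construction, and a reading that matches the recorded max is CRIT, so a sensor is only OK while it sits below its own recorded minimum.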

This does not make that much sense. Is this how it was meant to be used, or was the idea that the administrator sets thresholds for each kind of sensor on their own, overriding the defaults?

Maybe “prefer user levels over device levels” is a hint towards this?

I would like to hear some thoughts on how this is meant to be used, or whether it is rather a bug. Any advice on reasonable default thresholds is much appreciated as well.

By the way: The werk does not do this for CX6100 switches, but all CX switches that we use.

From my point of view, it would make more sense to read the device status supplied in the details of the service and tie this status to the service state.

Aruba states are: normal, max, low_critical, critical, fault, emergency
See Aruba CLI: show environment temperature
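
A rough sketch of what I mean, tying the service state to the status string the switch reports. The status names are the ones listed above; which monitoring state each one maps to is purely my assumption, and the function and constant names are hypothetical, not taken from the Checkmk plugin.

```python
# Monitoring states as plain integers (0=OK, 1=WARN, 2=CRIT, 3=UNKNOWN).
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3

# Assumed severity per Aruba status; adjust to taste.
ARUBA_STATUS_TO_STATE = {
    "normal": OK,
    "max": WARN,
    "low_critical": CRIT,
    "critical": CRIT,
    "fault": CRIT,
    "emergency": CRIT,
}


def state_from_device_status(status: str) -> int:
    """Translate the device's own verdict into a service state.

    Unknown status strings map to UNKNOWN instead of silently passing as OK.
    """
    return ARUBA_STATUS_TO_STATE.get(status.strip().lower(), UNKNOWN)


print(state_from_device_status("normal"))
print(state_from_device_status("low_critical"))
```

With something like this, numeric levels would only matter if the administrator sets them explicitly.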

Speaking generally, not about Aruba in particular: limits are supposed to model normal working conditions for that specific device. Some customers have most of their switches in temperature-controlled data centers, but some might be located in a small closet with a window facing south, meaning that what constitutes a “normal working condition” depends not just on the device type and regular load, but also on the environment. On top of that, you have to take into account the intended operating environment parameters that the manufacturer provides.

What I’m trying to say is: factory defaults can never be a one size fits all set of limits. You often have to adjust limits, sometimes for whole device classes, sometimes for specific locations, sometimes for singular devices.

One of my favorite examples for this is that the generic SNMP “temperature” support covers different device types, e.g. case temperature sensors, hot spot temperature sensors, or spinning hard drives. All have very different normal working conditions.

Well, from a monitoring perspective, it is more about “what will decrease the lifetime of the device or cause a failure very soon”.
It is very hard to guess when a certain component of the device is getting too hot, and the critical temperature of a component is independent of the overall environment.

Also, there are several sensors, e.g. for the interfaces, the memory, the ASIC and many more. The ASIC runs at around 80 degrees Celsius all of the time, which seems to be normal, but it is quite hard to find out at which temperature this starts to be harmful.

As mentioned in my last comment, the output includes the device’s own judgment, and that should be used for notifications, while any manual setting should only be seen as an additional measure.
Unfortunately, device manufacturers do not always publish the temperature ranges for specific components, but rather specify an operating temperature for the environment.

I’m setting ours to “Ignore device's own levels”, which entirely disables the temperature level check so that only the device status is used.
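
Roughly how I picture the effect of that setting, as a sketch only. This is not Checkmk's code; the mode names, function names and numbers are made up for illustration, and the "worst of both" combination at the end is my assumption.

```python
from typing import Optional, Tuple

Levels = Optional[Tuple[float, float]]  # (warn, crit), or None if not set


def effective_levels(mode: str, user: Levels, device: Levels) -> Levels:
    """Pick which upper levels to apply. Mode names are hypothetical."""
    if mode == "ignore_device":
        return user
    if mode == "prefer_user":
        return user if user is not None else device
    if mode == "prefer_device":
        return device if device is not None else user
    raise ValueError(f"unknown mode: {mode}")


def service_state(reading: float, levels: Levels, device_status_state: int) -> int:
    """Worst of the level check (if any levels apply) and the device's own status."""
    level_state = 0
    if levels is not None:
        warn, crit = levels
        if reading >= crit:
            level_state = 2
        elif reading >= warn:
            level_state = 1
    return max(level_state, device_status_state)


device_levels = (41.0, 58.0)  # the min/max values reported by the switch
levels = effective_levels("ignore_device", user=None, device=device_levels)
print(service_state(80.0, levels, device_status_state=0))
```

With no user levels configured and the device levels ignored, no numeric levels are left, so the result is whatever the device status says.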

LibreNMS handles setting thresholds by taking the temperature (or other value) reading when the device is added and multiplying it by a factor (can’t figure out what it is exactly, but it appears to be about 25% above and below where it started). Though this wouldn’t work easily with how CheckMK stores the limits (would have to be individual rules per service per host).
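
As a sketch of that baseline idea (not LibreNMS code; the 25% factor is just my estimate from above and the function name is made up):

```python
from typing import Tuple


def baseline_levels(first_reading: float, factor: float = 0.25) -> Tuple[float, float]:
    """Derive (lower, upper) limits from the first reading seen for a sensor.

    The margin is a fixed fraction of the first reading; 0.25 is an assumption.
    """
    margin = abs(first_reading) * factor
    return first_reading - margin, first_reading + margin


# A sensor that read 40.0 when the device was added.
print(baseline_levels(40.0))
```

Each sensor ends up with its own pair of limits, which is why this would need one rule per service per host in CheckMK.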

That sounds like the opposite of what I’d like to achieve.
Ideally, CheckMK would read or “know” the temperature thresholds the manufacturer defined and use those for notifications. Of course, it may make sense to go below those to extend lifetime and so on, but we aren’t there at the moment.

The Werk seems to simply use the last MIN/MAX values, which is kind of bad.

Just found this one in 2.4.0p10 today:

So I think the issue with the values has at least been recognized. Not sure in which direction this change is heading now. I’d guess that the min/max values are simply no longer used as warning/critical thresholds. It sounds like they’ve set explicit default values instead. It would be pretty nice if those reflected the manufacturer’s values.