Granular configuration of UCS Blademanager "Fault Instances Blade" Monitoring

Hey all!

I’m working to set up monitoring of a UCS Blade environment and I’m running into trouble with the number of faults that stack up under the “Fault Instances Blade” service. I’m hoping there may be a way to clean this up so it’s more usable.

Since Cisco follows the mantra of not clearing a fault unless it’s actually been fixed/resolved, we’ve got numerous faults in our environment that we don’t need notifications or alerting for, primarily on powered-off blades. I’d like to be able to break the “Fault Instances Blade” service apart so it monitors on the specific condition criteria UCS Blademanager provides, or split it up based on a regex within the larger collection of faults (e.g. CMOS faults, local disk missing faults, etc.).
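To give a concrete idea of the kind of split I’m picturing, something along the lines of the grouping below is what I’d love to end up with. The group names and regex patterns here are purely examples of my own, not anything UCS Manager or Checkmk ships with:

import re

# Hypothetical fault groups -- the names and patterns are examples only and
# would need tuning against the fault descriptions UCS actually reports.
FAULT_GROUPS = [
    ("CMOS", re.compile(r"cmos", re.IGNORECASE)),
    ("Local disk", re.compile(r"local disk", re.IGNORECASE)),
    ("Powered-off blade", re.compile(r"power(ed)?[ -]?(down|off)", re.IGNORECASE)),
]

def group_faults(fault_descriptions):
    """Sort fault description strings into named buckets.

    Each bucket could then become its own service (or its own notification
    rule), so CMOS noise, missing-disk faults, etc. can be handled separately
    from the faults we actually care about.
    """
    groups = {name: [] for name, _ in FAULT_GROUPS}
    groups["Other"] = []
    for description in fault_descriptions:
        for name, pattern in FAULT_GROUPS:
            if pattern.search(description):
                groups[name].append(description)
                break
        else:
            groups["Other"].append(description)
    return groups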

Has anyone accomplished anything like this, or found a way to make the display of alerts from Blademanager much cleaner?

Thanks in advance!

I am also interested in this.

So I’ve slowly made some progress in figuring out how to make this happen, but I’m only maybe 10% of the way there.

The script that’s run to do these checks is located in /opt/omd/versions/(version number)/share/check_mk/checks/ucs_bladecenter_faultinst. I dropped a copy of the script (per a recommendation elsewhere) into /omd/sites/(site name)/local/share/check_mk/checks/ so I could poke around with it without an upgrade overwriting the work. I was able to toy with what it reports based on severity levels (which are read from the “UCS Bladecenter Fault instances” rule), but parsing the data so it’s reported on a per-line basis, or split into separate services, is still a work in progress.
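To show roughly where I’m trying to get to with the per-line idea, here’s a stripped-down sketch of the check logic I have in mind. The severity names, the params handling, and the state mapping are my own guesses based on how the rule appears to feed values into the script, not a copy of the real check:

# Assumed severity-to-level mapping; in the real check these values appear to
# come from the "UCS Bladecenter Fault instances" rule, so treat this as a
# stand-in, not the actual defaults.
DEFAULT_SEVERITY_LEVELS = {
    "critical": 3,
    "major": 2,
    "minor": 1,
    "warning": 1,
    "info": 0,
    "condition": 0,
    "cleared": 0,
}

def check_fault_instances(params, faults):
    """Yield one (state, text) result per fault instead of one summary line.

    faults is assumed to be a list of (severity, description) tuples pulled
    out of the agent section; params (from the rule) overrides the defaults.
    """
    severity_levels = dict(DEFAULT_SEVERITY_LEVELS)
    severity_levels.update(params or {})
    for severity, description in faults:
        sev_state = severity_levels.get(severity.lower(), 3)
        # Map the UCS severity level onto a Checkmk state:
        # level 3 -> CRIT, level 2 -> WARN, everything else -> OK.
        if sev_state >= 3:
            yield 2, "%s: %s" % (severity, description)
        elif sev_state >= 2:
            yield 1, "%s: %s" % (severity, description)
        else:
            yield 0, "%s: %s" % (severity, description)

The real file is structured quite differently, so treat this as a target rather than a drop-in replacement.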

Maybe someone else with some more Python experience can look closer at it and bring some insight into how to make this work.

OK, doing that and adjusting the line
if sev_state >= 0:
so it compares against 3 (critical) at least lessens the spam coming from there. It still summarizes the lesser alerts.
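For anyone wanting to try the same thing, the change in my local copy boils down to raising that threshold. In simplified form (the variable names here are illustrative, not the exact ones in the file):

def report_critical_faults(faults, severity_levels):
    for severity, description in faults:
        sev_state = severity_levels.get(severity, 3)
        # The stock line reads "if sev_state >= 0:", i.e. every fault instance
        # gets reported; raising the threshold to 3 limits the per-fault output
        # to whatever the script rates as critical.
        if sev_state >= 3:
            yield 2, "%s: %s" % (severity, description)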