Physical Disk health - HDD/SSD

marbaa · December 14, 2021, 1:29pm

Hi,

I’m monitoring hundreds of Dell servers (DSS9600, R740XD2, etc.) and I’m interested in monitoring the health of their disks, both HDD and SSD. The servers mostly use MegaRAID controllers. There are some official plugins for MegaRAID, and the BBU check works perfectly.

But the megaraid_pdisks check is useless, imo. Looking at the plugin’s source code, it doesn’t report anything about the state of the disks; it is a purely informational check. Example from our server:

Each server has 24 HDDs, and they are failing like apples in our environment. The only statuses this check produces are OK, or UNKN when a disk is unplugged from its slot. It doesn’t look at SMART data or anything else.

So, how is everybody monitoring the health of disks in production servers that sit behind a RAID controller such as MegaRAID, Adaptec, etc.?

tosch · December 14, 2021, 1:51pm

Hi @marbaa,

maybe take a look at IPMI monitoring for the Dell/EMC R740.
In my standard SNMP monitoring of the management board I get a much more detailed disk state for the servers:

BeatONE · December 14, 2021, 2:39pm

Hi,
I monitor the hard drives of my Dell servers via iDRAC using SNMP.
This works without any problems and also shows everything you want to see.

Wouldn’t that be an option for you?

andreas-doehler · December 14, 2021, 2:50pm

This would be the best option, as @tosch also mentioned in his post. I think there is another problem here, though. The original post states:

The original Dell controllers are not MegaRAID, are they? If these are third-party controllers, then you won’t see them over iDRAC.

BeatONE · December 14, 2021, 3:47pm

Right… I hadn’t thought about that. Dell RAID controllers are the PERC (PowerEdge RAID Controller)…

marbaa · December 14, 2021, 4:18pm

SNMP is not used in our environment, so I would definitely need to get UDP ports 161 and 162 allowed on the firewalls. I don’t have any general experience with SNMP; is it reliable?
Is a bi-directional connection needed for SNMP, or is a one-way connection from Checkmk → server enough?

Well, I know that PERC are Dell controllers. We have some servers with PERC H730P and H840. I don’t know if MegaRAID is original Dell or not. Example of lshw command output:

        *-raid
             description: RAID bus controller
             product: MegaRAID SAS-3 3108 [Invader]
             vendor: Broadcom / LSI
             physical id: 0
             bus info: pci@0000:d8:00.0
             logical name: scsi15
             version: 02
             width: 64 bits
             clock: 33MHz
             capabilities: raid pm pciexpress msi msix bus_master cap_list
             configuration: driver=megaraid_sas latency=0
             resources: irq:130 ioport:e000(size=256) memory:ee900000-ee90ffff memory:ee800000-ee8fffff
           *-disk:0
                description: SCSI Disk
                product: PERC H730P Adp
                vendor: DELL
                physical id: 2.0.0
                bus info: scsi@15:2.0.0
                logical name: /dev/sda
                version: 4.30
                serial: 
                size: 931GiB (999GB)
...
           *-disk:2
                description: SCSI Disk
                product: PERC H730P Adp
                vendor: DELL
                physical id: 2.a.0
                bus info: scsi@15:2.10.0
                logical name: /dev/sdk
                version: 4.30
                serial: 
                size: 3725GiB (4TB)
                configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
           *-disk:3
                description: SCSI Disk
                product: PERC H730P Adp
                vendor: DELL
                physical id: 2.b.0
                bus info: scsi@15:2.11.0
                logical name: /dev/sdl
                version: 4.30
                serial: 
                size: 3725GiB (4TB)
                configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512

Other example:

        *-raid
             description: RAID bus controller
             product: MegaRAID SAS-3 3108 [Invader]
             vendor: LSI Logic / Symbios Logic
             physical id: 0
             bus info: pci@0000:5e:00.0
             logical name: scsi0
             version: 02
             width: 64 bits
             clock: 33MHz
             capabilities: raid pm pciexpress vpd msi msix bus_master cap_list
             configuration: driver=megaraid_sas latency=0
             resources: irq:32 ioport:8000(size=256) memory:b8900000-b890ffff memory:b8800000-b88fffff
           *-enclosure:0 UNCLAIMED
                description: SCSI Enclosure
                product: G5
                vendor: DELL
                physical id: 0.8.0
                bus info: scsi@0:0.8.0
                version: 0e00
                configuration: ansiversion=5
           *-disk:0
                description: SCSI Disk
                product: ST4000NM0135
                vendor: SEAGATE
                physical id: 0.9.0
                bus info: scsi@0:0.9.0
                logical name: /dev/sda
                version: DSF2
                serial: 
                size: 3726GiB (4TB)
                capabilities: 7200rpm
                configuration: ansiversion=6 logicalsectorsize=512 sectorsize=512
           *-disk:1
                description: SCSI Disk
                product: ST4000NM0135
                vendor: SEAGATE
                physical id: 0.12.0
                bus info: scsi@0:0.18.0
                logical name: /dev/sdj
                version: DSF2
                serial: 
                size: 3726GiB (4TB)
                capabilities: 7200rpm
                configuration: ansiversion=6 logicalsectorsize=512 sectorsize=512

MatthewStier · December 14, 2021, 10:24pm

I have my own scripts to build smartctl config files, and my own smartctl Check_MK plugin.
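
For readers looking for a starting point, here is a minimal sketch of that idea. It is not the plugin mentioned above, just one possible shape for a Checkmk local check. It assumes smartmontools is installed, that the script runs as root from the agent's local check directory, and that the disks are reachable through smartctl's `-d megaraid,N` device type. The block device, the device ID range, and the service names are placeholders to adapt to your own servers.

```python
#!/usr/bin/env python3
"""Sketch: Checkmk local check for SMART health of disks behind a MegaRAID controller."""
import shutil
import subprocess

# Assumptions, adjust for your hardware:
BLOCK_DEVICE = "/dev/sda"   # any block device exposed by the controller
DEVICE_IDS = range(0, 24)   # MegaRAID device IDs to probe (placeholder range)

ATA_PREFIX = "SMART overall-health self-assessment test result:"
SCSI_PREFIX = "SMART Health Status:"


def parse_health(output):
    """Return 'OK', 'FAILED', or None if no health line is present."""
    for line in output.splitlines():
        if line.startswith(SCSI_PREFIX):
            # SAS/SCSI disks report e.g. "SMART Health Status: OK"
            value = line[len(SCSI_PREFIX):].strip()
            return "OK" if value == "OK" else "FAILED"
        if line.startswith(ATA_PREFIX):
            # SATA disks report e.g. "... test result: PASSED"
            value = line[len(ATA_PREFIX):].strip()
            return "OK" if value == "PASSED" else "FAILED"
    return None


def local_check_line(device_id, health):
    """Format one line in the local check format: state, name, perfdata, text."""
    state = {"OK": 0, "FAILED": 2}.get(health, 3)  # 3 = UNKNOWN
    text = health if health else "no SMART health line found"
    return f"{state} SMART_megaraid_{device_id} - {text}"


def main():
    if shutil.which("smartctl") is None:
        return  # smartmontools not installed, emit nothing
    for device_id in DEVICE_IDS:
        cmd = ["smartctl", "-H", "-d", f"megaraid,{device_id}", BLOCK_DEVICE]
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired):
            print(local_check_line(device_id, None))
            continue
        print(local_check_line(device_id, parse_health(result.stdout)))


if __name__ == "__main__":
    main()
```

Empty slots in the probed range show up as UNKNOWN services in this sketch; a real plugin would more likely discover the populated device IDs first and look at individual SMART attributes as well as the overall health flag.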

andreas-doehler · December 15, 2021, 6:37am

The output of the lshw command is not very helpful in this case. Both controllers use the same chipset with different firmware versions. You can check whether lspci gives more information. If these are normal Dell controllers (which I think they are, as Dell also uses LSI chipsets), then I would not mess around with megaraid or smartctl checks. Only use the iDRAC, nothing else. It is the only reliable source of information.

That would be very strange. SNMP is used in many ways not visible to the outside.

If your monitoring server sits inside the out-of-band management network for your iDRAC/iLO controllers, then there should be no firewall involved.

It depends on the definition of “reliable”. It has been the default/standard network management protocol for over 20 years now.

marbaa · December 15, 2021, 7:18am

Our environment is pretty complex: a main instance in Germany, then a few satellites around the country. The monitored servers are in the USA, UAE, Czech Republic, Finland, etc., over MPLS, with every location in a different network and inside VPN tunnels. Some servers are so stuck on-prem that iptables DNAT is done via two servers to get traffic from them to their corresponding satellite. So there are a lot of firewalls in the way.

Let me check the lspci output. I vaguely remember that even PERCs show up as MegaRAID cards, but I will check.

edit:
Output from one server, two controllers inside:

5e:00.0 RAID bus controller: Broadcom / LSI MegaRAID Tri-Mode SAS3508 (rev 01)
        Subsystem: Dell PERC H840 Adapter
d8:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
        Subsystem: Dell PERC H730P Adapter

Other server:

5e:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
        Subsystem: LSI Logic / Symbios Logic Device 9381
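
With hundreds of servers, the same comparison can be scripted. The following is a minimal sketch that pairs each "RAID bus controller" line with its "Subsystem" line and flags Dell-branded PERC cards. The function name and the returned dictionary keys are made up for illustration, and it assumes the text has the same layout as the output shown above.

```python
def classify_controllers(lspci_text):
    """Pair each RAID controller line with its Subsystem line.

    Returns a list of dicts with hypothetical keys: slot, chipset,
    subsystem, dell_perc.
    """
    controllers = []
    current = None
    for line in lspci_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Device lines start at column 0, detail lines are indented.
            current = None
            if "RAID bus controller:" in line:
                slot = line.split()[0]
                chipset = line.split("RAID bus controller:", 1)[1].strip()
                current = {
                    "slot": slot,
                    "chipset": chipset,
                    "subsystem": None,
                    "dell_perc": False,
                }
                controllers.append(current)
        elif current is not None and line.strip().startswith("Subsystem:"):
            subsystem = line.split("Subsystem:", 1)[1].strip()
            current["subsystem"] = subsystem
            current["dell_perc"] = "Dell" in subsystem and "PERC" in subsystem
    return controllers


# Sample input copied from the two servers above.
SERVER_A = """\
5e:00.0 RAID bus controller: Broadcom / LSI MegaRAID Tri-Mode SAS3508 (rev 01)
        Subsystem: Dell PERC H840 Adapter
d8:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
        Subsystem: Dell PERC H730P Adapter
"""
SERVER_B = """\
5e:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
        Subsystem: LSI Logic / Symbios Logic Device 9381
"""

for name, text in (("server A", SERVER_A), ("server B", SERVER_B)):
    for ctrl in classify_controllers(text):
        kind = "Dell PERC" if ctrl["dell_perc"] else "not Dell-branded"
        print(f"{name} {ctrl['slot']}: {ctrl['subsystem']} ({kind})")
```

Going only by the Subsystem strings, the first server carries two Dell PERC adapters on Broadcom/LSI chipsets, while the card in the second server does not identify itself as a Dell part. Whether iDRAC reports the disks behind the latter would still need to be checked on the machine itself.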

marbaa · January 3, 2022, 9:09am

I made some progress: both directions need to be allowed on the firewalls. I also found this somewhere in the documentation.
Next question: is it possible to combine the output of both data sources into one host in Checkmk? Example:

Checkmk host named “test_server” → this has standard service monitoring via check_mk_agent.
Checkmk host named “test_server_hw” → this has services via SNMP.

I would like to avoid having “duplicated” hosts in Checkmk.
No idea what the additional IPv4 field is for, just testing it right now:

marbaa · January 3, 2022, 10:22am

I got it. Just add a regular host to Checkmk that is checked via check_mk_agent, and at the bottom of the host’s WATO properties page add the iDRAC IP and credentials. Now I have combined output with the services and all hardware checks. So simple.
