I’m monitoring hundreds of Dell servers (DSS9600, R740XD2, etc.) and I’m interested in monitoring the health of their disks, both HDD and SSD. The servers mostly use MegaRAID controllers. There are some official plugins for MegaRAID, and the check for the BBU works perfectly.
But the megaraid_pdisks check is useless, in my opinion. Looking at the plugin’s source code, it doesn’t report anything about the state of the disks; it is a purely informational check. Example from our server:
Each server has 24 HDDs, and they fail frequently in our environment. The only status this check produces is OK, or UNKN when a disk is unplugged from its slot. It doesn’t look at SMART data or anything else.
So, how does everybody monitor the health of disks in production servers that sit behind a RAID controller such as MegaRAID, Adaptec, etc.?
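For reference, SMART data for disks behind a MegaRAID controller can usually be read with smartctl’s `-d megaraid,N` device type, where N is the controller’s device ID for the disk. The following is only a minimal sketch, not an official plugin. It assumes smartmontools is installed and that `/dev/sda` is a block device exposed by the controller; the function names and the service name `SMART_megaraid_N` are made up for illustration. The output line follows the Checkmk local check format (state, service name, metrics, text).

```python
import subprocess

# Health lines printed by `smartctl -H`: the first form appears for ATA/SATA
# disks, the second for SCSI/SAS disks.
ATA_PREFIX = "SMART overall-health self-assessment test result:"
SCSI_PREFIX = "SMART Health Status:"


def parse_smart_health(output: str) -> int:
    """Map `smartctl -H` output to a Checkmk state: 0 OK, 2 CRIT, 3 UNKNOWN."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(ATA_PREFIX):
            return 0 if line[len(ATA_PREFIX):].strip() == "PASSED" else 2
        if line.startswith(SCSI_PREFIX):
            return 0 if line[len(SCSI_PREFIX):].strip() == "OK" else 2
    return 3  # no health line found, e.g. empty slot or unsupported disk


def local_check_line(device_id: int, output: str) -> str:
    """Format one local check line: <state> <service name> <metrics> <text>."""
    state = parse_smart_health(output)
    text = {
        0: "SMART health OK",
        2: "SMART health FAILED",
        3: "no SMART health data",
    }[state]
    # "-" means "no metrics"; the service name is a hypothetical choice.
    return f"{state} SMART_megaraid_{device_id} - {text}"


def check_disk(device_id: int, block_device: str = "/dev/sda") -> str:
    """Run smartctl for one disk behind the controller (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-H", "-d", f"megaraid,{device_id}", block_device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to signal disk problems
    )
    return local_check_line(device_id, result.stdout)
```

A wrapper placed in the agent’s local check directory would call `check_disk` once per device ID and print each line. The device IDs differ per controller, so they have to be looked up rather than assumed.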
Maybe take a look at IPMI monitoring for the Dell/EMC R740.
With my standard SNMP monitoring of the management board, I get a much more detailed disk state for the servers:
SNMP is not used in our environment, so I would definitely need to get UDP ports 161 and 162 allowed on the firewalls. I have no real experience with SNMP at all. Is it reliable?
Does SNMP need a bi-directional connection, or is a one-way connection from Checkmk → server enough?
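One way to test the path through the firewalls is a single SNMP GET sent from the monitoring server. For polling, the monitoring server sends requests to UDP 161 on the device, and the replies travel back from port 161 to the sender’s source port, so return traffic must be able to pass (a stateful firewall usually allows it automatically, while a stateless filter or a NAT chain needs an explicit rule). UDP 162 is used in the other direction, for traps sent by the device. The sketch below hand-builds an SNMPv2c GetRequest for sysDescr.0 so that it needs no extra library. The community `public`, the function names, and the timeout are assumptions for illustration.

```python
import socket

# BER-encoded OID 1.3.6.1.2.1.1.1.0 (sysDescr.0)
SYS_DESCR_OID = bytes.fromhex("2b06010201010100")


def tlv(tag: int, payload: bytes) -> bytes:
    """Encode one BER tag-length-value item (short-form length only)."""
    if len(payload) > 127:
        raise ValueError("payload too long for this minimal encoder")
    return bytes([tag, len(payload)]) + payload


def ber_int(value: int) -> bytes:
    """Encode a non-negative INTEGER in minimal big-endian form."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    return tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def build_get_request(community: str = "public", request_id: int = 1) -> bytes:
    """Build an SNMPv2c GetRequest for sysDescr.0."""
    varbind = tlv(0x30, tlv(0x06, SYS_DESCR_OID) + tlv(0x05, b""))
    pdu = tlv(
        0xA0,  # GetRequest-PDU
        # request-id, error-status, error-index, variable bindings
        ber_int(request_id) + ber_int(0) + ber_int(0) + tlv(0x30, varbind),
    )
    # A version field of 1 means SNMPv2c.
    return tlv(0x30, ber_int(1) + tlv(0x04, community.encode("ascii")) + pdu)


def snmp_reachable(host: str, community: str = "public", timeout: float = 3.0) -> bool:
    """Return True if an SNMP-looking reply arrives from host:161."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_get_request(community), (host, 161))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    return reply[:1] == b"\x30"  # SNMP messages start with a SEQUENCE tag
```

Note that a `False` result does not prove a firewall problem: an agent typically stays silent when the community string is wrong, which looks the same as a dropped packet.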
Well, I know that PERC controllers are Dell controllers. We have some servers with PERC H730P and H840. I don’t know whether the MegaRAID ones are original Dell parts or not. Example of lshw command output:
The output of the lshw command is not very helpful in this case. Both controllers use the same chipset with different firmware versions. You can check whether lspci gives more information. If these are normal Dell controllers (which I think they are, as those also use LSI chipsets), then I would not mess around with megaraid or smartctl checks. Use only the iDRAC, nothing else. It is the only reliable source of information.
That would be very strange. SNMP is used in many ways not visible to the outside.
If your monitoring server sits inside the out-of-band management network for your iDRAC/iLO controllers, then there should be no firewall involved.
It depends on the definition of “reliable”. It has been the default/standard network management protocol for over 20 years now.
Our environment is pretty complex: a main instance in Germany, then a few satellites around the country. The monitored servers are in the USA, UAE, Czech Republic, Finland, etc., connected over MPLS, with every location in a different network and inside VPN tunnels. Some servers are so deep on-prem that iptables DNAT goes through two servers to get traffic from them to the corresponding satellite. So there are a lot of firewalls in the way.
Let me check the lspci output. I vaguely remember that even PERCs show up as MegaRAID cards, but I will check.
edit:
Output from one server, two controllers inside:
5e:00.0 RAID bus controller: Broadcom / LSI MegaRAID Tri-Mode SAS3508 (rev 01)
Subsystem: Dell PERC H840 Adapter
d8:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
Subsystem: Dell PERC H730P Adapter
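The lspci output above shows why the chip name alone is misleading: both cards report a Broadcom / LSI MegaRAID chip, and only the Subsystem line identifies them as Dell PERC adapters. A small sketch of pairing the two lines in a script follows. It assumes text in the format shown above (as printed by `lspci -v`); the function names are hypothetical.

```python
def parse_raid_controllers(lspci_output: str) -> list[dict[str, str]]:
    """Collect RAID controllers and their Subsystem lines from lspci text."""
    controllers: list[dict[str, str]] = []
    current = None  # the RAID controller that detail lines belong to
    for raw in lspci_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Subsystem:"):
            if current is not None:
                current["subsystem"] = line.split(":", 1)[1].strip()
        elif "RAID bus controller:" in line:
            slot, rest = line.split(" ", 1)
            current = {
                "slot": slot,
                "chip": rest.split(":", 1)[1].strip(),
                "subsystem": "",
            }
            controllers.append(current)
        elif not raw[0].isspace():
            # An unindented line for some other device ends the current block.
            current = None
    return controllers


def is_dell_perc(controller: dict[str, str]) -> bool:
    """True if the Subsystem line identifies a Dell PERC adapter."""
    return "Dell PERC" in controller["subsystem"]
```

For the two controllers above, both entries would be flagged as Dell PERC, which matches the advice to monitor them through the iDRAC.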
I made some progress: traffic has to be allowed in both directions on the firewalls. I also found this somewhere in the documentation.
Next question: is it possible to combine the output of both data sources into one host in Checkmk? Example:
Checkmk host named “test_server” → this has the standard services monitored via check_mk_agent.
Checkmk host named “test_server_hw” → this has services via SNMP.
I would like to avoid having “duplicated” hosts in Checkmk.
I have no idea what the additional IPv4 field is for; I am just testing it right now:
I got it. Just add a regular host to Checkmk, monitored via check_mk_agent, and at the bottom of the host’s WATO properties page add the iDRAC IP and credentials. Now I have combined output with the services and all hardware checks. So simple.
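The same setup can probably be scripted instead of clicked. The sketch below only builds the request for the host creation endpoint of the Checkmk REST API and does not send it. It assumes a Checkmk version that ships the REST API (the thread itself used the WATO GUI). The attribute names `management_protocol`, `management_address` and `management_snmp_community`, the shape of the community value, and every host name, address and site name are assumptions to verify against the API documentation of your own site.

```python
import json


def build_host_request(
    server: str,
    site: str,
    host_name: str,
    ip_address: str,
    idrac_address: str,
    community: str,
    folder: str = "/",
) -> tuple[str, str]:
    """Return (url, json_body) for creating one host with a management board."""
    url = (
        f"https://{server}/{site}/check_mk/api/1.0"
        "/domain-types/host_config/collections/all"
    )
    body = {
        "folder": folder,
        "host_name": host_name,
        "attributes": {
            # Address used for the regular check_mk_agent connection.
            "ipaddress": ip_address,
            # Management board (iDRAC) queried via SNMP on the same host.
            # Attribute names and value shape are assumptions, see above.
            "management_protocol": "snmp",
            "management_address": idrac_address,
            "management_snmp_community": {
                "type": "v1_v2_community",
                "community": community,
            },
        },
    }
    return url, json.dumps(body)
```

The returned URL and body would then be sent as an authenticated POST with a JSON content type, followed by a service discovery and activation of the changes.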