Plugin SMART: Service-State flapping for SSDs due to Pending / Reallocated Sectors

CMK version: Checkmk Raw Edition 2.1.0p16

Plugin version: Plugin “smart” 2.1.0p16

OS version: Debian 11 “Bullseye” 5.10.0-23-amd64 SMP Debian 5.10.179-1 (2023-05-12) x86_64

This question was originally posted in the German Checkmk forum: https://forum.checkmk.com/t/plugin-smart-service-state-flappt-aufgrund-pending-reallocated-sectors-bei-ssd

Dear Checkmk Community,

To monitor the SSDs of our Debian servers, we use the Checkmk plugin “SMART”. In general this works well, and the SMART data of the SSDs shows up in Checkmk monitoring.

However, the service state sometimes flaps from OK to CRIT due to “Pending Sectors” or “Reallocated Sectors”. This happens when, for example, the current value “Pending Sectors: 1” is greater than the value “Pending Sectors: 0” recorded during discovery of the service, so the service is shown as CRIT in monitoring.

Occasionally more than one sector is pending, but the SSD is not damaged. In the next monitoring cycle, the pending or reallocated sector count goes back to 0 and the service state is OK again.
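
To make the behavior concrete, here is a minimal Python sketch of how we understand the comparison. It is not taken from the plugin source; the function name `sector_state` and the sample readings are made up for illustration.

```python
# Illustration only: our understanding of the check logic, NOT the plugin's code.
OK, CRIT = 0, 2  # Checkmk state codes: 0 = OK, 2 = CRIT


def sector_state(current: int, discovered: int) -> int:
    """Return CRIT as soon as the counter exceeds the value seen at discovery."""
    return CRIT if current > discovered else OK


# Hypothetical readings over five check cycles; the discovered value was 0.
readings = [0, 1, 0, 2, 0]
states = [sector_state(value, discovered=0) for value in readings]
print(states)
```

With these made-up readings the state alternates between OK and CRIT on every cycle, which is the flapping we see.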

What we have tried without any success:

  • We found no way to set different WARN/CRIT thresholds for these sector counts via Checkmk rules. The plugin would probably have to be extended with such a feature.

  • We can put the service into a soft CRIT state, but this only affects notifications and not the event history dashboard, so the flapping service still spams our event history.
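
To illustrate what we are missing in the first point, this is roughly the kind of evaluation we would like to configure. It is only a sketch of the desired behavior, not an existing Checkmk feature; the function name `sector_state_with_levels` and the levels 5 and 10 are hypothetical.

```python
# Sketch of the threshold behavior we would like; hypothetical, not a Checkmk API.
OK, WARN, CRIT = 0, 1, 2  # Checkmk state codes


def sector_state_with_levels(current: int, warn: int, crit: int) -> int:
    """Map a sector counter to a state using fixed upper levels."""
    if current >= crit:
        return CRIT
    if current >= warn:
        return WARN
    return OK


# Example levels (assumed values): WARN at 5 sectors, CRIT at 10 sectors.
for value in (0, 1, 5, 12):
    print(value, sector_state_with_levels(value, warn=5, crit=10))
```

With levels like these, a short-lived “Pending Sectors: 1” would stay OK instead of going straight to CRIT.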

Questions:

  • Are we misunderstanding the SMART plugin? Is it intended for our use case?

  • What else can we do to quiet the flapping service state?

  • Is there an alternative way to obtain the SMART parameters? (SNMP polling is not an option for us due to design restrictions.)

  • How do you use the SMART plugin?

  • How do you monitor SSD wear?

Thanks for your help!

There was a reply to this question in the German forum, saying that the plugin, or rather the check, does not work correctly and will never show any useful output.
According to that reply, the output of smartctl is quite difficult to parse and doesn’t make sense for SSDs.

Does anyone here share this opinion?

How do you monitor the wear of SSDs with Checkmk?

Thanks for your answers!