SSDs and SMART Data
SMART stands for Self-Monitoring, Analysis and Reporting Technology. As the name suggests, it is a tool that records the health data of a hard disk drive (HDD) or a solid state drive (SSD). SMART data can provide an early warning that a drive is having problems or has reached the end of its useful life, so the drive can be replaced before it fails.
Easy reporting to a user or system administrator is important because the drive itself is not capable of an in-depth analysis of the data. The drive can only report whether certain attributes have crossed predetermined thresholds, and then only if thresholds have been programmed in firmware.
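The limited check a drive can perform can be sketched in Python. This is a minimal illustration, not actual drive firmware: the SmartAttribute class and the threshold_status function are hypothetical names, and the comparison assumes the common ATA convention in which a normalized value at or below its threshold counts as failed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmartAttribute:
    """Hypothetical container for one SMART attribute reading."""
    attr_id: int
    value: int                 # current normalized value reported by the drive
    threshold: Optional[int]   # None when firmware defines no threshold


def threshold_status(attr: SmartAttribute) -> str:
    """Return a coarse status based only on the programmed threshold."""
    if attr.threshold is None:
        # Without a threshold the drive cannot flag anything;
        # the attribute is informational only.
        return "no threshold"
    # Assumed convention: normalized values count down toward the threshold.
    return "FAILED" if attr.value <= attr.threshold else "ok"
```

Anything beyond this pass/fail comparison, such as spotting a trend over time, has to be done by software on the host.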
SMART has been around for many years and predates the birth of the SSD. SMART was originally implemented on HDDs and was adapted for SSDs when this new technology was invented as a drop-in replacement for spinning hard drives. Unfortunately, regardless of the storage technology, there is no industry-wide standard to tell you which numbered SMART attribute describes a specific physical property of a drive. The attribute descriptors will vary between SSD and HDD, and even between the various SSD vendors.
Several third-party utilities, often available as freeware or shareware, can retrieve and report a drive’s SMART data. However, unless the software vendor consults with the drive vendor on the correct SMART attributes, their definitions, and their thresholds (when applicable), the utility is likely to mislabel attributes, which can lead to false-positive or false-negative failure reports.
We have seen examples where the “Power-on Hours Count” attribute reports a number to the utility, and the incompatible utility may incorrectly label that number “Program Fail Count” or “Reported Uncorrectable Errors.” Worse, the third-party utility may have failure thresholds which are inappropriate for the SSD in question, so the SMART utility reports a failure at a point the manufacturer knows to be acceptable operation.
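The mislabeling problem comes down to two lookup tables that disagree about the same attribute number. The sketch below shows the idea in Python. The CRUCIAL_LABELS table uses the SATA attribute names given in this article; the GENERIC_LABELS table is hypothetical and is not taken from any real utility, and the function names are illustrative only.

```python
# SATA attribute names as listed in this article.
CRUCIAL_LABELS = {
    5: "Retired NAND Blocks",
    174: "Unexpected Power Loss Count",
    180: "Unused Reserved Block Count",
    194: "Enclosure Temperature",
    202: "Percentage Lifetime Used",
    210: "RAIN Successful Recovery Page Count",
}

# Hypothetical table from a generic utility; entries are illustrative only.
GENERIC_LABELS = {
    5: "Reallocated Sectors Count",
    194: "Temperature",
    202: "Data Address Mark Errors",
}


def label(attr_id, table):
    """Look up an attribute name, falling back to a generic placeholder."""
    return table.get(attr_id, f"Unknown attribute {attr_id}")


def mismatches(vendor, generic):
    """Return the IDs that both tables define but label differently."""
    shared = vendor.keys() & generic.keys()
    return sorted(i for i in shared if vendor[i] != generic[i])


for attr_id in mismatches(CRUCIAL_LABELS, GENERIC_LABELS):
    print(attr_id, label(attr_id, GENERIC_LABELS), "vs",
          label(attr_id, CRUCIAL_LABELS))
```

The raw number returned by the drive is the same in both cases; only the name and the threshold applied to it differ, which is why the same reading can look healthy in one tool and alarming in another.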
Because of this potential confusion, Crucial recommends using only our Storage Executive software as a tool to accurately retrieve and analyze SMART data on Crucial SSDs. Storage Executive will always be programmed with the correct attribute descriptions and thresholds (when applicable) for all but the oldest legacy Crucial SSDs.
Crucial-defined SMART attributes described
Crucial SSDs record several different attributes for retrieval by Storage Executive. Some attributes report critical information about the SSD, while others are informational only.
We’ll talk about some of the more important ones here. When an attribute is named differently for SATA and PCIe, both names are given:
Attribute 202: Percentage Lifetime Used
This attribute is exactly what its name implies: a measure of how much of the drive’s projected lifetime has been used at any point in time. When the SSD is brand new, Attribute 202 reports “0”; when the specified lifetime has been reached, it shows “100”, indicating that 100 percent of the lifetime has been used.
However, it’s important to realize what it means to reach 100 percent of projected lifetime – it does not mean that the drive is going to fail when that counter rolls over to 101 percent, only that your SSD may need to be replaced soon.
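One way to read this counter is shown in the short Python sketch below. The function name and the wording of the messages are hypothetical; the only facts taken from the text are that the value starts at 0, reaches 100 at the specified lifetime, and can keep counting past 100 without implying an immediate failure.

```python
def describe_lifetime_used(percent_used: int) -> str:
    """Turn a Percentage Lifetime Used reading into a plain-language note."""
    if percent_used < 0:
        raise ValueError("percentage used cannot be negative")
    if percent_used < 100:
        return f"{100 - percent_used}% of projected lifetime remaining"
    # Values of 100 and above are valid readings, not a failure signal.
    return "projected lifetime reached; plan a replacement"
```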
The lifetime of a NAND-flash device is defined by another characteristic: data retention. Data retention is the amount of time that the device can safely store and allow successful retrieval of user data in an unpowered state. When an SSD or other NAND-flash device is brand new, its unpowered data retention will be several years. However, almost like human memory, retention gets shorter as the device experiences wear, which comes from writing data (data reads do not directly cause wear).
The Joint Electron Device Engineering Council (JEDEC) is the industry group that creates standards and specifications for semiconductor-based devices and assemblies, and Micron is a leading member. JEDEC defines data retention in a specific way: for SSDs in client applications (like business or personal computers), data retention shall be one year in an unpowered state, stored at 30 °C (86 °F). This should give most computer users plenty of time to retrieve data from a drive that has spent some time on the shelf, if needed.
You might be able to tell from this description that the SSD can be expected to work quite well as the lifetime counter climbs past 100 percent. However, as wear continues, data retention will continue to degrade, down from one year to six months to three months, and so on. Eventually, long beyond the warrantied life of the drive, newly written data will not be retained at all while the drive is unpowered.
However, SSD firmware takes this into account. As the SSD continues to age, error correction code (ECC), read retries, adaptive read parameters, background data maintenance, and other adjustments in firmware can correct problems that arise because of gradually degrading data retention. As NAND data blocks degrade, they can be replaced by on-board spares, and normal operations can proceed. Of course, all these background operations take place when power is on, which is why data retention is defined in an unpowered state.
Attribute 5: Retired NAND Blocks
The number of blocks retired through this process of continually evaluating the quality of NAND blocks is tracked in SMART Attribute 5. SSD firmware will retire NAND blocks for several reasons in addition to the wear and data retention issues described above. One reason for retirement is a failure to erase a block while deleting data or moving data during garbage collection. This type of failure poses little risk to user data, since the data in question is being deleted or has already been copied successfully to a new location on the SSD.
Attribute 180: Unused Reserved Block Count (Available Spare Blocks on PCIe SSDs)
Again, as the name implies, this is the count of extra blocks available to be used in case bad blocks need to be retired. This number varies based on the underlying NAND architecture, the firmware architecture, and the user capacity of the drive, but it usually starts in the thousands.
This number decreases as the number of retired blocks increases. When Attribute 180 reaches 0, the firmware will place the SSD in read-only mode. The SSD will not be usable as a normal drive, but the user should be able to retrieve stored data and transfer it to a new device. We strongly advise replacing the drive if this number falls below 100 or so.
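The two decision points for this attribute can be written as a small Python check. This is a sketch only: the function and constant names are hypothetical, and the figure of 100 is the rough advisory number from the text, not a firmware threshold.

```python
# Rough advisory figure from the text ("below 100 or so"); not a firmware limit.
REPLACE_SOON_BELOW = 100


def spare_block_status(unused_reserved_blocks: int) -> str:
    """Classify a drive by its remaining spare block count (Attribute 180)."""
    if unused_reserved_blocks < 0:
        raise ValueError("spare block count cannot be negative")
    if unused_reserved_blocks == 0:
        # No spares left: the firmware places the SSD in read-only mode.
        return "read-only"
    if unused_reserved_blocks < REPLACE_SOON_BELOW:
        return "replace soon"
    return "ok"
```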
Attribute 210: RAIN Successful Recovery Page Count
Redundant Array of Independent NAND (RAIN) is analogous to achieving data redundancy with RAID in a drive array. However, RAIN redundancy is accomplished within the drive, transparently to the user. RAIN is a feature that the SSD uses to protect user data and to extend the lifetime of the drive.
RAIN events are rare, so if this count gets large, it’s time to examine some of the attributes above and see if the drive needs to be replaced. Frequent RAIN events can cause a noticeable decrease in performance. Using parity redundancy to recover data allows normal drive operations to continue, but consumes some I/O bandwidth. If performance decreases are frequent, RAIN rebuilds could be the reason, and could be cause for concern.
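To show why a parity rebuild costs extra I/O, here is a minimal Python sketch of single-parity XOR recovery, the scheme used by RAID 5. It is an assumption that this resembles RAIN closely enough to be useful; the actual RAIN implementation inside the drive is not described here, and the function names and sample pages are hypothetical.

```python
from functools import reduce


def xor_pages(pages):
    """Byte-wise XOR of a list of equal-length pages."""
    if not pages:
        raise ValueError("at least one page is required")
    length = len(pages[0])
    if any(len(p) != length for p in pages):
        raise ValueError("all pages must have the same length")
    # zip(*pages) walks the pages column by column, one byte position at a time.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def recover_page(surviving_pages, parity):
    """Rebuild the single missing page from the survivors and the parity page."""
    return xor_pages(list(surviving_pages) + [parity])


# Three tiny sample "pages" and their parity page.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_pages(data)

# Pretend the middle page became unreadable and rebuild it.
rebuilt = recover_page([data[0], data[2]], parity)
```

Rebuilding one page means reading every other page in the stripe plus the parity page, which is why frequent recoveries show up as reduced performance.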
Attribute 174: Unexpected Power Loss Count (Unsafe Shutdowns Count on PCIe SSDs)
A normal power loss in a computer system is preceded by a message from the host computer to the SSD that power is about to go down. This warning gives the SSD time to complete any ongoing activity. Once complete, the SSD sends an “acknowledge” message to the host, and the host finishes the shutdown.
There are many situations in which power goes down unexpectedly, and these can create problems for the SSD. In almost all conditions the SSD can compensate, and the system will start again, although the next boot-up may take a bit longer (a few seconds, instead of hundreds of milliseconds).
Attribute 174 is usually informational only. However, a large number of such events may indicate that a user needs to be trained on proper operating system shutdown, or that there could be a problem with power supplies or connections.
Attribute 194: Enclosure Temperature (Device Temperature on PCIe)
The Crucial Storage Executive software will report both the current temperature and the highest lifetime temperature, in Celsius, measured by a sensor on the SSD. The specified operating range for most Crucial SSDs is 0 °C to 70 °C (or 32 °F to 158 °F). Any temperature recorded above 70 °C could void the product warranty, so temperature should be monitored periodically. If temperatures regularly run above 65 °C, corrective measures such as improved ventilation and fans are advisable.
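The temperature limits above can be captured in a small Python helper. This is a sketch with hypothetical names; it judges a single reading, whereas the advice about 65 °C applies to temperatures that regularly run that high, so in practice the check would be applied to readings collected over time.

```python
OPERATING_MIN_C = 0               # lower end of the specified range
OPERATING_MAX_C = 70              # upper end of the specified range
COOLING_ADVISED_ABOVE_C = 65      # point at which better ventilation is advised


def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def temperature_status(current_c: float) -> str:
    """Classify one temperature reading against the limits in the text."""
    if current_c < OPERATING_MIN_C or current_c > OPERATING_MAX_C:
        return "out of range"
    if current_c > COOLING_ADVISED_ABOVE_C:
        return "improve cooling"
    return "ok"
```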
Some closing thoughts
SMART can be a very useful tool for monitoring the health of your SSD. However, SMART is not a comprehensive diagnostic tool. Information gleaned from SMART attributes along with operating system diagnostics can provide a good starting place for standard troubleshooting practices.
Incorrectly reported or misinterpreted SMART data can lead to incorrect conclusions, which can, unfortunately, lead to the return of a perfectly functional drive. Therefore, it bears repeating that Crucial strongly suggests that only the Crucial Storage Executive software be used to read SMART data from Crucial SSDs.
©2019 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject to change without notice. Neither Crucial nor Micron Technology, Inc. is responsible for omissions or errors in typography or photography. Micron, the Micron logo, Crucial, and the Crucial logo are trademarks or registered trademarks of Micron Technology, Inc. PCI Express and PCIe are registered trademarks of PCI-SIG. All other trademarks and service marks are the property of their respective owners.