Tuesday, July 14, 2015

A Large-Scale Study of Flash Memory Failures in the Field

A Large-Scale Study of Flash Memory Failures in the Field. Justin Meza, et al. ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. June 15-19, 2015.
     Servers use flash-memory-based solid state drives (SSDs) as a high-performance alternative to hard disk drives for storing persistent data. "Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability." Lower chip-level reliability can, in turn, lead to data loss.

This is the first large-scale study of actual flash-based SSD reliability. It analyzes data collected from flash-based SSDs in Facebook data centers over about four years and millions of operational hours, in order to understand their failure properties and trends. The major observations are:
  1. SSD failure rates do not increase monotonically with flash chip wear, but go through several distinct periods corresponding to how failures emerge and are subsequently detected, 
  2. the effects of read disturbance errors are not prevalent in the field, 
  3. sparse logical data layout across an SSD's physical address space can greatly affect SSD failure rate, 
  4. higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and 
  5. the amount of data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells.
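
To make the fifth observation concrete, one common way to describe the gap between what the operating system writes and what the flash cells actually absorb is a write amplification factor: bytes physically programmed to flash divided by bytes the host asked to write. The sketch below is only an illustration and not the paper's own analysis. The function name, the drive names, and the counter values are all hypothetical.

```python
def write_amplification(host_bytes_written: int, flash_bytes_written: int) -> float:
    """Ratio of bytes programmed to flash to bytes written by the host (OS)."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


TIB = 2**40

# Hypothetical counters: (bytes written by the OS, bytes programmed to flash).
# These values are made up for illustration and do not come from the study.
drives = {
    "drive_a": (10 * TIB, 12 * TIB),
    "drive_b": (10 * TIB, 35 * TIB),
}

for name, (host_bytes, flash_bytes) in drives.items():
    factor = write_amplification(host_bytes, flash_bytes)
    print(f"{name}: write amplification = {factor:.2f}")
```

In this made-up example both drives received the same amount of data from the operating system, yet the second drive's flash cells absorbed nearly three times as many writes as the first's, so OS-level write counts alone would misjudge their relative wear.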
The findings will hopefully lead to other analyses and flash reliability solutions.
