Carnegie Mellon University has released a study titled “A Large-Scale Study of Flash Memory Failures in the Field.” The study draws on data from Facebook’s datacenters, collected over the course of four years and millions of operational hours. It examines how errors manifest in the field and aims to help others develop novel flash reliability solutions.
Conclusions drawn from the study include:
- SSDs go through several distinct failure periods – early detection, early failure, usable life, and wearout – during their lifecycle, corresponding to the amount of data written to flash chips.
- Read disturbance errors are not a predominant source of errors in the SSDs examined.
- Sparse data layout across an SSD’s physical address space (e.g., non-contiguously allocated data) leads to high SSD failure rates; dense data layout (e.g., contiguous data) can also negatively impact reliability under certain conditions, likely due to adversarial access patterns.
- Higher temperatures lead to increased failure rates, but do so most noticeably for SSDs that do not employ throttling techniques.
- The amount of data reported to be written by the system software can overstate the amount of data actually written to flash chips, due to system-level buffering and wear reduction techniques.

Related reading:
- A Large-Scale Study of Flash Memory Failures in the Field
- Seagate Senior Researcher: Heat Can Kill Data on Stored SSDs
- Dataliths vs. the digital dark age