Analysis and modeling of memory errors from large-scale field data collection

Authors: Taniya Siddiqua, Athanasios E Papathanasiou, Arijit Biswas, Sudhanva Gurumurthi

DOI:

Keywords:

Abstract: Main memory reliability plays a crucial role in overall system reliability. Unfortunately, our collective understanding of the rate, pattern, and impact of memory errors is inadequate and can hinder our ability to innovate new fault-tolerant designs. This paper presents an in-depth study of observed corrected error data from the main memory system of a large server population deployed in data centers. Our analysis includes multiple structures on the memory path, such as the memory controllers, busses, channels, and memory modules. Based on our observations, we present a taxonomy of potential faults in the memory path. We provide a detailed characterization of the faults and present novel insights into the nature of these faults and the errors that they induce.
