Unprotected Computing: a large-scale study of DRAM raw error rate on a supercomputer

Resilience is a pressing issue for extreme-scale computing. DRAM errors have been analyzed in the past, but little is known about errors that escape hardware checks and lead to silent data corruption. This work attempts to fill that gap by analyzing memory errors for over a year on a cluster of about 1000 nodes featuring low-power memory without error correction. The study gathered millions of events recording thousands of errors. We analyze temporal and spatial correlations among the errors, as well as their relationship to temperature and time of day. The study showed that most multi-bit errors corrupted non-adjacent bits in the memory word and that most errors switched bits from 1 to 0. In addition, we observed thousands of cases of multiple single-bit errors occurring simultaneously in different regions of the memory. We propose several directions in which the findings of this study can help the design of more reliable systems in the future.
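To make the error categories above concrete (single-bit versus multi-bit, adjacent versus non-adjacent bits, and 1-to-0 versus 0-to-1 flips), the following is a minimal sketch of how one corrupted memory word could be classified. It is not the tooling used in the study: the function name `classify_error`, the 32-bit word width, and the assumption that the expected value of the word is known (for example, because a known pattern was written and later read back) are all illustrative assumptions.

```python
def classify_error(expected: int, observed: int, width: int = 32) -> dict:
    """Classify the bit flips between an expected and an observed memory word.

    Hypothetical helper for illustration; the 32-bit default width is an assumption.
    """
    mask = (1 << width) - 1
    # XOR leaves a 1 in every bit position where the two words differ.
    diff = (expected ^ observed) & mask
    positions = [i for i in range(width) if (diff >> i) & 1]

    # Direction of each flip, judged from the expected (pre-corruption) value.
    one_to_zero = [i for i in positions if ((expected >> i) & 1) == 1]
    zero_to_one = [i for i in positions if ((expected >> i) & 1) == 0]

    # A multi-bit error counts as adjacent only if all flipped bits are contiguous.
    multi_bit = len(positions) > 1
    adjacent = multi_bit and all(
        b - a == 1 for a, b in zip(positions, positions[1:])
    )

    return {
        "flipped_bits": positions,
        "multi_bit": multi_bit,
        "adjacent": adjacent,
        "one_to_zero": one_to_zero,
        "zero_to_one": zero_to_one,
    }


if __name__ == "__main__":
    # Two non-adjacent bits (0 and 2) flip from 1 to 0.
    print(classify_error(0b1111, 0b1010))
```

Under these assumptions, the example call reports a multi-bit, non-adjacent error in which both flips go from 1 to 0, which is the combination the study found most common.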