Characterizing Per-Node Memory Failures Using Benford’s Law
Reliability and Resiliency
Time: Sunday, 14 November 2021, 12:20pm - 12:30pm CST
Description: Fault tolerance is a key challenge as high performance computing systems continue to increase component counts, individual component reliability decreases, and hardware and software complexity increases. To better understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing, and analyzing failures on current systems. These studies require large volumes of data and complex analysis to identify statistical properties of the failure data. In this paper, we examine the lifetime of failures on the Cielo supercomputer that was located at Los Alamos National Laboratory, looking specifically at the per-node time between faults. Through this analysis, we show that the time between correctable faults on nodes obeys Benford’s law. This law applies to a number of naturally occurring collections of numbers and states that the leading digit is more likely to be small; for example, a leading digit of 1 is more likely than a leading digit of 9. This is in contrast to previous work that examined the interarrival time for correctable faults on the entire machine, which does not obey Benford’s law. This initial work provides critical analysis of the distribution of times between failures for extreme-scale systems. More specifically, the distributed analysis technique outlined in this work has the potential for significantly lower overheads than the centralized approach described in previous work. This work also enables a simple form of distributed failure prediction that can be used to lower failure mitigation overheads.
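
As an illustration of the leading-digit property described above, the following is a minimal Python sketch, not the authors' implementation. Benford’s law gives the expected probability of leading digit d as log10(1 + 1/d), so a leading 1 is expected about 30.1% of the time and a leading 9 about 4.6%. The function names, the input layout (a sorted list of fault timestamps for one node), and the use of mean absolute deviation as a conformity measure are all assumptions made for this example; the abstract does not specify how conformity was tested.

```python
import math
from collections import Counter


def benford_expected():
    """Expected probability of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a positive number."""
    if x <= 0:
        raise ValueError("leading digit is only defined here for positive values")
    # Scale the value into the range [1, 10) and truncate.
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def interarrival_times(timestamps):
    """Gaps between consecutive fault timestamps on one node (hypothetical input).

    Zero-length gaps are skipped because they have no leading digit.
    """
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:]) if b - a > 0]


def digit_frequencies(values):
    """Observed fraction of values that start with each leading digit."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in counts} if total else {}


def mean_abs_deviation(observed, expected):
    """Average absolute gap between observed and expected digit frequencies.

    This conformity measure is an assumption of this sketch, not the paper's test.
    """
    return sum(abs(observed.get(d, 0.0) - expected[d]) for d in expected) / len(expected)
```

Because each node needs only its own fault timestamps, a check like this could in principle run locally on every node, which is consistent with the distributed analysis the abstract describes. A caller would pass one node's timestamps to `interarrival_times`, feed the result to `digit_frequencies`, and compare against `benford_expected()` using `mean_abs_deviation`, where a smaller value indicates closer agreement.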