The International Conference for High Performance Computing, Networking, Storage and Analysis
Toward Effective Detection of Silent Data Corruptions for HPC Applications
Authors: Sheng Di (Argonne National Laboratory), Eduardo Berrocal (Illinois Institute of Technology), Leonardo Bautista-Gomez (Argonne National Laboratory), Katherine Heisey (Argonne National Laboratory), Rinku Gupta (Argonne National Laboratory), Franck Cappello (Argonne National Laboratory)
Abstract: Because of their large number of components, future extreme-scale systems are expected to suffer many silent data corruptions. Silent errors that flip low-order bits change values only slightly, which makes them difficult to detect in software. In this work, we recast the detection problem as a one-step look-ahead prediction problem and explore the most effective prediction methods for different HPC applications. We consider four predictors: the Auto Regressive (AR) model, the Auto Regressive Moving Average (ARMA) model, Linear Curve Fitting (LCF), and Quadratic Curve Fitting (QCF). We evaluate them using real HPC application traces. Experiments show that error feedback control plays an important role in improving detection. AR and QCF perform best among the evaluated methods: the F-measure can be kept around 80% for silent bit-flip errors occurring around bit position 20 for double-precision data or around bit position 8 for single-precision data.
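
The one-step look-ahead idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several things in it are assumptions: the abstract does not give the LCF and QCF formulas, so the sketch takes them to be extrapolation through the last two and last three values; bit positions are counted from the least significant bit of the IEEE 754 encoding; and the helper names (`flip_bit`, `predict_lcf`, `predict_qcf`, `is_suspect`), the trace, and the detection radius are all hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a double's IEEE 754 encoding (assumed: bit 0 = least significant)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def predict_lcf(history):
    """Assumed LCF form: extend the line through the last two values."""
    return 2 * history[-1] - history[-2]


def predict_qcf(history):
    """Assumed QCF form: extend the parabola through the last three values."""
    return 3 * history[-1] - 3 * history[-2] + history[-3]


def is_suspect(observed, predicted, radius):
    """Flag a value that falls outside the interval predicted +/- radius."""
    return abs(observed - predicted) > radius


# Hypothetical trace of one data point over time steps 1..3 (v = t**2).
history = [1.0, 4.0, 9.0]
clean = 16.0                      # true value at step 4
corrupted = flip_bit(clean, 20)   # same value with bit 20 flipped
predicted = predict_qcf(history)  # one-step look-ahead prediction
radius = 1e-9                     # illustrative, hand-picked detection radius

print(corrupted - clean)                         # size of the corruption
print(is_suspect(clean, predicted, radius))      # clean value vs. prediction
print(is_suspect(corrupted, predicted, radius))  # corrupted value vs. prediction
```

In this toy trace, flipping bit 20 of 16.0 shifts the value by only 2^-28 (about 3.7e-9), which shows why such errors are hard to catch: the detection radius has to be tighter than the corruption yet looser than the predictor's own error. The fixed radius here is a simplification; the abstract reports that error feedback control, which this sketch omits, plays an important role in detection quality.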