NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing

SESSION: Machine Learning and Data Analytics


TIME: 10:30AM - 11:00AM


AUTHOR(S): Zhengzhang Chen, Seung Woo Son, William Hendrix, Ankit Agrawal, Wei-keng Liao, Alok Choudhary



Data checkpointing is an important fault tolerance technique in High Performance Computing systems. This paper exploits the fact that, in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from one another. Thus, capturing the distribution of relative changes in the data, instead of storing the data itself, allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order-of-magnitude data reduction becomes achievable with user-defined, guaranteed error bounds for each data point. We propose NUMARCK, the NU Machine learning Algorithm for Resiliency and ChecKpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance.
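
The core idea of the abstract (store a compact index for each point's relative change rather than the value itself, and keep the error within a user-defined bound) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes fixed equal-width bins in place of a learned distribution, it assumes the error bound applies to the change ratio, and the names `encode_changes`, `decode_changes`, and `index_bits` are hypothetical.

```python
def encode_changes(prev, curr, error_bound, index_bits=8):
    """Encode curr relative to prev as small bin indices plus exact exceptions.

    Simplifying assumption: equal-width bins of width 2 * error_bound, so a
    bin center is never more than error_bound away from the true change ratio.
    """
    width = 2.0 * error_bound
    # Signed index range; one code is left free to mark an exception.
    half = ((1 << index_bits) - 1) // 2
    indices = []   # one bin index per point, or None for an exception
    exact = {}     # position -> exact value for points that cannot be binned
    for i, (p, c) in enumerate(zip(prev, curr)):
        if p == 0.0:
            # The change ratio is undefined; store the value exactly.
            exact[i] = c
            indices.append(None)
            continue
        ratio = (c - p) / p
        k = round(ratio / width)
        # Fall back to exact storage if the index does not fit in index_bits
        # or if floating-point rounding pushed the error past the bound.
        if abs(k) > half or abs(k * width - ratio) > error_bound:
            exact[i] = c
            indices.append(None)
        else:
            indices.append(k)
    return indices, exact


def decode_changes(prev, indices, exact, error_bound):
    """Rebuild the current iteration from the previous one and the encoding."""
    width = 2.0 * error_bound
    restored = []
    for i, p in enumerate(prev):
        if i in exact:
            restored.append(exact[i])
        else:
            restored.append(p * (1.0 + indices[i] * width))
    return restored


if __name__ == "__main__":
    previous = [100.0, 200.0, 0.0, 50.0]
    current = [100.34, 199.1, 3.0, 80.0]
    idx, exceptions = encode_changes(previous, current, error_bound=0.001)
    print(idx, exceptions)
    print(decode_changes(previous, idx, exceptions, error_bound=0.001))
```

In this sketch, each binned point costs `index_bits` bits instead of a full floating-point value, which is where the data reduction would come from when most changes are small. Points with a zero previous value or an unusually large change are stored exactly, so the bound still holds for them.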

Chair/Author Details:

Hank Childs (Chair) - University of Oregon and Lawrence Berkeley National Laboratory

Zhengzhang Chen - Northwestern University

Seung Woo Son - Northwestern University

William Hendrix - Northwestern University

Ankit Agrawal - Northwestern University

Wei-keng Liao - Northwestern University

Alok Choudhary - Northwestern University

