Sponsored by IEEE and ACM
The International Conference for High Performance Computing, Networking, Storage and Analysis
SCHEDULE: NOV 16-21, 2014

Fault Injection, Detection, and Correction in CLAMR Using F-SEFI

SESSION: Poster Reception


TIME: 5:15PM - 7:00PM

AUTHOR(S): Brian Atkinson, Nathan DeBardeleben, Qiang Guan, William M. Jones

ROOM: New Orleans Theater Lobby


F-SEFI is a fine-grained software-based soft fault injection tool developed at LANL. We used F-SEFI to study the resilience of CLAMR, a cell-based adaptive mesh refinement hydrodynamic code also developed at LANL, in the presence of soft errors. CLAMR models a cylindrical shock generated in the center of the mesh that reflects off the boundaries. We focused our fault injections on the exponent bit field of floating-point add operations. Using the conservation of mass calculations inherent to the shallow water simulations, we specified an acceptable bound for the percentage difference in mass between specified time steps. We built checkpointing and rollback mechanisms into CLAMR to save and restore state and mesh values from backup files. Using the checkpointing and rollback routines, we were able to recover from 81% of the soft errors that would otherwise have caused incorrect results or an application crash.
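
The detection and recovery loop described above can be illustrated with a small, self-contained sketch. It is not F-SEFI or CLAMR code. Every name in it (flip_bit, faulty_add, mass_percent_difference, make_toy_step, run_with_rollback) is hypothetical. It assumes 64-bit IEEE 754 doubles, since the abstract does not state a precision. It keeps the checkpoint in memory rather than in backup files, and it replaces the shallow water solver with a toy mass-conserving update. The fault is assumed to be transient, so re-running the steps after a rollback does not reproduce it.

```python
import struct

# Assumption: IEEE 754 binary64, where bits 52..62 hold the exponent field.
EXPONENT_BITS = range(52, 63)


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its 64-bit encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def faulty_add(a: float, b: float, bit: int) -> float:
    """Floating-point add whose result has a single exponent bit flipped."""
    if bit not in EXPONENT_BITS:
        raise ValueError("bit must lie in the exponent field (52..62)")
    return flip_bit(a + b, bit)


def mass_percent_difference(reference_mass: float, current_mass: float) -> float:
    """Percentage change in total mass relative to a reference value."""
    return abs(current_mass - reference_mass) / abs(reference_mass) * 100.0


def make_toy_step(fault_step: int, bit: int):
    """Toy mass-conserving update that injects one transient fault, once."""
    state = {"injected": False}

    def step_fn(heights, step):
        n = len(heights)
        # Each cell passes 10% of its height to its right neighbour (periodic).
        new = [0.9 * heights[i] + 0.1 * heights[i - 1] for i in range(n)]
        if step == fault_step and not state["injected"]:
            state["injected"] = True
            new[0] = faulty_add(0.9 * heights[0], 0.1 * heights[-1], bit)
        return new

    return step_fn


def run_with_rollback(heights, steps, check_interval, tolerance_percent, step_fn):
    """Advance the state, checking mass every check_interval steps.

    If the mass has drifted beyond the tolerance since the last checkpoint,
    restore the checkpoint and redo the steps; otherwise save a new one.
    """
    checkpoint = list(heights)      # in-memory stand-in for a backup file
    checkpoint_step = 0
    checkpoint_mass = sum(checkpoint)
    rollbacks = 0
    step = 0
    while step < steps:
        heights = step_fn(heights, step)
        step += 1
        if step % check_interval == 0 or step == steps:
            diff = mass_percent_difference(checkpoint_mass, sum(heights))
            if not (diff <= tolerance_percent):   # also true when diff is NaN
                heights = list(checkpoint)        # roll back
                step = checkpoint_step
                rollbacks += 1
            else:
                checkpoint = list(heights)        # save a new checkpoint
                checkpoint_step = step
                checkpoint_mass = sum(checkpoint)
    return heights, rollbacks
```

As a usage example, run_with_rollback([1.0, 2.0, 3.0, 4.0], steps=10, check_interval=2, tolerance_percent=0.01, step_fn=make_toy_step(fault_step=3, bit=60)) injects a single exponent bit flip during the fourth update. The next mass check sees the drift and restores the checkpoint taken after the second update. A check like this only catches faults that change the total mass by more than the chosen bound, so a flip in a low exponent bit of a very small value could pass unnoticed.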

Chair/Author Details:

Brian Atkinson - Clemson University

Nathan DeBardeleben - Los Alamos National Laboratory

Qiang Guan - Los Alamos National Laboratory

William M. Jones - Coastal Carolina University

