The International Conference for High Performance Computing, Networking, Storage and Analysis
Fault Injection, Detection, and Correction in CLAMR Using F-SEFI
Authors: Brian Atkinson (Clemson University), Nathan DeBardeleben (Los Alamos National Laboratory), Qiang Guan (Los Alamos National Laboratory), William M. Jones (Coastal Carolina University)
Abstract: F-SEFI is a fine-grained, software-based soft fault injection tool developed at LANL. We used F-SEFI to study the resilience, in the presence of soft errors, of the scientific application CLAMR, a cell-based adaptive mesh refinement hydrodynamics code also developed at LANL. CLAMR models a cylindrical shock that is generated in the center of the mesh and reflects off the boundaries. We focused our fault injections on the exponent bit field of floating-point add operations. Using the conservation-of-mass calculations inherent to the shallow water simulation, we specified an acceptable bound on the percentage difference in mass between specified time steps and used it to detect faults. We built checkpointing and rollback mechanisms into CLAMR to save and restore state and mesh values from backup files. Using the checkpointing and rollback routines, we recovered from 81% of the soft errors that would otherwise have produced incorrect results or crashed the application.
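
The inject, detect, and roll back cycle described in the abstract can be illustrated with a small Python sketch. It is not F-SEFI or CLAMR code. A toy one-dimensional conservative diffusion update stands in for the shallow water solver, an in-memory copy stands in for the backup files, and every name and parameter here (`flip_bit`, `faulty_add`, `mass_pct_diff`, `step`, `run`, `tol_pct`, `checkpoint_every`) is hypothetical and chosen for illustration only.

```python
import math
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of an IEEE 754 double."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def faulty_add(a, b, exponent_bit):
    """Add two doubles, then corrupt one bit of the result's exponent field.

    exponent_bit in 0..10 indexes the 11-bit exponent (bits 52..62).
    """
    return flip_bit(a + b, 52 + exponent_bit)


def mass_pct_diff(prev_mass, curr_mass):
    """Percentage change in total mass between two time steps."""
    return 100.0 * abs(curr_mass - prev_mass) / abs(prev_mass)


def step(h, fault=None):
    """One conservative update of the toy model (assumed stand-in physics).

    fault is None or (cell_index, exponent_bit); if given, the add that
    updates that cell is corrupted.
    """
    k = 0.1  # assumed diffusion coefficient
    flux = [k * (h[i] - h[i + 1]) for i in range(len(h) - 1)]
    new = []
    for i in range(len(h)):
        inflow = flux[i - 1] if i > 0 else 0.0
        outflow = flux[i] if i < len(h) - 1 else 0.0
        delta = inflow - outflow
        if fault is not None and fault[0] == i:
            new.append(faulty_add(h[i], delta, fault[1]))
        else:
            new.append(h[i] + delta)
    return new


def run(h0, n_steps, fault_step=None, fault=None,
        tol_pct=1e-6, checkpoint_every=5):
    """Advance the model, checking mass each step and rolling back on failure.

    The fault is transient: it is injected once, so the re-executed step
    after a rollback runs cleanly. Returns (final_state, rollback_count).
    """
    h = list(h0)
    checkpoint = (0, list(h))  # (time step, saved state)
    rollbacks = 0
    injected = False
    t = 0
    while t < n_steps:
        inject = fault is not None and t == fault_step and not injected
        new = step(h, fault if inject else None)
        injected = injected or inject
        prev_mass, new_mass = sum(h), sum(new)
        bad = (not math.isfinite(new_mass)
               or mass_pct_diff(prev_mass, new_mass) > tol_pct)
        if bad:
            # Detection: mass left the acceptable bound, so restore state.
            t, h = checkpoint[0], list(checkpoint[1])
            rollbacks += 1
            continue
        h = new
        t += 1
        if t % checkpoint_every == 0:
            checkpoint = (t, list(h))
    return h, rollbacks
```

Calling `run(h0, 20)` gives a fault-free reference, and `run(h0, 20, fault_step=7, fault=(3, 10))` injects a single flip of the highest exponent bit. Because that flip changes the total mass by far more than the assumed tolerance (or makes it non-finite), the check triggers a rollback to the last checkpoint and the recomputed run matches the reference. Flips that perturb the mass by less than the tolerance would pass undetected in this sketch, which is one reason a recovery rate below 100% is plausible.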