The International Conference for High Performance Computing, Networking, Storage and Analysis
Fault Tolerant Iterative Solvers through Selective Reliability and Skeptical Programming
Student: James Elliott (North Carolina State University)
Advisor: Frank Mueller (North Carolina State University)
Abstract: Problem: Current and future systems may experience transient faults that silently corrupt the data being operated on. These faults are troubling because nothing indicates that a fault has occurred, and detecting an error may require performing operations multiple times and voting to determine whether any values are tainted.
Approach: I focus on numerical methods and have proposed an approach called Skeptical Programming. My research couples the numerical method with the properties of the data being operated on to derive cheap invariants that filter out large “damaging” errors while allowing small “bounded” errors to slip through. The errors that slip through are readily absorbed by the method's convergence theory, resulting in a low-overhead, algorithm-based fault tolerance approach that exploits both numerical analysis and system-level fault tolerance. This technique scales well because it requires no additional communication, and it is applicable to many methods (we present findings for the CG and GMRES solvers).
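To make the idea of a cheap invariant concrete, here is a minimal sketch of one possible check. The abstract does not state which invariants are used, so this is an illustrative assumption rather than the author's method. It relies on a standard bound: for unit vectors q and v, the projection h = q · (A v) satisfies |h| <= ||A||_2 <= ||A||_F, so any value larger than the Frobenius norm of A must be corrupt. The function names (frobenius_norm, matvec, dot, skeptical_projection) are hypothetical and chosen only for this example.

```python
import math


def frobenius_norm(A):
    """Frobenius norm of a dense matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in A for x in row))


def matvec(A, v):
    """Dense matrix-vector product A @ v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def skeptical_projection(q, w, bound):
    """Compute h = q . w and reject it if |h| exceeds the bound.

    Assumes q is a unit vector and w = A v for a unit vector v, so that
    |h| <= ||A||_F holds in exact arithmetic. The comparison is written so
    that NaN also fails the check.
    """
    h = dot(q, w)
    if not (abs(h) <= bound):
        raise ArithmeticError("projection violates the norm bound")
    return h


# Illustrative data (assumed, not from the source).
A = [[4.0, 1.0], [1.0, 3.0]]
bound = frobenius_norm(A)      # computed once, outside the solver loop
q = [1.0, 0.0]                 # unit vector
w = matvec(A, q)               # fault-free product

h_ok = skeptical_projection(q, w, bound)             # passes the check
h_small = skeptical_projection(q, [4.5, 1.0], bound)  # small error slips through

try:
    skeptical_projection(q, [4.0e8, 1.0], bound)      # large error is filtered
    large_error_caught = False
except ArithmeticError:
    large_error_caught = True
```

The check costs one comparison per projection and uses only a norm computed once up front, so it adds no communication. The small perturbation passes undetected, which matches the premise that bounded errors are left for the solver's convergence behavior to absorb.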