The International Conference for High Performance Computing, Networking, Storage and Analysis
Introspective Resilience for Exascale High Performance Computing Systems
Student: Saurabh Hukerikar (University of Southern California)
Advisor: Robert F. Lucas (University of Southern California)
Abstract: Future exascale HPC systems will be built from hundreds of millions of components organized in complex hierarchies to satisfy the demand for faster and more accurate scientific computations. However, the sheer scale of these systems and the inherent unreliability of VLSI chips imply that faults and errors will affect HPC applications with increasing frequency, making it increasingly difficult to accomplish useful computation. In this dissertation work, we propose an introspective approach to managing the resilience of HPC applications. Through a set of modest extensions to current programming models, we capture the programmer's insight into the application's fault tolerance features. An introspective runtime framework reasons about the rate and sources of faults and errors in the system and attempts to understand their impact on application correctness. This enables collaborative cross-layer efforts for error detection and recovery. Our preliminary results show promise for meeting the reliability demands and expectations of future exascale platforms.