The International Conference for High Performance Computing, Networking, Storage and Analysis
Using Global View Resilience (GVR) to add Resilience to Exascale Applications.
Authors: Hajime Fujita (University of Chicago and Argonne National Laboratory), Nan Dun (University of Chicago and Argonne National Laboratory), Aiman Fang (University of Chicago), Zachary A. Rubenstein (University of Chicago), Ziming Zheng (HP Vertica), Kamil Iskra (Argonne National Laboratory), Jeff Hammond (Intel Corporation), Anshu Dubey (Lawrence Berkeley National Laboratory), Pavan Balaji (Argonne National Laboratory), Andrew A. Chien (University of Chicago and Argonne National Laboratory)
Best Poster Finalist
Abstract: Resilience is a major challenge for future exascale machines.
Existing approaches are unlikely to handle complex failures such as latent errors, so a new approach is needed.
We propose Global View Resilience (GVR), a new library that exploits a global-view data model and adds reliability through versioning (multi-version), user-controlled versioning timing and rate (multi-stream), and flexible cross-layer error signalling and recovery. GVR enables application programmers to exploit deep scientific and application-code insights to manage resilience (and its overhead) in a flexible, portable fashion.
We applied the GVR library to several existing scientific application codes and showed that GVR is easy to apply and that the runtime overhead of versioning is negligible.
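
The multi-version, multi-stream idea in the abstract can be illustrated with a toy, single-process sketch. This is not the GVR API: the class `VersionedArray`, its methods, and the driver `run` are hypothetical names invented for illustration, and a real global-view array would be distributed across nodes. The sketch only shows two arrays being versioned at different, application-chosen rates, and an application-level check rolling one array back to its latest good version.

```python
import math


class VersionedArray:
    """Toy stand-in for a multi-version array (hypothetical; not the GVR API)."""

    def __init__(self, size, fill=0.0):
        self.current = [fill] * size
        self.versions = []  # immutable snapshots, oldest first

    def put(self, index, value):
        self.current[index] = value

    def get(self, index):
        return self.current[index]

    def snapshot(self):
        """Record the current contents as a new version and return its index."""
        self.versions.append(tuple(self.current))
        return len(self.versions) - 1

    def restore(self, version):
        """Replace the current contents with an earlier version."""
        self.current = list(self.versions[version])


def run(steps, corrupt_at=None, mesh_interval=5):
    """Advance a fake simulation; the two arrays are versioned at different rates."""
    pressure = VersionedArray(4)  # changes every step: versioned every step
    mesh = VersionedArray(4)      # changes rarely: versioned every mesh_interval steps
    for step in range(steps):
        for i in range(4):
            pressure.put(i, float(step + i))
        pressure.snapshot()
        if step % mesh_interval == 0:
            for i in range(4):
                mesh.put(i, float(step))
            mesh.snapshot()
        if step == corrupt_at:
            pressure.put(0, float("nan"))  # injected error for the demo
        # Application-level check: NaN is never a valid pressure here.
        if any(math.isnan(v) for v in pressure.current):
            pressure.restore(len(pressure.versions) - 1)
    return pressure, mesh


if __name__ == "__main__":
    pressure, mesh = run(8, corrupt_at=7)
    print(len(pressure.versions), len(mesh.versions), pressure.get(0))
```

The application decides both how often each array is versioned and what counts as an error, which is the kind of control the abstract attributes to the programmer.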