The International Conference for High Performance Computing, Networking, Storage and Analysis
Scalable Fault Tolerance in Multiprocessor Systems
Student: Gagan Gupta (University of Wisconsin-Madison)
Supervisor: Gurindar S. Sohi (University of Wisconsin-Madison)
Abstract: Evolving trends in the design and use of computers are producing fault-prone systems that may fail to execute a program to completion. Checkpoint-and-recovery (CPR) is commonly used to recover from faults and complete parallel programs. Conventional CPR incurs high overheads and may prove inadequate in the future as faults become more frequent. This work proposes executing parallel programs deterministically to enable lower-overhead, scalable fault tolerance.