Scalable Fault Tolerance in Multiprocessor Systems

SESSION: ACM Student Research Competition Poster Reception

EVENT TYPE: ACM Student Research Competition

TIME: 5:15PM - 7:00PM

AUTHOR(S): Gagan Gupta

ROOM: New Orleans Theater Lobby


Evolving trends in the design and use of computers are producing fault-prone systems that may fail to execute a program to completion. Checkpoint-and-recovery (CPR) is commonly used to recover from faults and complete parallel programs. Conventional CPR incurs high overheads and may become inadequate as faults grow more frequent. This work proposes executing parallel programs deterministically to enable lower-overhead, scalable fault tolerance.

Gagan Gupta - University of Wisconsin-Madison

