BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20141117T143000Z
DTEND:20141117T230000Z
LOCATION:388
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Resilience is a critical issue for large-scale platforms. This tutorial=0A  provides a comprehensive survey of fault-tolerant techniques for=0A  high-performance computing, with a fair balance between practice and theory.=0A  It is organized along four main topics:=0A (i) An overview of failure types (software/hardware, transient/fail-stop) and=0A  typical probability distributions (Exponential, Weibull, Log-Normal);=0A (ii) General-purpose techniques, which include several checkpoint and=0A  rollback recovery protocols, replication, prediction and silent error=0A  detection;=0A (iii) Application-specific techniques, such as ABFT for grid-based algorithms=0A  or fixed-point convergence for iterative applications; and=0A (iv) Practical deployment of fault-tolerant techniques with User Level Fault=0A  Mitigation (an MPI extension proposed to the MPI Forum). Relevant examples=0A  based on widely used computational solver routines will be protected with a mix=0A  of checkpoint-restart and advanced recovery techniques in a hands-on session.=0A=0A  The tutorial is open to all SC'14 attendees who are interested in the current=0A  status and expected promise of fault-tolerant approaches for scientific=0A  applications. There are no audience prerequisites: background will be provided=0A  for all protocols and probabilistic models. However, basic knowledge of MPI=0A  will be helpful for the hands-on session.
SUMMARY:Fault-Tolerance for HPC: Theory and Practice
PRIORITY:3
END:VEVENT
END:VCALENDAR