The International Conference for High Performance Computing, Networking, Storage and Analysis
Sponsored by IEEE and ACM

SCHEDULE: NOV 16-21, 2014


Understanding the Effects of Communication and Coordination on Checkpointing at Scale

SESSION: Optimized Checkpointing


TIME: 1:30PM - 2:00PM

SESSION CHAIR: Patrick Bridges

AUTHOR(S): Kurt B. Ferreira, Scott Levy, Patrick M. Widener, Dorian C. Arnold, Torsten Hoefler



Fault tolerance poses a major challenge for future large-scale systems. However, few insights into the selection and tuning of checkpointing protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that although much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints cause process delays that can propagate through messaging relations to other processes, producing a cascading series of delays. Lastly, we demonstrate how to tune hierarchical uncoordinated checkpointing protocols, which were designed to reduce log volumes, to significantly reduce these synchronization overheads at scale.
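
The cascading-delay effect described above can be illustrated with a toy model. The sketch below is not the simulator used in the paper. It assumes a hypothetical iterative workload in which each process exchanges messages with its two ring neighbors after every iteration, so a process cannot begin an iteration until both neighbors have finished the previous one. The function name `simulate` and all of its parameters are illustrative assumptions.

```python
def simulate(num_procs, num_iters, work, ckpt_cost, ckpt_iters):
    """Return per-process finish times for a toy ring-exchange workload.

    ckpt_iters maps a process rank to the set of iterations in which that
    process takes a local checkpoint (a hypothetical parameter, chosen
    only for this illustration).
    """
    finish = [0.0] * num_procs
    for it in range(num_iters):
        new_finish = []
        for p in range(num_procs):
            # A process waits for its own previous iteration and for the
            # messages sent by its left and right ring neighbors.
            left = finish[(p - 1) % num_procs]
            right = finish[(p + 1) % num_procs]
            start = max(left, finish[p], right)
            # A local checkpoint adds a delay to this iteration only.
            delay = ckpt_cost if it in ckpt_iters.get(p, ()) else 0.0
            new_finish.append(start + work + delay)
        finish = new_finish
    return finish


# Four processes, four iterations, one checkpoint per process.
# Coordinated: every process checkpoints in iteration 0.
coordinated = simulate(4, 4, 1.0, 0.5, {p: {0} for p in range(4)})
# Uncoordinated (staggered): process p checkpoints in iteration p.
staggered = simulate(4, 4, 1.0, 0.5, {p: {p} for p in range(4)})

print(max(coordinated), max(staggered))  # prints "4.5 6.0"
```

In this toy setting each process spends the same 0.5 time units checkpointing in both cases, yet the staggered schedule finishes at 6.0 rather than 4.5, because each local delay reaches the neighbors through the message dependencies and the delays add up. Real workloads have more slack and different communication patterns, so the size of the effect here should not be read as a result from the paper.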

Chair/Author Details:

Patrick Bridges (Chair) - University of New Mexico

Kurt B. Ferreira - Sandia National Laboratories

Scott Levy - University of New Mexico

Patrick M. Widener - Sandia National Laboratories

Dorian C. Arnold - University of New Mexico

Torsten Hoefler - ETH Zurich


Paper provided by the ACM Digital Library

Paper also available from IEEE Computer Society