The International Conference for High Performance Computing, Networking, Storage and Analysis
Sponsored by IEEE and ACM

SCHEDULE: NOV 16-21, 2014


A System Software Approach to Proactive Memory-Error Avoidance

SESSION: Resilience


TIME: 4:00PM - 4:30PM

SESSION CHAIR: Ananta Tiwari

AUTHOR(S): Carlos H. A. Costa, Yoonho Park, Bryan S. Rosenburg, Chen-Yong Cher, Kyung Dong Ryu



Today's HPC systems use two mechanisms to address main-memory errors.
Error-correcting codes make correctable errors transparent to software,
while checkpoint/restart (CR) enables recovery from uncorrectable errors.
Unfortunately, CR overhead will be enormous at exascale due to the high
failure rate of memory. We propose a new OS-based approach that proactively
avoids memory errors using prediction. This scheme exposes correctable
error information to the OS, which migrates pages and offlines unhealthy
memory to avoid application crashes. We analyze memory error patterns in
extensive logs from a BG/P system and show how correctable error patterns
can be used to identify memory likely to fail. We implement a proactive
memory management system on BG/Q by extending the firmware and Linux. We
evaluate our approach with a realistic workload and compare our overhead
against CR. We show improved resilience with negligible performance
overhead for applications.
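
The mechanism described in the abstract (correctable-error reports reaching the OS, which then migrates pages and offlines suspect memory) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not describe the actual predictor, so a simple per-frame error-count threshold is swapped in here, and the class name `PageHealthTracker`, its methods, and the default threshold of 3 are all hypothetical.

```python
from collections import Counter


class PageHealthTracker:
    """Toy model of proactive memory-error avoidance.

    Assumption: a frame is treated as "likely to fail" once it has
    reported `threshold` correctable errors. The real predictor in the
    paper is not specified in the abstract.
    """

    def __init__(self, free_frames, threshold=3):
        self.free_frames = list(free_frames)  # healthy, unused physical frames
        self.threshold = threshold            # hypothetical cutoff
        self.ce_counts = Counter()            # correctable errors seen per frame
        self.offlined = set()                 # frames retired from use
        self.mapping = {}                     # virtual page -> physical frame

    def map_page(self, vpage):
        """Back a virtual page with the next free frame."""
        frame = self.free_frames.pop(0)
        self.mapping[vpage] = frame
        return frame

    def report_correctable_error(self, frame):
        """Record one correctable error; return True if the frame was retired."""
        if frame in self.offlined:
            return False
        self.ce_counts[frame] += 1
        if self.ce_counts[frame] < self.threshold:
            return False
        # Never hand out the suspect frame again.
        if frame in self.free_frames:
            self.free_frames.remove(frame)
        # Migrate any page backed by the suspect frame to a healthy one.
        # (Raises IndexError if no free frame is left; a real system
        # would need a fallback policy here.)
        for vpage, backing in self.mapping.items():
            if backing == frame:
                self.mapping[vpage] = self.free_frames.pop(0)
        self.offlined.add(frame)
        return True


tracker = PageHealthTracker(free_frames=[10, 11, 12], threshold=3)
tracker.map_page("A")                      # "A" is backed by frame 10
for _ in range(3):
    retired = tracker.report_correctable_error(10)
# After the third report, "A" has been moved off frame 10 and the
# frame is offlined before an uncorrectable error can hit it.
```

The point of the sketch is the ordering: the page is moved while its errors are still correctable, so the application never observes the failure and no checkpoint rollback is needed.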

Chair/Author Details:

Ananta Tiwari (Chair) - PMaC Lab, SDSC

Carlos H. A. Costa - IBM Corporation

Yoonho Park - IBM Corporation

Bryan S. Rosenburg - IBM Corporation

Chen-Yong Cher - IBM Corporation

Kyung Dong Ryu - IBM Corporation


Paper provided by the ACM Digital Library

Paper also available from IEEE Computer Society