Sponsored by IEEE and ACM
The International Conference for High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 16-21, 2014

Correctness Field Testing of Production and Decommissioned High Performance Computing Platforms at Los Alamos National Laboratory

SESSION: Hardware Vulnerability and Recovery

EVENT TYPE: Papers

TIME: 2:30PM - 3:00PM

SESSION CHAIR: Alison Kennedy

AUTHOR(S): Sarah E. Michalak, William N. Rust, John T. Daly, Andrew J. DuBois, David H. DuBois

ROOM: 393-94-95

ABSTRACT:

Silent Data Corruption (SDC) can threaten the integrity of
scientific calculations performed on high performance computing
(HPC) platforms and other systems. To characterize this issue,
correctness field testing of HPC platforms at Los Alamos National
Laboratory was performed. This work presents results for 12
platforms, including over 1,000 node-years of computation performed
on over 8,750 compute nodes and over 260 PB of data transfers
involving nearly 6,000 compute nodes, and relevant lessons learned.
Incorrect results characteristic of transient errors and of
intermittent errors were observed. These results are a key
underpinning to resilience efforts as they provide signatures of
incorrect results observed under field conditions. Five incorrect
results consistent with a transient error mechanism were observed,
suggesting that the effects of transient errors could be
mitigated. However, the observed numbers of incorrect results consistent with
an intermittent error mechanism suggest that intermittent errors could
substantially affect computational correctness.

Chair/Author Details:

Alison Kennedy (Chair) - Edinburgh Parallel Computing Centre

Sarah E. Michalak - Los Alamos National Laboratory

William N. Rust - Los Alamos National Laboratory

John T. Daly - Laboratory for Physical Sciences

Andrew J. DuBois - Los Alamos National Laboratory

David H. DuBois - Los Alamos National Laboratory

Paper provided by the ACM Digital Library

Paper also available from IEEE Computer Society