BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20141120T173000Z
DTEND:20141120T180000Z
LOCATION:391-92
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Large-scale, PDE-based scientific applications are commonly parallelized across large compute resources using MPI. However, the compute power of the resource as a whole can only be utilized if each multicore node is fully utilized. Currently, many PDE solver frameworks parallelize over boxes. In the Chombo framework, the box sizes are typically 16^3, but larger box sizes such as 128^3 would result in less ghost cell overhead. Unfortunately, typical on-node parallel scaling performs quite poorly for these larger box sizes. In this paper, we investigate around 30 different inter-loop optimization strategies and demonstrate the parallel scaling advantages of some of these variants on NUMA multicore nodes. Shifted, fused, and communication-avoiding variants for 128^3 boxes result in close to ideal parallel scaling and come close to matching the performance of 16^3 boxes on three different multicore systems for an exemplar for many Computational Fluid Dynamics (CFD) codes.
SUMMARY:A Study on Balancing Parallelism and Data Locality in Stencil Calculations
PRIORITY:3
END:VEVENT
END:VCALENDAR