BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20141118T200000Z
DTEND:20141118T203000Z
LOCATION:388-89-90
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: GPU implementations of HPC applications that rely on finite difference methods can include tens of memory-bound kernels. Kernel fusion can improve performance by reducing data traffic to off-chip memory; kernels that share data arrays are fused into larger kernels, where on-chip cache holds the data reused by instructions originating from different kernels. The main challenges are: a) searching for the optimal kernel fusions while constrained by data dependences and kernels' precedences, and b) applying kernel fusion effectively to achieve speedup. This paper introduces a problem definition and a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problem sizes. The paper also introduces a codeless performance upper-bound projection to achieve effective fusions. Results show that the proposed kernel fusion method improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
SUMMARY:Scalable Kernel Fusion for Memory-Bound GPU Applications
PRIORITY:3
END:VEVENT
END:VCALENDAR