Abstract: We present a suite of directives, named pipelined accelerator (PACC),
and its implementation for accelerating large-scale computation on a
graphics processing unit (GPU). PACC extends OpenACC to partition
data that is too large to fit entirely in device
memory. Given a program with PACC directives, our PACC translator
rewrites the program into an OpenACC program that divides the data
into multiple chunks for accelerated
execution. Furthermore, the generated program processes chunks in a
pipeline so that data transfer between the CPU and GPU can overlap
with computation on the GPU. We also present preliminary results
showing the impact of PACC on program execution time and on the
maximum data size that can be processed
successfully.
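
To illustrate the kind of code such a translation targets, the following is a minimal hand-written sketch in plain OpenACC. It is not PACC syntax and not the output of the PACC translator. The function name scale_chunked, the element-wise kernel, and the choice of two asynchronous queues are assumptions made only for illustration. Each chunk's host-to-device copy, kernel, and device-to-host copy are enqueued on one of two queues, so the transfer of one chunk may overlap with the computation of the other, and waiting on a queue before reusing it is intended to keep at most two chunks in flight.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: computes y[i] = 2 * x[i] for n elements,
 * processing the arrays in chunks of at most `chunk` elements. */
void scale_chunked(const float *x, float *y, size_t n, size_t chunk)
{
    assert(chunk > 0);

    size_t k = 0; /* chunk index */
    for (size_t start = 0; start < n; start += chunk, ++k) {
        size_t len = (n - start < chunk) ? (n - start) : chunk;
        int q = (int)(k % 2); /* alternate between two async queues */
        const float *xc = x + start;
        float *yc = y + start;

        /* Before reusing queue q, wait for the chunk previously issued
         * on it, so that at most two chunks are in flight at a time. */
        #pragma acc wait(q)

        /* Copy-in, kernel, and copy-out are all enqueued on queue q;
         * work on the other queue can proceed concurrently. */
        #pragma acc parallel loop copyin(xc[0:len]) copyout(yc[0:len]) async(q)
        for (size_t i = 0; i < len; ++i) {
            yc[i] = 2.0f * xc[i];
        }
    }

    /* Wait for all outstanding chunks before returning to the caller. */
    #pragma acc wait
}
```

Calling scale_chunked(x, y, n, chunk) with a chunk size well below the device memory capacity is the intended use. When compiled without OpenACC support, the pragmas are ignored and the function runs sequentially on the CPU with the same result.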