Using the Semi-Stencil Algorithm to Accelerate High-Order Stencils on GPUs
Parallel Programming Languages and Models
Time: Monday, 15 November 2021, 2pm - 2:20pm CST
Description: Developing efficient high-order stencils for graphics processing units (GPUs) is a topic of great interest in many application domains. High-performance stencils on GPUs must be tailored for data-parallel computation and must use the memory hierarchy efficiently. For data-intensive high-order stencils, the key to high performance on GPUs is reducing the shared memory footprint so that a large thread block can be used to hide memory latency. In this paper, we use the semi-stencil algorithm to achieve this reduction. On the NVIDIA A100, a CUDA implementation of the semi-stencil algorithm, combined with other optimizations, achieves a 2.13x speedup over an OpenACC reference implementation and is 8.7% faster than the best conventional stencil implementation computing out of shared memory. We evaluate the performance of our implementations and their variants on the latest NVIDIA GPUs.
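The idea behind the semi-stencil algorithm can be illustrated with a minimal one-dimensional sketch in plain Python. This is not the CUDA kernel from the paper; it only assumes the commonly described formulation in which each output point is built in two steps: a forward update that accumulates the center and left-neighbor contributions, and a backward update that later adds the right-neighbor contributions. The function names `conventional_stencil` and `semi_stencil` and the coefficient layout (`coeffs[0]` for the center, `coeffs[k]` for distance `k`) are hypothetical choices made for illustration. The point to notice is that each sweep step touches a window of only `r + 1` inputs rather than `2r + 1`, which is the kind of footprint reduction the description refers to.

```python
def conventional_stencil(x, coeffs):
    """Direct symmetric 1D stencil of radius r = len(coeffs) - 1.

    y[i] = coeffs[0]*x[i] + sum_{k=1..r} coeffs[k]*(x[i-k] + x[i+k])
    Boundary points (fewer than r neighbors on a side) are left at 0.
    """
    r = len(coeffs) - 1
    n = len(x)
    y = [0.0] * n
    for i in range(r, n - r):
        acc = coeffs[0] * x[i]
        for k in range(1, r + 1):
            acc += coeffs[k] * (x[i - k] + x[i + k])  # reads 2r + 1 inputs
        y[i] = acc
    return y


def semi_stencil(x, coeffs):
    """Same result, computed as forward and backward partial updates.

    At sweep position i only the window x[i .. i+r] (r + 1 values) is read:
      forward  update: start point i + r with its center and left neighbors
      backward update: finish point i by adding its right neighbors
    """
    r = len(coeffs) - 1
    n = len(x)
    y = [0.0] * n
    for i in range(0, n - r):
        j = i + r
        # Forward update: partial result for interior point j = i + r.
        if j < n - r:
            partial = coeffs[0] * x[j]
            for k in range(1, r + 1):
                partial += coeffs[k] * x[j - k]  # x[j-k] lies in x[i .. i+r-1]
            y[j] = partial
        # Backward update: complete interior point i, started r steps earlier.
        if i >= r:
            for k in range(1, r + 1):
                y[i] += coeffs[k] * x[i + k]  # x[i+k] lies in x[i+1 .. i+r]
    return y


if __name__ == "__main__":
    data = [float((i * i) % 7) for i in range(16)]
    c = [-2.5, 1.25, -0.125]  # radius-2 example coefficients (illustrative)
    a = conventional_stencil(data, c)
    b = semi_stencil(data, c)
    print(max(abs(u - v) for u, v in zip(a, b)))
```

The two functions add the same terms in a different order, so their results agree up to floating-point rounding. In a GPU setting the smaller per-step window is what would translate into less shared memory per thread block, although how the paper maps the forward and backward updates onto threads and memory is not described here.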