Memory Optimizations for Sparse Linear Algebra on GPU Hardware

SC21 Proceedings

Workshop: MCHPC’21: Workshop on Memory Centric High Performance Computing

Authors: Aaron Walden (NASA Langley Research Center), Mohammad Zubair (Old Dominion University), Christopher Stone (National Institute of Aerospace), and Eric Nielsen (NASA Langley Research Center)

Abstract: An effort to maximize memory bandwidth utilization for a sparse linear algebra kernel executing on NVIDIA Tesla V100 and A100 Graphics Processing Units (GPUs) is described. The kernel consists of a block-sparse matrix-vector product and a series of forward/backward triangular solves. The computation is memory-bound and exhibits low arithmetic intensity. An earlier implementation yielded good memory performance on the V100 architecture. However, a new approach, which assigns a warp to six rows of the matrix, is proposed for the A100. In addition, two new features offered by the A100 architecture are explored: L2 residency control, which enables a portion of the L2 cache to be used for persistent data access, and the asynchronous copy instruction, which allows data to be loaded directly from main memory into shared memory. The new implementation improves memory bandwidth utilization from 71.5% to 81.2% of the peak available on the A100 architecture.
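
For readers unfamiliar with the kernel named in the abstract, the following is a minimal CPU reference sketch of a block-sparse matrix-vector product. It is not the authors' GPU implementation. The abstract does not specify the storage format, block size, or precision, so the block compressed sparse row (BSR) layout, the `BsrMatrix` struct, the `bsr_spmv` function, and the use of double precision are all assumptions made here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical BSR (block compressed sparse row) container with square
// B x B blocks. The layout is an assumption; the abstract does not state
// how the matrix is actually stored.
struct BsrMatrix {
    std::size_t block_size;            // B: rows/columns per dense block
    std::size_t num_block_rows;        // number of block rows
    std::size_t num_block_cols;        // number of block columns
    std::vector<std::size_t> row_ptr;  // size num_block_rows + 1
    std::vector<std::size_t> col_idx;  // block-column index of each stored block
    std::vector<double> values;        // B*B entries per block, row-major
};

// Computes y = A * x for a BSR matrix A.
std::vector<double> bsr_spmv(const BsrMatrix& a, const std::vector<double>& x) {
    const std::size_t b = a.block_size;
    assert(a.row_ptr.size() == a.num_block_rows + 1);
    assert(a.values.size() == a.col_idx.size() * b * b);
    assert(x.size() == a.num_block_cols * b);

    std::vector<double> y(a.num_block_rows * b, 0.0);
    for (std::size_t br = 0; br < a.num_block_rows; ++br) {
        // Visit every stored block in this block row.
        for (std::size_t k = a.row_ptr[br]; k < a.row_ptr[br + 1]; ++k) {
            const double* block = &a.values[k * b * b];
            const double* x_sub = &x[a.col_idx[k] * b];
            // Dense B x B block times the matching length-B slice of x.
            for (std::size_t i = 0; i < b; ++i) {
                double sum = 0.0;
                for (std::size_t j = 0; j < b; ++j) {
                    sum += block[i * b + j] * x_sub[j];
                }
                y[br * b + i] += sum;
            }
        }
    }
    return y;
}
```

In this sketch each stored matrix value is read once and used for one multiply and one add. Under the double-precision assumption, that is at most 2 floating-point operations per 8 bytes of matrix data, which illustrates why a kernel of this kind has low arithmetic intensity and is limited by memory bandwidth rather than compute throughput.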
