Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

SC21 Proceedings


Authors: Benjamin Y. Cho (University of Texas; Advanced Micro Devices (AMD) Inc.), Jeageun Jung (University of Texas), and Mattan Erez (University of Texas)

Abstract: Matrix-matrix multiplication operations (GEMMs) are important in many HPC and machine-learning applications. They are often mapped to discrete accelerators (e.g., GPUs) to improve performance. We find, however, that large tall/skinny and fat/short matrices benefit little from discrete acceleration and also do not perform well on a CPU. Such matrices are prevalent in important workloads, such as deep-learning inference within large-scale datacenters. We demonstrate the large potential of accelerating these GEMMs with processing in the main CPU memory, where processing-in-memory units (PIMs) take advantage of otherwise untapped bandwidth without requiring data copies. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU. Our evaluation of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU and prior main-memory acceleration approaches.
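
The abstract's claim that tall/skinny and fat/short GEMMs are bandwidth-bound can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the function name, the 4-byte element size, the example shapes, and the assumption that each matrix crosses the memory bus exactly once are all illustrative choices.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=4):
    """Estimate FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes (hypothetically) perfect on-chip reuse, so A and B are each
    read once and C is written once. Real hardware moves at least this much.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# A square GEMM reuses every loaded element many times.
square = gemm_arithmetic_intensity(1024, 1024, 1024)

# A skinny GEMM (e.g. a small inference batch of 4) barely reuses the
# large weight matrix, so memory traffic dominates.
skinny = gemm_arithmetic_intensity(1024, 4, 1024)

print(f"square 1024x1024x1024: {square:.1f} FLOPs/byte")
print(f"skinny 1024x4x1024:    {skinny:.1f} FLOPs/byte")
```

Under these assumptions the skinny case performs only about two floating-point operations per byte moved, compared with well over a hundred for the square case, so its run time is set by memory bandwidth rather than compute throughput. That is the regime in which processing near main memory has room to help.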
