Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators
Time: Wednesday, 17 November 2021, 11am - 11:30am CST
Description: Matrix-matrix multiplication operations (GEMMs) are important in many HPC and machine-learning applications. They are often mapped to discrete accelerators (e.g., GPUs) to improve performance. We find, however, that large tall/skinny and fat/short matrices benefit little from discrete acceleration and also do not perform well on a CPU. Such matrices are prevalent in important workloads, such as deep-learning inference within large-scale datacenters. We demonstrate the large potential of accelerating these GEMMs with processing in the main CPU memory, where processing-in-memory units (PIMs) take advantage of otherwise untapped bandwidth without requiring data copies. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU. Our evaluation of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU and prior main-memory acceleration approaches.
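Why tall/skinny and fat/short GEMMs are bandwidth-bound can be seen from their arithmetic intensity (FLOPs per byte of memory traffic): when one dimension is small, intensity collapses toward that small dimension, falling far below the compute/bandwidth ratio of a GPU or CPU. The sketch below illustrates this with example dimensions chosen for illustration (they are assumptions, not figures from the talk).

```python
def arithmetic_intensity(m, k, n, dtype_bytes=4):
    """FLOPs per byte of off-chip traffic for an (m x k) @ (k x n) GEMM.

    Assumes ideal caching: each operand matrix is read once and the
    result is written once. Real traffic is higher, so these numbers
    are optimistic upper bounds on intensity.
    """
    flops = 2 * m * k * n                                 # one multiply-add per element triple
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes   # A + B read, C written
    return flops / bytes_moved

# Square GEMM: intensity grows with the matrix size, so it is
# compute-bound on typical accelerators.
square = arithmetic_intensity(4096, 4096, 4096)

# Fat/short GEMM: a large weight matrix applied to a small inference
# batch (n = 8). Intensity is roughly n/2 FLOP/byte regardless of how
# large m and k are, so the operation is limited by memory bandwidth,
# not compute -- the regime where in-memory processing can help.
skinny = arithmetic_intensity(4096, 4096, 8)

print(f"square GEMM:    {square:.1f} FLOP/byte")
print(f"fat/short GEMM: {skinny:.1f} FLOP/byte")
```

Because the fat/short case moves almost the entire weight matrix per handful of output columns, a discrete accelerator spends its time waiting on memory (or on copying data over PCIe), which is the gap the in-memory approach targets.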