Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators
Time: Wednesday, 17 November 2021, 11am - 11:30am CST
Description: Matrix-matrix multiplication operations (GEMMs) are important in many HPC and machine-learning applications. They are often mapped to discrete accelerators (e.g., GPUs) to improve performance. We find, however, that large tall/skinny and fat/short matrices benefit little from discrete acceleration and also do not perform well on a CPU. Such matrices are prevalent in important workloads, such as deep-learning inference within large-scale datacenters. We demonstrate the large potential of accelerating these GEMMs with processing in the main CPU memory, where processing-in-memory units (PIMs) take advantage of otherwise untapped bandwidth without requiring data copies. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU. Our evaluation of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU and prior main-memory acceleration approaches.
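Why tall/skinny and fat/short GEMMs are bandwidth-bound can be seen from their arithmetic intensity (FLOPs per byte of memory traffic): when one dimension is small, intensity collapses toward that small dimension, falling far below the compute/bandwidth ratio of a GPU or CPU. The sketch below illustrates this with example dimensions chosen for illustration (they are assumptions, not figures from the talk).

```python
def arithmetic_intensity(m, k, n, dtype_bytes=4):
    """FLOPs per byte of off-chip traffic for an (m x k) @ (k x n) GEMM.

    Assumes ideal caching: each operand matrix is read once and the
    result is written once. Real traffic is higher, so these numbers
    are optimistic upper bounds on intensity.
    """
    flops = 2 * m * k * n                                 # one multiply-add per element triple
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes   # A + B read, C written
    return flops / bytes_moved

# Square GEMM: intensity grows with the matrix size, so it is
# compute-bound on typical accelerators.
square = arithmetic_intensity(4096, 4096, 4096)

# Fat/short GEMM: a large weight matrix applied to a small inference
# batch (n = 8). Intensity is roughly n/2 FLOP/byte regardless of how
# large m and k are, so the operation is limited by memory bandwidth,
# not compute -- the regime where in-memory processing can help.
skinny = arithmetic_intensity(4096, 4096, 8)

print(f"square GEMM:    {square:.1f} FLOP/byte")
print(f"fat/short GEMM: {skinny:.1f} FLOP/byte")
```

Because the fat/short case moves almost the entire weight matrix per handful of output columns, a discrete accelerator spends its time waiting on memory (or on copying data over PCIe), which is the gap the in-memory approach targets.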