Presentation

E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs

Session: Large Scale Neural Network Training: Part I

Event Type: Paper

Time: Tuesday, 16 November 2021, 1:30pm - 2pm CST

Location: 230-231-232

Description: Transformer-based deep learning models have become a ubiquitous vehicle driving a variety of natural language processing (NLP)-related tasks beyond their accuracy ceiling. These models, however, suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which re-thinks self-attention computation for transformer models on GPUs with the following contributions. First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence length-aware optimizations, as well as operation reordering optimizations. Second, we achieve tensor core-aware weight pruning by revamping existing pruning algorithms and designing new ones for transformers. This work goes further by introducing an attention-aware adaptive pruning design. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base and DistilBERT, where E.T. outperforms mainstream projects, including the popular NVIDIA enterprise solutions TensorRT and FasterTransformer.
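For readers unfamiliar with the computation the abstract refers to, the following is a minimal sketch of standard single-head scaled dot-product self-attention in plain Python. It is the textbook baseline, not the E.T. operators themselves, which the abstract does not specify. All function and parameter names here (self_attention, w_q, w_k, w_v) are illustrative assumptions.

```python
import math

def matmul(a, b):
    # Naive dense matrix product of two row-major nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection weights.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Attention scores: Q K^T scaled by sqrt(d_k), then a row-wise softmax.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, v), weights

# Toy input: two tokens, identity projections (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(identity, identity, identity, identity)
```

The three chained matrix products and the softmax between them are the steps that a GPU implementation would typically target when fusing or reordering operations. How E.T. does so is described in the paper, not in this sketch.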

Paper available from the ACM OpenTOC

Authors

Shiyang Chen

Stevens Institute of Technology

Shaoyi Huang

University of Connecticut