E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs

SC21 Proceedings

Authors: Shiyang Chen (Stevens Institute of Technology), Shaoyi Huang (University of Connecticut), Santosh Pandey (Stevens Institute of Technology), Bingbing Li (University of Connecticut), Guang R. Gao and Long Zheng (University of Delaware), Caiwen Ding (University of Connecticut), and Hang Liu (Stevens Institute of Technology)

Abstract: Transformer-based deep learning models have become a ubiquitous vehicle driving a variety of natural language processing (NLP)-related tasks beyond their accuracy ceiling. These models, however, suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which re-thinks self-attention computation for transformer models on GPUs with the following contributions. First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, as well as operation reordering optimizations. Second, we achieve tensor-core-aware weight pruning by revamping the existing pruning algorithms, as well as designing new ones for transformers. This work goes further by introducing an attention-aware adaptive pruning design. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-Base, and DistilBERT, where E.T. presents superior performance over the mainstream projects, including the popular NVIDIA enterprise solutions, namely TensorRT and FasterTransformer.
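
For orientation, the baseline computation that the abstract refers to can be sketched as follows. This is the standard single-head scaled dot-product self-attention, written in plain Python for readability. It is not E.T.'s tailored operators, reordering, or pruning scheme, none of which the abstract specifies. The function and parameter names (`self_attention`, `w_q`, `w_k`, `w_v`) are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Standard single-head self-attention: softmax(Q K^T / sqrt(d_k)) V.

    x is (sequence_length x d_model); w_q, w_k, w_v are hypothetical
    projection matrices of shape (d_model x d_k).
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Pairwise token scores, scaled by sqrt(d_k) before the softmax.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted average of the value vectors.
    return matmul(weights, v)
```

The score matrix grows quadratically with sequence length, and the three projections are dense matrix products. In this baseline formulation, these are the pieces that sequence-length-aware optimization and weight pruning would act on.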
