E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs
Machine Learning and Artificial Intelligence
Time: Tuesday, 16 November 2021, 1:30pm - 2pm CST
Description: Transformer-based deep learning models have become ubiquitous, pushing a variety of natural language processing (NLP) tasks beyond their previous accuracy ceilings. These models, however, suffer from two pronounced challenges: gigantic model size and prolonged turnaround time. To this end, we introduce E.T., which re-thinks self-attention computation for Transformer models on GPUs with the following contributions. First, we introduce a novel self-attention architecture, which encompasses two tailored self-attention operators with corresponding sequence-length-aware optimizations, as well as operation reordering optimizations. Second, we achieve tensor-core-aware weight pruning by revamping existing pruning algorithms and designing new ones for Transformers. This work goes further by introducing an attention-aware adaptive pruning design. Taken together, we evaluate E.T. across a variety of benchmarks for Transformer, BERT-BASE, and DistilBERT, where E.T. delivers superior performance over mainstream projects, including the popular NVIDIA enterprise solutions TensorRT and FasterTransformer.
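To make the idea of hardware-friendly pruning concrete, the following is a minimal sketch of generic tile-wise magnitude pruning: whole square blocks of a weight matrix are zeroed so that the surviving weights stay in dense tiles that a matrix unit could process. This is an illustration only, not E.T.'s pruning algorithm. The function name `prune_tiles`, the L2-norm scoring rule, and the tile size are all assumptions made for the example.

```python
import math

def prune_tiles(weights, tile, sparsity):
    """Zero out the lowest-norm (tile x tile) blocks of a 2-D weight matrix.

    Hypothetical helper for illustration; not taken from E.T.
    weights  : list of equal-length lists of floats (rows x cols)
    tile     : block edge length; rows and cols must be multiples of it
    sparsity : fraction of tiles to zero, in [0, 1]
    Returns a new matrix; the input is left unchanged.
    """
    rows, cols = len(weights), len(weights[0])
    if rows % tile or cols % tile:
        raise ValueError("matrix dimensions must be multiples of the tile size")

    # Score each tile by its L2 norm (an assumed importance measure).
    scores = []
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            sq = sum(weights[i][j] ** 2
                     for i in range(r, r + tile)
                     for j in range(c, c + tile))
            scores.append((math.sqrt(sq), r, c))

    # Prune the weakest tiles first.
    scores.sort(key=lambda s: s[0])
    n_prune = int(len(scores) * sparsity)

    pruned = [row[:] for row in weights]
    for _, r, c in scores[:n_prune]:
        for i in range(r, r + tile):
            for j in range(c, c + tile):
                pruned[i][j] = 0.0
    return pruned


W = [
    [1.0, 1.0, 0.1, 0.1],
    [1.0, 1.0, 0.1, 0.1],
    [5.0, 5.0, 2.0, 2.0],
    [5.0, 5.0, 2.0, 2.0],
]
pruned = prune_tiles(W, tile=2, sparsity=0.5)
```

With 2x2 tiles and 50% sparsity, the two top tiles have the smallest norms and are zeroed, while the bottom two tiles are kept intact. A real tensor-core-oriented scheme would pick the tile shape to match the hardware's matrix-multiply fragment size and would typically retrain after pruning.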