No Travel? No Problem.

Remote Participation
Early Career Lighting Talks – Dynamically Tiling Sparse Data for More Efficient Memory Accesses
Author/Presenter
Event Type
Workshop
Tags
Online Only
Career Development
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
HPC Community Collaboration
Registration Categories
W
TimeSunday, 14 November 20214:10pm - 4:15pm CST
LocationOnline
DescriptionTensor algebra is a ubiquitous primitive in numerical computations. When considering general semirings, tensor arithmetic can also be used to express graph algorithms [1,3,13,14,18,21]. In particular, sparse tensor algebra involving multiple sparse tensors is a key kernel in several high-performance computing (HPC) applications. The simplest multi-sparse kernel is the multiplication of two sparse matrices, or sparse-sparse matrix multiplication (SpMSpM). The multidimensional analogues to SpMSpM involve tensor contraction—the higher-order equivalent of a dot product. Contractions of sparse tensors are important primitives in many of the core areas of tensor computations, including computational chemistry methods [5,12,19] and sparse tensor decomposition [2,9].

Despite their importance, the performance of these sparse tensor kernels on modern architectures is memory bound and suffers from low arithmetic intensity. This arises from poor data reuse across operands and excess memory accesses. Prior works attempt to improve reuse through changes in dataflow or through static tiling. For example, in sparse-sparse matrix multiplication (SpGEMM), an outer-product algorithm provides output reuse, while the row-wise algorithm provides reuse on only one operand. Due to irregular sparsity, traditional dense and static tiling is insufficient in bringing overall memory accesses down to the lower bound. Our proposed portable primitive, dynamic reflexive tiling (DRT), accounts for irregularity by co-tiling all operands at runtime for improved data reuse, memory accesses, and overall arithmetic intensity.

The key idea in DRT is to dynamically co-tile all tensors into non-uniform coordinate-space regions: tiles whose volumes differ when measured in coordinates. This technique accounts for irregular data sparsity by maximizing tile occupancy, subject to the buffer capacity. To maximize utilization across the duration of kernel execution, DRT not only changes tile shape across different regions of each tensor but also over time for the same region, based on how data is later reused. We implement DRT as a hardware mechanism integrated into prior accelerators. We add a new, pipelined hardware unit called the tile extractor, which dynamically builds new sub-tiles and distributes them to the next level of memory or compute in the hierarchy. We foresee that this capability, either as a portable hardware unit or as a configurable software primitive, will be integrated into architectures like GPUs to enable DRT on traditional platforms.

Figure 1 shows the overall DRAM traffic, for SpGEMM, in an outer-product implementation, row-wise product implementation, static tiling implementation, and in our proposed DRT mechanism integrated into an ExTensor-like [11] architecture. The red dots indicate the lower bound on memory accesses. These results indicate that DRT integrated within ExTensor provides significant reductions in memory traffic across all operands. Figure 2 shows the overall impact of DRT on an outer-product algorithm, an inner-product algorithm, and in our designed accelerator, respectively. The red dots indicate the maximum achievable performance for the current arithmetic intensity. Both results are from experiments over a subset of the SuiteSparse matrices, for a matrix multiplied by itself. Overall, DRT successfully improves arithmetic intensity over the baseline implementations (by 5.1x and 2.6x, respectively), while also improving performance over an SpGEMM implementation on the CPU.
Back To Top Button