Enabling and Scaling the HPCG Benchmark on the Newest Generation Sunway Supercomputer with 42 Million Heterogeneous Cores
Best Student Paper Finalist
Time: Wednesday, 17 November 2021, 2:30pm - 3pm CST
Description: We study and evaluate performance optimization techniques for the HPCG benchmark on the newest generation Sunway supercomputer. Specifically, a two-level blocking scheme is proposed to expose adequate parallelism in the symmetric Gauss-Seidel kernel while keeping a fast convergence rate; a fine-grained kernel fusion technique is developed to alleviate the bandwidth load on the small-capacity local storage; and a low-overhead thread collaboration method is presented to move data efficiently between threads and hide the cost by overlapping it with data transfer operations. Test results show that the optimized HPCG code exploits 73.0% of the theoretical memory bandwidth and scales to over 42 million heterogeneous cores with 95.5% weak-scaling efficiency and 5.91 PFLOPS performance. We also study how performance can be improved when the specific rules of HPCG are not fully obeyed, and design dependency-preserving parallelization and vectorization methods that further boost performance to 27.6 PFLOPS.
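For readers unfamiliar with the kernel named in the description, the following is a minimal textbook sketch of a symmetric Gauss-Seidel sweep in Python. It is not the paper's two-level blocking scheme or its Sunway implementation. The small tridiagonal test matrix and the helper names (`tridiagonal`, `matvec`, `symgs_sweep`) are assumptions made purely for illustration. The sketch shows why the kernel is hard to parallelize: each update of `x[i]` reads values of `x` already updated earlier in the same pass.

```python
def tridiagonal(n, diag=4.0, off=-1.0):
    """Build a small dense, diagonally dominant test matrix (illustrative only)."""
    return [[diag if i == j else (off if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]


def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def symgs_sweep(A, b, x):
    """One symmetric Gauss-Seidel sweep on A x = b, updating x in place.

    A forward pass (rows 0..n-1) is followed by a backward pass (rows n-1..0).
    """
    n = len(b)
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            # Uses the most recent x[j], including entries updated earlier in
            # this same pass; this is the loop-carried dependency.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x


n = 8
A = tridiagonal(n)
b = matvec(A, [1.0] * n)   # right-hand side chosen so the exact solution is all ones
x = [0.0] * n              # zero initial guess
for _ in range(50):
    symgs_sweep(A, b, x)

max_error = max(abs(v - 1.0) for v in x)
print(max_error)
```

Because row `i` depends on rows processed before it, a naive parallel loop over `i` would change the numerical result. Blocking or reordering schemes trade some of this dependency for parallelism, which can affect the convergence rate, the tension the description refers to.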