A FACT-Based Approach: Making ML Collective Autotuning Feasible on Exascale Systems

SC21 Proceedings

Workshop: ExaMPI: Workshop on Exascale MPI

Authors: Michael Wilkins (Northwestern University), Yanfei Guo and Rajeev Thakur (Argonne National Laboratory (ANL)), Nikos Hardavellas and Peter Dinda (Northwestern University), and Min Si (Facebook)

Abstract: Machine learning (ML) autotuners use supervised learning to select MPI collective algorithms, significantly improving collective performance. However, a user may find it difficult to understand the benefit of autotuners because we lack a methodology to quantify their performance. Additionally, to obtain the advertised performance, ML model training requires benchmark data from a vast majority of the feature space. Collecting such data regularly on large-scale systems consumes far too much time and too many resources. To address these challenges, we contribute (1) a performance evaluation framework to compare and improve collective autotuner designs and (2) the Feature scaling, Active learning, Converge, Tune hyperparameters (FACT) approach, a three-part methodology to minimize the training data collection time (and thus maximize practicality at larger scale) without sacrificing accuracy. On a production-scale system, our methodology produces a model of equal accuracy using 6.88x less training data collection time.
