A FACT-Based Approach: Making ML Collective Autotuning Feasible on Exascale Systems
Parallel Programming Languages and Models
Time: Sunday, 14 November 2021, 2:30pm - 3pm CST
Description: Machine learning (ML) autotuners use supervised learning to select MPI collective algorithms, significantly improving collective performance. However, users may find it difficult to gauge the benefit of autotuners because no methodology exists to quantify their performance. Additionally, to achieve the advertised performance, ML model training requires benchmark data covering most of the feature space, and collecting such data regularly on large-scale systems consumes far too much time and too many resources. To address these challenges, we contribute (1) a performance evaluation framework to compare and improve collective autotuner designs, and (2) the Feature scaling, Active learning, Converge, Tune hyperparameters (FACT) approach, a three-part methodology that minimizes training data collection time (and thus maximizes practicality at larger scales) without sacrificing accuracy. On a production-scale system, our methodology produces a model of equal accuracy using 6.88x less training data collection time.
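To make the data-reduction idea concrete, here is a minimal sketch of pool-based active learning for a collective-algorithm selector. This is not the authors' implementation: the benchmark oracle, the algorithm names, the log2 feature scaling, and the farthest-point query rule (a cheap geometric stand-in for model uncertainty) are all illustrative assumptions.

```python
import math

def oracle(msg_size, nodes):
    # Toy stand-in for running a collective benchmark and recording the
    # faster algorithm; the 4096-byte threshold is invented for the demo.
    return "recursive_doubling" if msg_size < 4096 else "ring"

def features(point):
    # Feature scaling: log2-transform so message sizes and node counts
    # spanning orders of magnitude become comparable distances.
    return tuple(math.log2(v) for v in point)

def dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(features(a), features(b)))

def predict(labeled, point):
    # 1-nearest-neighbour selector over the labeled benchmark results.
    return min(labeled, key=lambda lp: dist(lp[0], point))[1]

def active_learn(pool, budget):
    labeled = [(pool[0], oracle(*pool[0]))]
    while len(labeled) < budget:
        seen = {p for p, _ in labeled}
        # Query the unlabeled point farthest from every labeled one --
        # a simple proxy for "the model is least certain here".
        query = max((p for p in pool if p not in seen),
                    key=lambda p: min(dist(p, q) for q, _ in labeled))
        labeled.append((query, oracle(*query)))  # benchmark only this point
    return labeled

# Candidate configurations: message sizes 16 B - 1 MiB, 2 - 128 nodes.
pool = [(2 ** s, 2 ** n) for s in range(4, 21) for n in range(1, 8)]
model = active_learn(pool, budget=12)
print(f"benchmarked {len(model)} of {len(pool)} candidate configurations")
```

The point of the sketch is the cost model: each call to `oracle` represents an expensive benchmark run on the target system, so labeling a small, actively chosen subset of the pool rather than the whole grid is where the training-data collection time is saved.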