Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing

SC21 Proceedings

Workshop: 7th Workshop on Machine Learning in High Performance Environment

Authors: Logan Ward and Ganesh Sivaraman (Argonne National Laboratory (ANL)); Greg Pauloski and Yadu Babuji (University of Chicago); Ryan Chard, Naveen Dandu, Paul Redfern, and Rajeev Assary (Argonne National Laboratory (ANL)); Kyle Chard (University of Chicago); and Larry Curtiss, Rajeev Thakur, and Ian Foster (Argonne National Laboratory (ANL))

Abstract: Scientific applications that involve simulation ensembles can be accelerated greatly by using experiment design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing just the implementations of individual tasks plus the logic used to choose which tasks to execute when. Colmena handles task dispatch, results collation, ML model invocation, and ML model (re)training, using Parsl to execute tasks on HPC systems. We describe the design of Colmena and illustrate its capabilities by applying it to electrolyte design, where it both scales to 65,536 CPUs and accelerates the discovery rate for high-performance molecules by a factor of 100 over unguided searches.
