Generalizable Coordination of Large Multiscale Ensembles: Challenges and Learnings at Scale
Machine Learning and Artificial Intelligence
Resource Management and Scheduling
Time: Tuesday, 16 November 2021, 10:30am - 11am CST
Description: The advancement of machine learning techniques and the availability of heterogeneous computing are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components to solve complex problems at multiple scales. Nevertheless, the current capabilities are limited to coupling two scales.
In the first-ever demonstration using three resolution scales, we present a scalable and generalizable framework that expands MuMMI, an award-winning workflow, beyond its original design. We discuss the challenges and learnings in executing a massive simulation campaign that utilized over 600,000 node-hours on Summit, achieving more than 98% GPU occupancy for over 83% of the time. We enable orders-of-magnitude scaling, including coordinating 24,000 jobs and managing several TB of new data per day and over a billion files in total. Finally, we describe the generalizability of our framework and discuss how it may be used for new applications.