Intelligent Job Scheduling for Next Generation HPC Systems
Time: Wednesday, 17 November 2021, 8:30am - 5pm CST
Location: Second Floor Atrium
Description: Both high performance computing (HPC) infrastructures and applications are undergoing significant changes. Emerging HPC applications are not only compute-intensive, but also data- and memory-intensive. To meet these diverse workload demands, new hardware components, such as GPUs and burst buffers, are being incorporated into next generation systems. However, existing HPC job schedulers typically rely on simple heuristics to schedule jobs. Rapid development in system infrastructure and increasingly diverse workloads pose serious challenges to these traditional heuristic approaches. We propose an intelligent HPC job scheduling framework to address these emerging challenges. Our research takes advantage of advanced machine learning and optimization techniques to extract useful workload- and system-specific information and to guide the framework toward informed scheduling decisions under various system configurations and diverse workloads. Our framework consists of three main components. The first component is a job runtime adjuster, which leverages a machine learning model to improve the accuracy of user-provided job runtime estimates. The second component enhances multi-resource scheduling by exploring a multi-objective genetic algorithm. The third component enables the scheduler to automatically learn efficient scheduling policies via reinforcement learning. Our proposed design demonstrates significant performance improvements over state-of-the-art schedulers under various resource and application settings.
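
To make the multi-objective idea in the second component concrete, the sketch below shows Pareto (non-dominated) filtering, the selection step that multi-objective genetic algorithms commonly build on. It is an illustration only and not the authors' implementation: the two objectives (compute-node utilization and burst-buffer utilization), the candidate scores, and the names `dominates` and `pareto_front` are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical scoring: each candidate schedule gets one score per resource,
# (node utilization, burst-buffer utilization), both to be maximized.
Score = Tuple[float, float]


def dominates(a: Score, b: Score) -> bool:
    """Return True if a is no worse than b on every objective and better on at least one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: List[Score]) -> List[Score]:
    """Keep only the candidates that no other candidate dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Made-up scores for four candidate schedules (illustrative values only).
candidates = [(0.90, 0.40), (0.70, 0.80), (0.60, 0.60), (0.90, 0.30)]
front = pareto_front(candidates)
print(front)
```

A scheduler built this way would still need a rule for picking one schedule from the surviving trade-offs, for example a site-specific preference between resources. That choice is outside what the description states.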