HyperQueue: Overcoming Limitations of HPC Job Managers
TimeThursday, 18 November 20218:30am - 5pm CST
LocationSecond Floor Atrium
DescriptionIn recent years, HPC workloads and communities have undergone substantial paradigm shifts. There is an increasing amount of users that want to leverage HPC clusters to execute many simple and embarrassingly parallel tasks as easily as possible. Due to the limitations of traditional HPC job managers, however, these users must often resort to manual aggregation of tasks into a smaller number of jobs to reduce job manager overhead. This approach is both labour-intensive and inefficient, as it lacks dynamic load balancing required to fully utilize computational nodes with tens or hundreds of cores. We introduce HyperQueue, a task scheduling runtime that can execute a large amount of tasks on top of an HPC job manager by automatically aggregating tasks into jobs and dynamically load balancing them across all allocated nodes and CPU cores. HyperQueue is an open-source tool that is designed for ease of use and deployment.