SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

HyperQueue: Overcoming Limitations of HPC Job Managers

Authors: Stanislav Böhm, Jakub Beránek, Vojtěch Cima, Roman Macháček, Vyomkesh Jha, Alfréd Kočí, Branislav Jansík, and Jan Martinovič (IT4Innovations, Czech Republic)

Abstract: In recent years, HPC workloads and communities have undergone substantial paradigm shifts. An increasing number of users want to leverage HPC clusters to execute many simple and embarrassingly parallel tasks as easily as possible. Due to the limitations of traditional HPC job managers, however, these users must often resort to manually aggregating tasks into a smaller number of jobs to reduce job manager overhead. This approach is both labour-intensive and inefficient, as it lacks the dynamic load balancing required to fully utilize computational nodes with tens or hundreds of cores. We introduce HyperQueue, a task scheduling runtime that can execute a large number of tasks on top of an HPC job manager by automatically aggregating tasks into jobs and dynamically load balancing them across all allocated nodes and CPU cores. HyperQueue is an open-source tool that is designed for ease of use and deployment.
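
The gap between manual aggregation and dynamic load balancing described in the abstract can be illustrated with a toy model. The sketch below is not HyperQueue's scheduler or API: the function names, the task durations, and the greedy "earliest free worker" policy are all assumptions chosen for illustration. It compares the total runtime (makespan) of splitting tasks into fixed chunks up front against handing each task to whichever worker becomes idle first.

```python
import heapq


def static_makespan(durations, workers):
    """Manual aggregation (hypothetical model): split the task list into
    contiguous, equally sized chunks up front, one chunk per worker."""
    if not durations:
        return 0
    chunk = -(-len(durations) // workers)  # ceiling division
    return max(sum(durations[i:i + chunk]) for i in range(0, len(durations), chunk))


def dynamic_makespan(durations, workers):
    """Dynamic load balancing (hypothetical model): each task is given to
    whichever worker becomes idle first."""
    free_at = [0] * workers  # min-heap of the times at which each worker is idle
    for d in durations:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + d)  # that worker is busy until start + d
    return max(free_at)


# Illustrative task durations (arbitrary units), not measured data.
durations = [8, 7, 6, 5, 1, 1, 1, 1]
print(static_makespan(durations, 2))   # 26
print(dynamic_makespan(durations, 2))  # 15
```

With these made-up durations, the fixed split leaves one worker idle for most of the run, while the dynamic assignment keeps both workers busy until the end. Real workloads and real schedulers are considerably more involved, so this should be read only as an intuition for why static aggregation can underutilize nodes.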

Best Poster Finalist (BP): no

Poster: PDF
Poster summary: PDF
