Workshop: CANOPIE-HPC: Containers and New Orchestration Paradigms for Isolated Environments in HPC
Authors: Claudia Misale and Maurizio Drocco (IBM - TJ Watson Research Center), Daniel Milroy (Lawrence Livermore National Laboratory), Carlos Eduardo Arango Gutierrez (Red Hat Inc), Stephen Herbein and Dong Ahn (Lawrence Livermore National Laboratory), and Yoonho Park (IBM Corporation)
Abstract: In this work, we address the problem of running HPC workloads efficiently on Kubernetes clusters. To do so, we compare the Kubernetes default scheduler with KubeFlux, a Kubernetes plug-in scheduler built on the Flux graph-based scheduler, on a 34-node Red Hat OpenShift cluster on IBM Cloud. We detail how scheduling can affect the performance of GROMACS, a well-known HPC application, and we show that KubeFlux can improve its performance through better pod scheduling. In our tests, KubeFlux tends to limit both the number of subnets spanned by a job and the maximum number of pods per node, which translates to a >2x speedup over the Kubernetes default scheduler in several cases.