Workshop: SC21 SuperCompCloud: 5th International Workshop on Interoperability of Supercomputing and Cloud Technologies
Authors: Sadaf R. Alam (Swiss National Supercomputing Centre (CSCS), ETH Zürich) and Miguel Gila, Mark Klein, and Maxime Martinasso (Swiss National Supercomputing Centre (CSCS))
Abstract: As supercomputing systems gradually become an integral part of data-driven workflows such as ML and AI or tightly coupled pre- and post-processing pipelines, users need programmable access to shared resources to avoid moving large volumes of data to dedicated systems or to public cloud providers. In public clouds, and in private ones built on technologies like OpenStack, multi-tenancy on shared hardware has been commonplace for over a decade, offering users programmable and privileged access to resources like compute, network and storage. Such access is unavailable to users on batch-scheduled, multi-Petascale supercomputing systems, which are designed for achieving close-to-metal performance for scientific applications at unprecedented scales. In this paper, we focus on multi-tenancy within the hardware and software stacks of Cray-HPE EX Shasta supercomputing systems for creating high-performance and cloud clusters for HPC and AI/ML workloads, respectively. Using orchestration examples for zero-downtime upgrades of virtual clusters, we demonstrate the benefits of multi-tenant machines for achieving close-to-metal performance, as well as elasticity and customization of resources without interruption to operational services.