Introducing Cloud-Native Supercomputing: Bare-Metal, Secured Supercomputing Architecture

SC21 Proceedings

Authors: Gilad Shainer (NVIDIA Corporation), Dhabaleswar Panda (Ohio State University), Paul Calleja (University of Cambridge)

Abstract: High-performance computing and artificial intelligence have evolved into the primary data processing engines for wide commercial use, hosting a variety of users and applications. While providing the highest performance, supercomputers must also offer multi-tenancy security. Therefore, they need to be designed as cloud-native platforms. The key element that enables this architecture is the data processing unit (DPU). The DPU is a fully integrated data-center-on-a-chip platform that can manage the data center operating system instead of the host processor, enabling security and orchestration of the supercomputer. This architecture enables supercomputing platforms to deliver bare-metal performance while natively supporting multi-node tenant isolation.

Long Description: High-performance computing and artificial intelligence have driven supercomputers into wide commercial use as the primary data processing engines enabling research, scientific discoveries, and product development. Extracting the highest possible performance from supercomputing systems while achieving efficient utilization has traditionally been incompatible with the secured, multi-tenant architecture of modern cloud computing. A cloud-native supercomputing platform aims to combine peak system performance with a modern zero-trust model for security isolation and multi-tenancy. The key element enabling this architecture transition is the data processing unit (DPU).

The DPU is a fully integrated data-center-on-a-chip platform that gives each supercomputing node two new capabilities. First, an infrastructure control plane processor secures user access, storage access, networking, and life-cycle orchestration for the computing node in the data center or at the edge, offloading these services from the main compute processor and enabling bare-metal multi-tenancy. Second, an isolated line-rate data path with hardware acceleration enables high performance. Together, these capabilities allow a cloud-native HPC and AI platform architecture that delivers HPC performance on infrastructure that meets cloud services requirements. The implementation of the infrastructure comes from the open-source community (for example, UCF OpenSNAPI) and is driven by standards, much as parts of the traditional HPC software stack are maintained by a community that includes commercial companies, academic organizations, and government agencies.

We'll introduce the new supercomputing architecture, discuss the first cloud-native supercomputers, in particular the world's first academic cloud-native supercomputer at the University of Cambridge, review initial application performance results, and explore future directions.
