Designing HPC Systems with High-Performance Networks: Advanced Features, Challenges, and Usage
Time: Sunday, 14 November 2021, 1pm - 5pm CST
Description: As InfiniBand (IB), High-Speed Ethernet (HSE), RoCE, and Omni-Path technologies mature, they are being used to design and deploy HPC clusters with GPGPUs supporting MPI, storage and parallel file systems, cloud computing systems with SR-IOV, and DL/ML and data science systems. These systems bring new challenges in performance, scalability, portability, reliability, and network congestion. This tutorial starts with an overview of these systems. It will then emphasize the advanced hardware and software features of IB, Omni-Path, HSE, and RoCE and their ability to address these challenges. Next, we will focus on OpenFabrics RDMA and Libfabric programming, and on the network management infrastructure and tools needed to use these systems effectively. We will present a common set of challenges faced while designing these systems, followed by case studies covering domain-specific design challenges, their solutions, and sample performance numbers. Finally, hands-on exercises will be carried out with the OpenFabrics/Libfabric software stacks and network management tools.