SC21 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Exploration of Congestion Control Techniques on Dragonfly-Class HPC Networks Through Simulation


Workshop: PMBS21: The 12th International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computer Systems

Authors: Neil McGlohon (Rensselaer Polytechnic Institute (RPI)), Scott Hemmert (Sandia National Laboratories), Kevin A. Brown (Argonne National Laboratory (ANL)), Michael Levenhagen (Sandia National Laboratories), Sudheer Chunduri and Robert B. Ross (Argonne National Laboratory (ANL)), and Christopher D. Carothers (Rensselaer Polytechnic Institute (RPI))


Abstract: Keeping communication latency low in high-performance computing (HPC) networks is critical to the efficient operation of the applications that run on them. Different application operations and task types, such as I/O operations, can create a variety of traffic patterns across the system interconnect. Some of these communication patterns, however, can degrade overall system performance.

One traffic pattern of particular concern is the many-to-one pattern, or incast. When packets sent from many different endpoints target a single destination, or a small number of destinations, they can overwhelm the receiving endpoints' ability to process the traffic, resulting in a cascading effect of induced congestion. This can have broad-reaching, detrimental effects on other applications whose data streams pass through the congested part of the network.

The concept of congestion control has been explored in various HPC system technologies and is an important feature in state-of-the-art networks such as InfiniBand and the Cray Slingshot interconnect. Because access to physical, full-scale interconnects of bleeding-edge design can be challenging and their exact mechanisms of operation are not publicly known, we look to simulation to explore techniques for congestion control with a fine level of flexibility not available on real-world systems.

We present and explore a mechanism for congestion control that seeks to detect network congestion, identify its cause, and abate it by throttling injection at the identified aggressor endpoints. Our work proposes, discusses, and evaluates two similar implementations of this mechanism in two different network simulators, assessing their ability to mitigate the effects of congestion on application communication performance and on system-wide packet latencies.
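The detect, identify, and throttle steps described above can be illustrated with a toy sketch. This is not the implementation evaluated in the paper: the function names, the queue-depth threshold, the traffic-share cutoff, and the rate-halving factor are all hypothetical values chosen only to show the shape of such a control loop on an incast pattern.

```python
from collections import Counter

def find_congested(queue_depth, threshold):
    """Detect: destinations whose receive-queue depth exceeds a threshold (assumed metric)."""
    return {dst for dst, depth in queue_depth.items() if depth > threshold}

def find_aggressors(flows, congested, share=0.2):
    """Identify: sources contributing at least `share` of the packets
    bound for any congested destination (assumed heuristic)."""
    aggressors = set()
    for dst in congested:
        per_src = Counter()
        for (src, d), pkts in flows.items():
            if d == dst:
                per_src[src] += pkts
        total = sum(per_src.values())
        for src, pkts in per_src.items():
            if total and pkts / total >= share:
                aggressors.add(src)
    return aggressors

def throttle(rates, aggressors, factor=0.5, floor=0.05):
    """Abate: scale down the injection rate of aggressors only, never below `floor`."""
    return {src: max(floor, r * factor) if src in aggressors else r
            for src, r in rates.items()}

# Incast: endpoints 1-4 all target endpoint 0; endpoint 9 sends lightly to endpoint 8.
flows = {(1, 0): 100, (2, 0): 100, (3, 0): 100, (4, 0): 100, (9, 8): 10}
queue_depth = {0: 400, 8: 10}           # packets waiting at each destination
rates = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 9: 1.0}

congested = find_congested(queue_depth, threshold=64)
aggressors = find_aggressors(flows, congested)
new_rates = throttle(rates, aggressors)
```

In this sketch only the four endpoints feeding the congested destination are slowed, while the unrelated sender keeps its full injection rate, which is the property a targeted throttling scheme aims for.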




