Exploration of Congestion Control Techniques on Dragonfly-Class HPC Networks Through Simulation
Parallel Programming Languages and Models
Time: Monday, 15 November 2021, 11:30am - 12pm CST
Description: Keeping communication latency low in high-performance computing (HPC) networks is critical to the efficient operation of the applications that run on them. Different application operations and types of tasks, such as I/O operations, can create a variety of traffic patterns across the system interconnect. Some communication patterns, however, can be problematic for overall system performance.
One traffic pattern of particular concern is the many-to-one pattern, or incast. When packets sent from many different endpoints target a single destination, or a small number of destinations, they can overwhelm the receiving endpoints' ability to process the traffic, resulting in a cascading effect of induced congestion. This can have broad, detrimental effects on other applications whose data streams pass through the congested part of the network.
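The incast effect described above can be illustrated with a toy queue model. This is a minimal sketch under simplifying assumptions (fixed per-step injection and drain rates, a single unbounded destination queue), not a model taken from the work itself; the function name `incast_queue_depth` and all of its parameters are hypothetical.

```python
def incast_queue_depth(num_senders, inject_rate, drain_rate, steps):
    """Return the destination queue depth (in packets) after each time step.

    Assumptions (illustrative only): every sender injects `inject_rate`
    packets per step toward the same destination, and the destination
    can process at most `drain_rate` packets per step.
    """
    depth = 0
    history = []
    for _ in range(steps):
        depth += num_senders * inject_rate   # packets arriving this step
        depth = max(0, depth - drain_rate)   # packets the endpoint drains
        history.append(depth)
    return history


# Eight senders against a drain rate of four: the backlog grows every step.
print(incast_queue_depth(num_senders=8, inject_rate=1, drain_rate=4, steps=3))
# Two senders against the same drain rate: the queue stays empty.
print(incast_queue_depth(num_senders=2, inject_rate=1, drain_rate=4, steps=3))
```

Whenever the aggregate injection rate exceeds the drain rate, the backlog grows without bound in this model. In a real interconnect the buffers are finite, so the backlog instead spreads upstream into the switches, which is where traffic from unrelated applications gets caught.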
Congestion control has been explored in various HPC system technologies and is an important feature of state-of-the-art networks such as InfiniBand and the Cray Slingshot interconnect. Because access to physical, full-scale interconnects of bleeding-edge design can be challenging, and their exact mechanisms of operation are not publicly known, we look to simulation to explore congestion control techniques with a degree of flexibility not available on real-world systems.
We present and explore a congestion control mechanism that seeks to detect network congestion, identify its cause, and abate it by throttling injection at the identified aggressor endpoints. We propose, discuss, and evaluate two similar implementations of this mechanism, one in each of two different network simulators, and assess their ability to mitigate the effects of congestion on application communication performance and on general system packet latencies.
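The detect, identify, and throttle steps named above can be sketched in a few lines. This is only an illustrative assumption of how such a loop might look, not the implementation evaluated in this work: the function `throttle_aggressors`, its thresholds, and the idea of classifying aggressors by their share of a congested queue are all hypothetical choices made for the example.

```python
from collections import Counter


def throttle_aggressors(queue_sources, rates, depth_threshold,
                        share_threshold, factor=0.5):
    """Return updated injection rates after one control decision.

    queue_sources:   source endpoint id of each packet queued at a port
    rates:           dict mapping endpoint id -> current injection rate
                     (assumed to contain every id in queue_sources)
    depth_threshold: queue depth above which the port counts as congested
    share_threshold: fraction of the queue an endpoint must occupy to be
                     treated as an aggressor (hypothetical criterion)
    factor:          multiplier applied to an aggressor's injection rate
    """
    new_rates = dict(rates)
    # Detect: the port is only considered congested past the depth threshold.
    if len(queue_sources) <= depth_threshold:
        return new_rates
    # Identify: count how much of the backlog each endpoint contributed.
    counts = Counter(queue_sources)
    total = len(queue_sources)
    for src, n in counts.items():
        # Abate: throttle endpoints responsible for a large share.
        if n / total >= share_threshold:
            new_rates[src] = new_rates[src] * factor
    return new_rates


queue = ["a"] * 6 + ["b"] * 3 + ["c"]
rates = {"a": 1.0, "b": 1.0, "c": 1.0}
print(throttle_aggressors(queue, rates, depth_threshold=8, share_threshold=0.5))
```

In this sketch only endpoint `a`, which holds more than half of the congested queue, has its rate reduced; `b` and `c` keep injecting at full rate. A complete scheme would also need a way to restore an endpoint's rate once the congestion clears, which is omitted here.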