Presentation

Tools and Best Practices for Distributed Deep Learning on a Supercomputer

Presenters

Weijia Xu

Zhao Zhang

David Walling

Event Type

Tutorial

Time

Sunday, 14 November 2021, 1pm - 5pm CST

Location

Online

Description

This tutorial is a practical guide to running distributed deep learning effectively across multiple compute nodes. Deep learning (DL) has emerged as an effective analysis method and has been adopted quickly across many scientific domains in recent years. Because of its inherently high computational requirements, however, the application of DL is limited by the available computational resources. Supercomputers offer an unparalleled capacity to reduce DL training time, and high-performance computing (HPC) techniques have been used to speed up parallel DL training. Distributed deep learning therefore has great potential to augment DL applications by leveraging existing HPC clusters. We will give an overview of state-of-the-art DL frameworks and lead an interactive hands-on session to help attendees run distributed DL on the Frontera supercomputer at the Texas Advanced Computing Center. We will also discuss best practices for scaling, evaluating, and tuning performance.

Presenters

Weijia Xu

Texas Advanced Computing Center (TACC)