Tools and Best Practices for Distributed Deep Learning on a Supercomputer
Presenters
Event Type
Tutorial
Online Only
Machine Learning and Artificial Intelligence
TUT
Time
Sunday, 14 November 2021, 1pm - 5pm CST
Location
Online
Description
This tutorial is a practical guide to running distributed deep learning effectively across multiple compute nodes. Deep learning (DL) has emerged as an effective analysis method and has been adopted quickly across many scientific domains in recent years. Because of its inherently high computational requirements, however, the application of DL is limited by the available computational resources. Supercomputers offer an unparalleled capacity to reduce DL training time, and high-performance computing (HPC) techniques have been used to speed up parallel DL training. Distributed deep learning therefore has great potential to augment DL applications by leveraging existing HPC clusters. We will give an overview of state-of-the-art DL frameworks and lead an interactive hands-on session to help attendees run distributed DL on the Frontera supercomputer at the Texas Advanced Computing Center. We will also discuss best practices for scaling, evaluating, and tuning performance.