Workshop: PDSW: Sixth International Parallel Data Systems Workshop
Authors: Cong Xu, Suparna Bhattacharya, and Martin Foltin (Hewlett Packard Enterprise); Suren Byna (Lawrence Berkeley National Laboratory (LBNL)); and Paolo Faraboschi (Hewlett Packard Enterprise)
Abstract: DNN models trained with large datasets can perform rich deep learning tasks with high accuracy. However, feeding huge volumes of training data exerts significant pressure on IO subsystems, as the entire dataset is reloaded in random order on every iteration to enable convergence, with very little scope for reuse. To address this challenge, we co-optimize data tiering and iteration in DNN training for any given dataset and model with bandwidth- and convergence-conscious mini-epoch training (MET). This approach can substantially reduce the IO bandwidth required to provide sustained read throughput. Further, we introduce two different feedback mechanisms to adjust the repeating factor over each mini-epoch during training. We have evaluated three different applications with MET. Most of them work out of the box with modest MET parameters. The adaptive repeating factor design was able to gain back most of the accuracy drop due to large MET parameters.
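The iteration order implied by mini-epoch training can be illustrated with a minimal sketch. The abstract does not spell out the algorithm, so everything here is an assumption: that the dataset is split into disjoint mini-epochs, that each mini-epoch is revisited a fixed number of times (the repeating factor) in a fresh random order before training moves on, and that the hypothetical names `met_schedule`, `num_mini_epochs`, and `repeat` stand in for whatever the paper actually uses. The adaptive feedback mechanisms are not modeled.

```python
import random


def met_schedule(num_samples, num_mini_epochs, repeat, seed=0):
    """Yield (mini_epoch, pass_index, sample_order) tuples.

    Illustrative only: assumes disjoint mini-epochs and a fixed
    repeating factor, which the abstract does not confirm.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)  # random assignment of samples to mini-epochs
    size = -(-num_samples // num_mini_epochs)  # ceiling division
    for m in range(num_mini_epochs):
        chunk = indices[m * size:(m + 1) * size]
        for r in range(repeat):
            order = chunk[:]
            rng.shuffle(order)  # reshuffle within the mini-epoch on each pass
            yield m, r, order


if __name__ == "__main__":
    for m, r, order in met_schedule(num_samples=10, num_mini_epochs=2, repeat=3):
        print(f"mini-epoch {m}, pass {r}: {order}")
```

Under these assumptions, a mini-epoch needs to be fetched from the slow storage tier only once and can then be served from a faster tier for its remaining passes, which is one way the reduction in required IO bandwidth described in the abstract could arise.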