Author: Wei Der Chien (KTH Royal Institute of Technology, Sweden)
Advisors: Stefano Markidis (KTH Royal Institute of Technology, Sweden), Artur Podobas (KTH Royal Institute of Technology, Sweden)
Abstract: In recent years, HPC systems have emerged as an attractive option for accelerating large-scale Machine Learning (ML) workloads: their many GPUs and fast interconnects can dramatically improve training speed. However, data-intensive ML workloads differ significantly from traditional HPC I/O, and popular ML frameworks are not optimized for HPC hardware. At the same time, the I/O subsystems of emerging HPC machines are becoming increasingly heterogeneous (e.g., object storage, NVMe) and disaggregated (node-local storage). It is unclear how to most efficiently leverage these emerging I/O systems for both traditional HPC and emerging ML workloads. In this thesis, we tackle these challenges from two directions. First, we explore the I/O challenges of running emerging ML workloads on existing HPC hardware. To this end, we research and develop profiling tools that are coupled with ML workloads, and we illustrate how the information they provide is invaluable for I/O performance tuning and why I/O must be co-designed with the underlying workloads. Second, we explore emerging I/O subsystems, focusing on a disruptive solution: object storage. While object storage is widely used in cloud applications, it is still unclear what a suitable programming model for HPC applications would be. We develop an object-store emulator that supports parallel I/O and show that it can greatly improve I/O bandwidth compared to shared-file collective I/O. Finally, to conclude this work, we present our work-in-progress programming models for writing parallel shared files through disaggregated fast node-local storage, as well as a near-storage ML preprocessing accelerator.
Thesis Canvas: pdf