Workshop: PMBS21: The 12th International Workshop on Performance Modeling, Benchmarking and Simulation of High-Performance Computer Systems
Authors: Felippe Vieira Zacarias (Barcelona Supercomputing Center (BSC), Polytechnic University of Catalonia); Vinicius Petrucci (University of Pittsburgh; Federal University of Bahia, Brazil); and Paul Carpenter (Barcelona Supercomputing Center (BSC))
Abstract: Jobs running on HPC systems can vary dramatically in their resource requirements (e.g., memory or cores) because of intrinsic differences between applications. Since HPC applications run on a number of self-contained servers whose capacities are fixed at design time, there is often a mismatch between the provisioned resources and the needs of the submitted jobs, leading to stranded and underutilized resources. This is because HPC systems assume the prevalent server-based architecture, which couples memory and processing resources together within a server. To cope efficiently with these varying demands, disaggregated memory has been proposed to allow flexible and fine-grained allocation of memory capacity to compute jobs.
This paper takes an important step towards understanding the facets of resource allocation and job requirements on disaggregated memory systems. We analyze the implications for HPC system operation, user experience and system performance when users can overestimate their resource requirements. To conduct our studies, we leverage a disaggregated simulation infrastructure implemented on a popular HPC resource manager. Our results show that the effect on response time of doubling the memory demand can be less than 8%.