Time: Sunday, 14 November 2021, 2:53pm - 3pm CST
Description: In this work, we show that the OpenMP accelerator offloading model is sufficient to seamlessly and efficiently utilize more than a single compute node and its connected accelerators.

Without source code or compiler modifications, we run an OpenMP offload-capable program on a remote CPU or remote accelerator (e.g., a GPU) as if it were a local one. For applications that support multi-device offloading, any combination of local and remote CPUs and accelerators can be utilized simultaneously, fully transparently to the user. Our low-overhead implementation is integrated into the LLVM/OpenMP compiler infrastructure as a plugin and is publicly available (in parts) with LLVM 12 and later.
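
To illustrate what "multi-device offloading" can look like at the source level, here is a minimal sketch in standard OpenMP. It is not taken from the work itself: the function name `scale_multi_device` and the even chunking scheme are hypothetical, and the sketch assumes that remote devices would simply appear as additional device numbers reported by `omp_get_num_devices()`, which is one reading of "as if it were a local one". Built without OpenMP support, or with no devices present, the loop runs on the host.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of offload devices; 0 when built without OpenMP.
 * Assumption: remote devices, if configured, are counted here too. */
static int num_devices(void) {
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Hypothetical example: scale x[0..n) by a, giving each device one
 * contiguous chunk. With zero devices, the single chunk runs on the host. */
void scale_multi_device(double *x, int n, double a) {
    int ndev = num_devices();
    int parts = ndev > 0 ? ndev : 1;

    for (int d = 0; d < parts; ++d) {
        int lo = (int)((long)n * d / parts);
        int hi = (int)((long)n * (d + 1) / parts);
        double *chunk = x + lo;
        int len = hi - lo;

        /* Each target region is a deferred task bound to device d;
         * only this device's chunk is mapped to and from it. */
        #pragma omp target teams distribute parallel for \
            device(d) map(tofrom: chunk[0:len]) nowait
        for (int i = 0; i < len; ++i)
            chunk[i] *= a;
    }

    /* Wait for all outstanding target tasks before returning. */
    #pragma omp taskwait
}
```

Nothing in the sketch names a host or a network transport. Under the assumption above, the same source would use local or remote devices depending only on how the runtime is configured.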

To evaluate our work, we provide detailed scaling studies for two HPC proxy applications. We show perfect scaling across dozens of GPUs in multiple hosts, with effectiveness proportional to the ratio of computation time to memory transfer time.
