Early Career Lightning Talks – Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations
Author/Presenter
Event Type
Workshop
Tags
Online Only
Career Development
Diversity Equity Inclusion (DEI)
Education and Training and Outreach
HPC Community Collaboration
Registration Categories
W
Time
Sunday, 14 November 2021, 4:05pm - 4:10pm CST
Location
Online
Description
We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos removes the global barriers and exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers implicit task-parallel load balancing in addition to data-parallel load balancing, providing users the flexibility to balance between them to achieve optimal performance. Finally, Atos allows users to adapt to different use cases by controlling the kernel strategy and task-parallel granularity. We demonstrate that each of these controls is important in practice.
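
To make the contrast between BSP and a task-parallel formulation with relaxed dependencies concrete, here is a minimal sequential sketch in Python. It is not Atos code and does not use the Atos API; the function names (bfs_bsp, bfs_relaxed), the LIFO worklist, and the small example graph are all assumptions chosen for illustration. The BSP version processes one frontier per iteration and counts the barrier at the end of each level. The relaxed version pops tasks in arbitrary order and re-enqueues a vertex whenever a shorter depth is found, so it needs no per-level barrier but may do some redundant work.

```python
from collections import deque

def bfs_bsp(adj, source):
    """Level-synchronous BFS: each pass of the while loop is one level,
    and the end of the pass stands in for a global barrier."""
    depth = {source: 0}
    frontier = [source]
    barriers = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
        barriers += 1  # one synchronization point per level
    return depth, barriers

def bfs_relaxed(adj, source):
    """Worklist BFS with relaxed ordering: tasks run in whatever order
    they are popped, and a vertex is re-enqueued if its depth improves."""
    depth = {source: 0}
    worklist = deque([source])
    tasks = 0
    while worklist:
        u = worklist.pop()  # LIFO on purpose: deliberately not level order
        tasks += 1
        for v in adj[u]:
            candidate = depth[u] + 1
            if candidate < depth.get(v, float("inf")):
                depth[v] = candidate  # correct an earlier, longer estimate
                worklist.append(v)
    return depth, tasks

# Hypothetical 6-vertex directed graph (adjacency lists), for illustration only.
adj = {0: [1, 2], 1: [4], 2: [3], 3: [4], 4: [5], 5: []}

bsp_depth, barriers = bfs_bsp(adj, 0)
relaxed_depth, tasks = bfs_relaxed(adj, 0)
print("same depths:", bsp_depth == relaxed_depth)
print("BSP barriers:", barriers, "| relaxed tasks executed:", tasks)
```

Both versions reach the same depths. The relaxed version executes more tasks than there are vertices on this graph, because some vertices are first reached along a longer path and later corrected. That is the trade-off the sketch is meant to show: extra work in exchange for removing the per-level barrier. On a real GPU the worklist would be a concurrent queue shared by many workers, which this sequential sketch does not model.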

We evaluate and analyze the performance of Atos vs. BSP on two applications: breadth-first search and PageRank. On these two case studies, the Atos implementations achieve geometric-mean (geomean) speedups of 3.44x and 2.1x and peak speedups of 12.8x and 3.2x, respectively, compared to a state-of-the-art BSP GPU implementation.
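
For readers unfamiliar with the metric, a geomean speedup is the n-th root of the product of n per-dataset speedups, which keeps a single very large speedup from dominating the summary. The sketch below shows the calculation in Python. The helper name geomean and the input values are placeholders invented for illustration; they are not measurements from this work.

```python
import math

def geomean(speedups):
    """Geometric mean of positive values, computed in log space
    to avoid overflow when multiplying many numbers."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Made-up placeholder values, NOT results from the talk.
placeholder_speedups = [2.0, 8.0]
summary = geomean(placeholder_speedups)
print(f"geomean speedup: {summary:.2f}x")
```

With these placeholder inputs the arithmetic mean would be 5.0, while the geometric mean is about 4.0, which shows how the geomean tempers the influence of the larger value.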

In the future, we plan to extend this framework to multi-GPU and multi-node systems, and we expect Atos’s task-based, global-synchronization-free programming model to be more amenable to use in a distributed environment. As the number of GPUs in modern computer systems increases, the fraction of total runtime spent on communication and synchronization also increases. In a distributed context, Atos 1) removes the global barriers, largely reducing the synchronization cost; and 2) can generate fine-grained messages and send them immediately without synchronization, leading to better communication-computation overlap.