Workshop: WORKS21: 16th Workshop on Workflows in Support of Large-Scale Science
Authors: Valerie Hayot-Sasson and Tristan Glatard (Concordia University) and Ariel Rokem (University of Washington)
Abstract: To support the growing demands of neuroscience
applications, researchers are transitioning to cloud computing
for its scalable, robust and elastic infrastructure. Nevertheless,
large datasets residing in object stores may result in significant
data transfer overheads during workflow execution. Prefetching,
a method to mitigate the cost of reading in mixed workloads,
masks data transfer costs within the processing time of prior tasks.
We present an implementation of “Rolling Prefetch”, a Python
library that implements a particular form of prefetching from
the AWS S3 object store, and we quantify its benefits.
Rolling Prefetch extends S3Fs, a Python library exposing AWS
S3 functionality via a file object, to add prefetch capabilities. In
measurements of analysis performance on a 500 GB brain connectivity
dataset stored on S3, we found that prefetching provides significant
speed-ups of up to 1.86×, even in applications consisting
entirely of data loading. The observed speed-up values are
consistent with our theoretical analysis. Our results demonstrate
the usefulness of prefetching for scientific data processing on
cloud infrastructures and provide an implementation applicable
to various application domains.
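
The general idea of masking transfer time behind processing can be illustrated with a minimal Python sketch. This is not the Rolling Prefetch implementation and does not use S3Fs. The names `rolling_prefetch`, `fetch_block`, `n_blocks` and `depth` are hypothetical, and `fetch_block` stands in for whatever call retrieves one block from the object store. A background thread fetches upcoming blocks into a bounded buffer while the caller processes the current one.

```python
import queue
import threading
from typing import Callable, Iterator


def rolling_prefetch(
    fetch_block: Callable[[int], bytes],  # hypothetical: returns block i
    n_blocks: int,
    depth: int = 2,  # assumed buffer size: max blocks held ahead of the reader
) -> Iterator[bytes]:
    """Yield blocks in order while a background thread fetches ahead."""
    buf: queue.Queue = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        try:
            for i in range(n_blocks):
                # put() blocks when the buffer is full, bounding memory use
                buf.put(fetch_block(i))
        except Exception as exc:
            # Hand the error to the consumer instead of leaving it waiting
            buf.put(exc)
        else:
            buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item


if __name__ == "__main__":
    # Stand-in for a remote read: each "block" is four identical bytes
    for block in rolling_prefetch(lambda i: bytes([i]) * 4, n_blocks=3):
        print(len(block))
```

In this sketch the bounded queue is what keeps the prefetched data "rolling": the fetching thread stays at most `depth` blocks ahead of the reader, so only a small window of the dataset is held locally at any time.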