Delving Into the Abyss: A Distributed Decompression System for Indexing Compressed Repositories

SC21 Proceedings

Student: Ryan Wong (Northwestern University, University of Chicago)
Supervisor: Tyler Skluzacek (University of Chicago)

Abstract: The discovery and use of scientific data depend on descriptive metadata. Unfortunately, data lakes often contain compressed data, which is difficult to index and to extract metadata from automatically because of the storage and I/O constraints that arise when inflating and processing recursively compressed data. Here we describe Abyss, a system capable of indexing large amounts of recursively compressed data. Abyss uses a function-as-a-service architecture to execute decompression and crawling functions across multiple compute endpoints, allowing it to index data at scale and overcome storage and I/O limitations. To optimize indexing, Abyss applies methods for predicting a file's decompressed size and for batching files to remote endpoints. We present a prototype implementation of Abyss and demonstrate that it can index real-world and synthetic compressed data from a 9 TB institutional repository.
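
Illustrative sketch (not the Abyss implementation): the abstract mentions predicting a file's decompressed size and batching files to remote endpoints, but does not say how either is done. The Python below shows one possible way to combine the two ideas, under assumptions that are ours rather than the authors': the predictor is a placeholder that applies a fixed expansion ratio, every endpoint has the same scratch-storage capacity, and batching uses first-fit-decreasing bin packing. The names Batch, predict_decompressed_size, and batch_files are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Files assigned to one hypothetical endpoint in one round."""
    capacity: int                      # bytes of scratch storage available
    used: int = 0                      # predicted bytes already committed
    files: list = field(default_factory=list)


def predict_decompressed_size(compressed_size: int, ratio: float = 4.0) -> int:
    """Placeholder predictor: a fixed expansion ratio (an assumption, not Abyss's method)."""
    return int(compressed_size * ratio)


def batch_files(compressed_sizes: dict, capacity: int) -> list:
    """Pack files into batches by predicted decompressed size (first-fit decreasing)."""
    predicted = {name: predict_decompressed_size(size)
                 for name, size in compressed_sizes.items()}
    batches = []
    # Place the largest predicted files first so they claim space early.
    for name in sorted(predicted, key=predicted.get, reverse=True):
        size = predicted[name]
        if size > capacity:
            raise ValueError(f"{name} is predicted to exceed endpoint capacity")
        for batch in batches:
            if batch.used + size <= batch.capacity:
                break                  # first existing batch with enough room
        else:
            batch = Batch(capacity=capacity)   # no batch fits: open a new one
            batches.append(batch)
        batch.used += size
        batch.files.append(name)
    return batches
```

For example, calling batch_files({"a.zip": 100, "b.tar.gz": 50, "c.gz": 25, "d.zip": 75}, capacity=600) groups the four files into two batches whose predicted inflated sizes each stay within the 600-byte limit. A real system would replace the fixed ratio with a predictor informed by the compression format and would account for archives that contain further archives.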

ACM-SRC Semi-Finalist: no

Poster: PDF
Poster Summary: PDF
