Delving Into the Abyss: A Distributed Decompression System for Indexing Compressed Repositories
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Time: Thursday, 18 November 2021, 8:30am - 5pm CST
Location: Second Floor Atrium
Description: Discovery and use of scientific data depend on descriptive metadata. Unfortunately, data lakes often contain compressed data, which is difficult to index and to extract metadata from automatically, because inflating and processing recursively compressed data runs into storage and I/O constraints. Here we describe Abyss, a system capable of indexing large amounts of recursively compressed data. Abyss uses a function-as-a-service architecture to execute decompression and crawling functions across multiple compute endpoints, allowing it to index data at scale and overcome storage and I/O limitations. To optimize indexing, Abyss applies methods for predicting a file’s decompressed size and for batching files to remote endpoints. We present a prototype implementation of Abyss and demonstrate that it is capable of indexing real-world and synthetic compressed data from a 9 TB institutional repository.
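
The idea of predicting decompressed sizes and batching files to endpoints can be illustrated with a minimal sketch. This is not the Abyss implementation: the abstract does not say how sizes are predicted or how batches are formed. The fixed expansion ratio, the greedy largest-first placement, and the names `Endpoint`, `predict_decompressed_size`, and `batch_files` are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical compute endpoint with a storage budget in bytes."""
    name: str
    free_bytes: int
    batch: list = field(default_factory=list)


def predict_decompressed_size(compressed_bytes, ratio=3.0):
    # Assumption: a constant expansion ratio stands in for a real predictor.
    return int(compressed_bytes * ratio)


def batch_files(files, endpoints, ratio=3.0):
    """Greedily assign files (path -> compressed size) to endpoints.

    Files are taken largest first and offered to the endpoint with the most
    free space. A file whose predicted size does not fit is left unplaced.
    Returns (assignments, unplaced).
    """
    unplaced = []
    by_size = sorted(files.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in by_size:
        need = predict_decompressed_size(size, ratio)
        target = max(endpoints, key=lambda e: e.free_bytes)
        if target.free_bytes >= need:
            target.batch.append(path)
            target.free_bytes -= need
        else:
            unplaced.append(path)
    return {e.name: e.batch for e in endpoints}, unplaced


if __name__ == "__main__":
    endpoints = [Endpoint("a", 100), Endpoint("b", 50)]
    files = {"x.zip": 20, "y.gz": 10, "z.tar.gz": 30}
    print(batch_files(files, endpoints))
```

In this sketch, a file that fits nowhere is simply reported back to the caller; a real system would more likely defer it until space is freed or route it to a larger endpoint.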