Mitigating the Metadata Mess: Autonomous Metadata Extraction Pipelines for Large-Scale Data Repositories

SC21 Proceedings

Students: Matthew Chen (University of Illinois, University of Chicago), Erica Hsu (Carnegie Mellon University, University of Chicago)
Supervisor: Kyle Chard (University of Chicago)

Abstract: Many scientific repositories are rendered useless by their enormous size (exceeding petabytes of data across billions of files) and their lack of descriptive metadata to aid discovery, understanding, and use. Building on Xtract, a distributed metadata extraction service, we propose a scheduler designed to maximize the amount of metadata extracted from large scientific repositories under a finite compute budget. We accomplish this by using machine learning models to predict the likelihood that each metadata extractor will retrieve nonempty metadata from each file. We then feed these probabilities, along with other file attributes, into a scheduler that maximizes metadata yield over time. We demonstrate the viability of the scheduler on a real-world data repository and show improved metadata extraction performance by measuring the quality of the metadata extracted as the scheduler passes through each file-extractor pair.
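
The abstract does not specify the scheduling algorithm, so the following is only an illustrative sketch of the general idea: given a predicted probability of nonempty metadata for each file-extractor pair and a finite compute budget, pick pairs greedily by probability per unit cost. The `Task` fields, the `greedy_schedule` function, the cost units, and the sample values are all hypothetical and are not taken from Xtract or the poster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One file-extractor pair (all fields are illustrative assumptions)."""
    file: str
    extractor: str
    p_nonempty: float  # predicted probability of nonempty metadata
    cost: float        # estimated compute cost, must be > 0 (hypothetical units)


def greedy_schedule(tasks, budget):
    """Select tasks in descending order of p_nonempty / cost within the budget.

    This is a simple knapsack-style heuristic, not the authors' scheduler.
    """
    ranked = sorted(tasks, key=lambda t: t.p_nonempty / t.cost, reverse=True)
    chosen, spent = [], 0.0
    for task in ranked:
        # Skip any task that would exceed the remaining budget.
        if spent + task.cost <= budget:
            chosen.append(task)
            spent += task.cost
    return chosen


# Made-up example values, for illustration only.
tasks = [
    Task("a.csv", "tabular", 0.9, 2.0),
    Task("a.csv", "keyword", 0.3, 1.0),
    Task("b.png", "image", 0.8, 4.0),
    Task("b.png", "tabular", 0.05, 2.0),
]
plan = greedy_schedule(tasks, budget=7.0)
for task in plan:
    print(task.file, task.extractor)
```

With these sample values, the low-probability pair (`b.png`, `tabular`) is ranked last and dropped because the budget is already spent on pairs with a higher expected yield per unit cost.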

ACM-SRC Semi-Finalist: no
