Mitigating the Metadata Mess: Autonomous Metadata Extraction Pipelines for Large-Scale Data Repositories
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Time: Thursday, 18 November 2021, 8:30am - 5pm CST
Location: Second Floor Atrium
Description: Many scientific repositories are rendered useless by their enormous size (exceeding petabytes of data across billions of files) and by a lack of descriptive metadata to aid discovery, understanding, and use. Building on Xtract, a distributed metadata extraction service, we propose a scheduler designed to optimize the amount of metadata extracted from large scientific repositories subject to finite compute budgets. We accomplish this by using machine learning models to predict the likelihood that each metadata extractor can retrieve nonempty metadata from each file. We then feed these probabilities, along with other file attributes, into a scheduler that maximizes metadata yield over time. We demonstrate the viability of the scheduler on a real-world data repository and show improved metadata extraction performance by measuring the quality of the metadata extracted as the scheduler passes through each file-extractor pair.
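The scheduling idea described above can be illustrated with a minimal sketch. It is not the poster's scheduler: it assumes a simple greedy heuristic that ranks file-extractor pairs by predicted probability of nonempty metadata per unit of estimated compute cost, then fills a fixed budget. The names `Candidate`, `p_nonempty`, `cost_seconds`, `schedule`, and `expected_yield` are hypothetical, and the probabilities and costs would in practice come from the trained models and file attributes mentioned in the description.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    """One file-extractor pair (all fields are illustrative assumptions)."""
    file_path: str
    extractor: str
    p_nonempty: float    # assumed model output: P(extractor yields nonempty metadata)
    cost_seconds: float  # assumed estimate of compute cost for this pair


def schedule(candidates: Iterable[Candidate], budget_seconds: float) -> List[Candidate]:
    """Greedily pick pairs with the highest probability per unit cost.

    This is a knapsack-style heuristic, not an optimal solver: pairs are
    visited in descending order of p_nonempty / cost_seconds, and a pair is
    taken whenever it still fits in the remaining budget.
    """
    pool = list(candidates)
    for c in pool:
        if c.cost_seconds <= 0:
            raise ValueError(f"cost must be positive: {c}")
    ranked = sorted(pool, key=lambda c: c.p_nonempty / c.cost_seconds, reverse=True)

    plan: List[Candidate] = []
    remaining = budget_seconds
    for c in ranked:
        if c.cost_seconds <= remaining:
            plan.append(c)
            remaining -= c.cost_seconds
    return plan


def expected_yield(plan: Iterable[Candidate]) -> float:
    """Expected number of pairs that return nonempty metadata."""
    return sum(c.p_nonempty for c in plan)


if __name__ == "__main__":
    # Made-up inputs purely for demonstration.
    demo = [
        Candidate("a.csv", "tabular", 0.9, 2.0),
        Candidate("a.csv", "keyword", 0.2, 1.0),
        Candidate("b.png", "image", 0.6, 4.0),
        Candidate("c.txt", "keyword", 0.8, 1.0),
    ]
    for pair in schedule(demo, budget_seconds=4.0):
        print(pair.file_path, pair.extractor)
```

Ranking by probability per unit cost favors cheap, likely-productive pairs early, which matches the stated goal of maximizing yield over time under a finite budget. A real scheduler would likely also weigh the other file attributes the description mentions.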