Presentation

Mitigating the Metadata Mess: Autonomous Metadata Extraction Pipelines for Large-Scale Data Repositories

Session

ACM Student Research Competition Posters Display

Authors

Matthew Chen

Erica Hsu

Event Type

ACM Student Research Competition: Graduate Poster

ACM Student Research Competition: Undergraduate Poster

Posters

Time

Thursday, 18 November 2021, 8:30am - 5pm CST

Location

Second Floor Atrium

Description

Many scientific repositories are rendered effectively useless by their enormous size (exceeding petabytes of data across billions of files) and their lack of descriptive metadata to aid discovery, understanding, and use. Building on Xtract, a distributed metadata extraction service, we propose a scheduler designed to maximize the amount of metadata extracted from large scientific repositories under a finite compute budget. We use machine learning models to predict the likelihood that each metadata extractor will retrieve nonempty metadata from each file. We then feed these probabilities, along with other file attributes, into a scheduler that maximizes metadata yield over time. We demonstrate the viability of the scheduler on a real-world data repository and show improved extraction performance by measuring the quality of the metadata extracted as the scheduler works through each file-extractor pair.
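The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of one plausible approach: a greedy heuristic that ranks each file-extractor pair by predicted probability of nonempty metadata per unit of estimated compute cost, then fills a fixed budget. The names `Candidate`, `p_nonempty`, `est_cost`, and `schedule` are hypothetical and are not part of Xtract; the probabilities are assumed to come from a previously trained model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """One file-extractor pair (all fields are illustrative assumptions)."""
    file_path: str
    extractor: str
    p_nonempty: float  # assumed model output: P(extractor yields nonempty metadata)
    est_cost: float    # assumed estimate of compute cost, e.g. in seconds


def schedule(candidates: List[Candidate], budget: float) -> List[Candidate]:
    """Greedy heuristic: highest expected yield per unit cost first.

    This is a stand-in for the poster's scheduler, not a reproduction of it.
    """
    for c in candidates:
        if c.est_cost <= 0:
            raise ValueError(f"est_cost must be positive: {c}")

    ranked = sorted(candidates, key=lambda c: c.p_nonempty / c.est_cost, reverse=True)

    plan: List[Candidate] = []
    spent = 0.0
    for c in ranked:
        # Skip pairs that no longer fit; cheaper ones later in the ranking may still fit.
        if spent + c.est_cost <= budget:
            plan.append(c)
            spent += c.est_cost
    return plan


# Made-up example inputs, for illustration only.
candidates = [
    Candidate("a.csv", "tabular", 0.9, 2.0),
    Candidate("a.csv", "keyword", 0.2, 1.0),
    Candidate("b.nc", "netcdf", 0.8, 5.0),
    Candidate("c.png", "image", 0.6, 1.0),
]
plan = schedule(candidates, budget=4.0)
expected_yield = sum(c.p_nonempty for c in plan)
```

With these made-up inputs, the plan runs the image, tabular, and keyword extractors and skips the netCDF pair, which does not fit in the remaining budget. A greedy ratio rule like this is a common approximation for budgeted selection problems; it is not guaranteed to be optimal.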

Authors

Matthew Chen

University of Illinois

University of Chicago

Erica Hsu

Carnegie Mellon University

University of Chicago
