Data Commons and Data Ecosystems for Biomedical Data and the ML/AI Applications Over Them

Authors: Robert Grossman (University of Chicago, Open Commons Consortium), Ankit Malhotra (Amazon Web Services)

Abstract: The biomedical research community is increasingly developing data commons and other cloud-based data platforms to manage, analyze and share its large biomedical datasets, particularly large genomics and imaging datasets. Examples include the NCI Genomic Data Commons, the NCI Cancer Research Data Commons, NHLBI BioData Catalyst, and the NIBIB Medical Imaging and Data Resource Center (MIDRC). In this BOF, we provide an update on these platforms and then invite the audience to help shape a roadmap for future development, especially around cloud-based high-performance data stores integrated with machine learning and AI capabilities.

Long Description: Data commons are cloud-based software platforms that co-locate: 1) data, 2) computing infrastructure, and 3) commonly used software applications, tools and services to create an interoperable resource for managing, analyzing, integrating and sharing data with a community. Data commons are increasingly important for managing large amounts of biomedical data, and have proved especially important for large-scale genomics and imaging. Data ecosystems contain multiple data commons, cloud computing resources, knowledgebases, and other applications that can interoperate using a common set of services for authentication, authorization, and accessing and analyzing data.

There are now multiple cloud-based data commons that manage petabytes of genomic data and scalably process and harmonize the data using multiple optimized workflows. A good example is the Genomic Data Commons, which hosts over 3.5 PB of data, is used by over 50,000 unique researchers each month, and provides over 2 PB of data to the research community each month.

In the first half of the session, we will have four short presentations covering: 1) an overview of cloud-based data commons and data ecosystems, 2) applications for building and executing workflows over the data they contain, 3) using data commons to accelerate drug discovery and repurposing, and 4) using specialized cloud-based ML/AI hardware and services for deep learning.

The second half will include interactive discussions with the audience around current challenges and future features and services of interest to the user and developer communities. Some of the discussion period will be devoted to cloud-based high-performance storage and compute and their integration with machine learning and AI capabilities.

The goals of this BOF include: 1) bringing together members of the HPC community involved in all areas essential for building data commons and data ecosystems, analyzing the data they contain, and developing machine learning and AI applications over the platforms; 2) through discussions with audience members, understanding their current challenges and their interests in future features and services; and, 3) bringing together developers of data commons and data ecosystems; users of data commons and data ecosystems; and developers of ML/AI applications over them to begin to form a community.

We expect the outcomes of this BOF to be: a) increased awareness of data commons and data ecosystems and their role in biomedical data science; b) linking the HPC ML/AI community with the community building data platforms for the biomedical research community; and c) potential collaborations between these two communities.

Robert L. Grossman has led a number of successful SC BOFs in the same area (high-performance data platforms for scientific data), but has not done so in the last few years. The proposed BOF is a natural follow-up to these past BOFs.

