WORKS21: Invited Talk – FAIR Computational Workflows
Presenter
Event Type: Workshop
Online Only
Cloud and Distributed Computing
Scientific Computing
Workflows
Time: Monday, 15 November 2021, 9:10am - 10am CST
Location: Online
Description: The FAIR principles (Findable, Accessible, Interoperable, Reusable) have laid a foundation for sharing and publishing digital assets, starting with data and now extending to all digital objects, including software. The use of computational workflows has accelerated in the past few years, driven by the need for repetitive and scalable data processing, access to and exchange of processing know-how, and the desire for more reproducible (or at least transparent) and quality-assured processing methods. The COVID-19 pandemic has highlighted the value of workflows. Over 290 workflow systems are currently available, although a much smaller number are widely adopted. Because workflows are first-class, publishable research objects, it seems natural to apply FAIR principles to them. The FAIR data principles themselves originate from a desire to support automated data processing by emphasizing machine accessibility of data and metadata. As workflows have a dual role as software and explicit method description, their FAIR properties draw from both data and software principles for descriptive metadata, software metrics, and versioning. However, workflows create unique challenges, such as representing a complex lifecycle from specification to execution via a workflow system, through to the data created at the completion of the workflow. As workflows are chiefly concerned with the processing and creation of data, they have an important role to play in ensuring and supporting data FAIRification.
The work on defining and improving the FAIRness of workflows has already started. A whole ecosystem of tools, guidelines and best practices is under development to reduce the time needed to adapt, reuse and extend existing scientific workflows. For example, a fundamental tenet of FAIR is the universal availability of machine-processable metadata. The European EOSC-Life Cluster has developed a metadata framework for FAIR workflows based on schema.org, RO-Crate and Common Workflow Language (CWL), and uses the GA4GH TRS API as a standardised communication protocol to support Accessibility. It has developed and runs the WorkflowHub registry, which uses both the framework and the protocol to support workflow Findability. EOSC-Life has made great efforts to on-board community workflow platforms such as Galaxy, Snakemake, Nextflow and CWL to carry and use FAIR metadata for discovery and reuse. As FAIR software needs to be usable and not just reusable, EOSC-Life has also developed services for, e.g., workflow testing (LifeMonitor), execution and benchmarking.
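As an illustration of the kind of machine-processable metadata described above, the following Python sketch assembles a minimal RO-Crate-style JSON-LD document for a single CWL workflow. It is not the EOSC-Life framework itself: the function name `build_workflow_crate`, the file name `workflow.cwl`, the workflow title and the licence URL are hypothetical, and the entity layout follows general RO-Crate 1.1 conventions (a metadata descriptor, a root dataset, and a workflow file typed as `ComputationalWorkflow`) as an assumption about what such a record might contain.

```python
import json


def build_workflow_crate(workflow_path, name, license_url):
    """Return a minimal RO-Crate-style metadata document for one workflow file.

    All argument values are supplied by the caller; nothing here is taken
    from a real registry entry.
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                # Metadata descriptor: says which entity is the crate root.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                # Root dataset: carries name, licence and the main workflow.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "license": {"@id": license_url},
                "mainEntity": {"@id": workflow_path},
                "hasPart": [{"@id": workflow_path}],
            },
            {
                # The workflow itself, described with schema.org-based types.
                "@id": workflow_path,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "programmingLanguage": {"@id": "#cwl"},
            },
            {
                # The workflow language, so tools can pick a suitable engine.
                "@id": "#cwl",
                "@type": "ComputerLanguage",
                "name": "Common Workflow Language",
                "url": {"@id": "https://www.commonwl.org/"},
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    crate = build_workflow_crate(
        "workflow.cwl",
        "Example workflow",
        "https://spdx.org/licenses/Apache-2.0",
    )
    print(json.dumps(crate, indent=2))
```

Keeping the record as plain JSON-LD means a registry or a workflow platform could read the same file without depending on any particular workflow system, which is the point of attaching the metadata to the workflow rather than to the engine.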
The Interoperability principle is the hardest to unpack for both data and software. For workflows, interoperability follows two threads: (i) supporting workflow system interoperability through workflow descriptions independent of the underlying system (e.g. CWL and WDL) and (ii) workflow component composability. Workflows are ideally composed of modular building blocks, and both the blocks and the workflows themselves are expected to be reused, refactored, recycled and remixed. Thus, FAIR applies "all the way down": at the specification and execution level, and for the whole workflow and each of its components. Composability also relates to reuse, that is, to adaptation: a workflow or its component “can be understood, modified, built upon or incorporated into other workflow”. Reuse challenges also include being able to capture and then move workflow components, dependencies, and application environments in such a way as not to affect the resulting execution of the workflow. Interoperability and Reusability place important obligations on software developers to ensure that tools and datasets are workflow-ready, with clean programmatic I/O interfaces, no usage restrictions and use of community data standards, and that they are simple to install and designed for portability. Workflow developers can be both data-FAIR, by using and creating identifiers, licensing data outputs, tracking data provenance and so on, and workflow-FAIR, by managing versions, providing test data, and sharing libraries of composable and reusable workflow “blocks”. Communities are working on reviewing, validating and certifying canonical workflows.
While there are emerging tools for addressing different aspects of FAIR workflows, many challenges remain for describing, annotating, and exposing scientific workflows so that they can be found, understood and reused by other scientists. Further work is required to understand use cases for reuse and to enable reuse in the same or different environments. The FAIR principles for workflows need to be community-agreed before metrics can be considered to determine whether a workflow is FAIR, whether a workflow repository or registry is FAIR, and whether it is possible to automatically review whether a workflow’s dataflow is FAIR. Community activism, perhaps led by the platforms and registries coming together in a community group like WorkflowsRI, is needed to define principles, policies and best practices for FAIR workflows and to standardize metadata representation and collection processes. In this talk I will present current work on FAIR principles, practices and services for computational workflows, drawing on developments in the European EOSC-Life Workflow Collaboratory and the BioExcel Centre of Excellence.
