IMPaCT-Data is the IMPaCT program that aims to support the development of a common, interoperable and integrated system for the collection and analysis of clinical and molecular data by providing the knowledge and resources available in the Spanish Science and Technology System. This development will make it possible to answer research questions based on the different clinical and molecular information systems available. Fundamentally, it aims to provide researchers with a population perspective based on individual data.
The IMPaCT-Data project is divided into different work packages (WP). In the context of IMPaCT-Data WP3 (Genomics), a working group of experts worked on the generation of a specific quality control (QC) workflow for germline exome samples.
To achieve this, a set of metrics related to human genomic data was decided upon, and the toolset or software to extract these metrics was implemented in an existing variant calling workflow called Sarek, part of the nf-core community. The final outcome is a Nextflow subworkflow, called IMPaCT-QC implemented in the Sarek pipeline.
In the context of IMPaCT-Data WP3 (Genomics), a working group of experts worked towards the generation of a quality control (QC) workflow specific for germline exome samples. The final product consists of a modified version of the well-known Nextflow workflow Sarek. This pipeline has been modified adding new QC metrics and tools to obtain a broader sight of the quality details of the data and the pipeline. All the new implemented features were selected by this group of experts within the IMPaCT-Data.
The goal of the IMPaCT-Data WP3 (Genomics), was to modify the Sarek workflow in order to implement new metrics and obtain a pipeline able to detect variants and also with a really good and deep QC.
To achieve this, a set of metrics related to human genomic data was decided upon, and the toolset or software to extract these metrics was implemented in an existing variant calling workflow called Sarek, part of the nf-core community.
nf-core/sarek is a workflow designed to detect variants on whole genome or targeted sequencing data. Initially designed for Human, and Mouse, it can work on any species with a reference genome. Sarek can also handle tumour/normal pairs and could include additional relapses.
Aligned with the FAIR principles, the pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
For all these reasons, IMPaCT-Data decided to use this existing workflow and implement the new QC pipeline in it. Merging all the excellent characteristics from both parts to create a product able to help all scientists.
Follow these instructions from Sarek, usage and usage.md, and from IMPaCT QC subworkflow, follow usage.md.
For more details about the output files and reports, please refer to the output folder. To see the results of an example test run with a test dataset refer to the results folder.
To perform a test with a test dataset see the tests folder.
Take into account that all the modules in the IMPaCT QC workflow can be disabled by adding the name of the module in the --skip_tools
parameter. Note that adding impactqc
will skip the full IMPaCT QC workflow (skips all modules).
- collectinsertsizemetrics
- collecthsmetrics
- fastqscreen
- flagstat
- imapctqc
- sexdeterrmine
- somalier
Link to the supporting documentation associated to the QC metrics added in this workflow: