Release 1.1.0 (#11)

* feat: add modules
* feat: add docker image
* fix: swap to wave containers
* feat: update schema and config
* feat: add modules to workflow
* fix: usage of vcf2mat
* fix: remove .view and add more comments explaining the code
* feat: add gvcf to vcf conversion
* feat: add GATK iGenomes
* feat: add tabix
* update docs
* update input schema
* update sbwf
* update schemas
* add pipeline tests
* feat: add correct tests
* change access to index bool
* prettier to hopefullly fix linting?
* fix indentation
* add env to local module
* fix linting for first version
* add nf-test and ignore external configs
* wip on nf-test
* finalize tests
* Template update for nf-core/tools version 3.0.2
* Template update for nf-core/tools version 3.1.0
* Apply suggestions from code review
* fix: filename collision by using different subsets of the filename (#4)
* Fix/filenamecoll (#5)
* First release :) (#1)
* feat: add modules
* feat: add docker image
* fix: swap to wave containers
* feat: update schema and config
* feat: add modules to workflow
* fix: usage of vcf2mat
* fix: remove .view and add more comments explaining the code
* feat: add gvcf to vcf conversion
* feat: add GATK iGenomes
* feat: add tabix
* update docs
* update input schema
* update sbwf
* update schemas
* add pipeline tests
* feat: add correct tests
* change access to index bool
* prettier to hopefullly fix linting?
* fix indentation
* add env to local module
* fix linting for first version
* add nf-test and ignore external configs
* wip on nf-test
* finalize tests
* Template update for nf-core/tools version 3.0.2
* Template update for nf-core/tools version 3.1.0
* Apply suggestions from code review
* Change famosab to qbic-pipelines after transfer (#2)
* Update README.md after transfer
* change famosab to qbic-pipelines
* update main
* fix: filename collision by using filebasename and include modules config again
* fix: update snaps
* prettier
* add sample names to columns and restructure (#7)
* add sample names and restructure
* add new param
* prettier
* expand docs
* remove weird param
* bump-version to dev
* prettier
* tests
* add concatenation of same saples with same label (#8)
* add concatenation of same saples with same label
* update changelog and docs
* maybe we need to switch to a subway map soon
* try other ci file
* try other ci file
* fix name
* remove ci from linting
* update pipeline level tests
* ignore modules
* add nft-vcf
* update ci and snaps
* modify
* revert
* prepare release 1.1.0 (#10)
* prepare release
* version correction
* prettier
* fix schema
* remove dev

Co-authored-by: Daniel Straub <[email protected]>

* add forgotten explanations to output
* prettier

---------

Co-authored-by: Daniel Straub <[email protected]>
qbic-pipelines · Jan 8, 2025 · e93735a
1 parent bd07ec4
commit e93735a
Showing 46 changed files with 2,486 additions and 187 deletions.
diff --git a/.gitignore b/.gitignore
@@ -10,3 +10,11 @@ null/
 .nf-test
 .nf-test*
 .nf-test/*
+
+.vscode
+.vscode/*
+
+tests/unmergedgvcfs
+tests/unmergedgvcfs/*
+tests/input-full-ncgm.csv
+conf/test_full_ncgm.config
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -18,6 +18,7 @@ lint:
     - docs/images/nf-core-vcftomat_logo_dark.png
     - .github/ISSUE_TEMPLATE/bug_report.yml
   included_configs: false
+  actions_ci: false
   multiqc_config:
     - report_comment
   nextflow_config:
@@ -30,7 +31,7 @@ lint:
 nf_core_version: 3.1.0
 repository_type: pipeline
 template:
-  author: "Famke B\xE4uerle, Dorothy Ellis"
+  author: "Famke Bäuerle, Dorothy Ellis"
   description: Nextflow pipeline to convert (g)vcfs to matrices suitable for statistical
     analysis
   force: false
@@ -43,4 +44,4 @@ template:
     - codespaces
     - fastqc
     - adaptivecard
-  version: 1.0.0dev
+  version: 1.1.0
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,18 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## v1.1.0 - Newton Puccoon - 08.01.2025
+
+### Added
+
+- [#7](https://github.com/qbic-pipelines/vcftomat/pull/7) - samplenames to columns
+- [#8](https://github.com/qbic-pipelines/vcftomat/pull/8) - concat for sample, label pairs
+
+### Fixed
+
+- [#5](https://github.com/qbic-pipelines/vcftomat/pull/5) - filename collision
+- [#10](https://github.com/qbic-pipelines/vcftomat/pull/10) - prepare release 1.1.0
+
 ## v1.0.0 - Curie Purpureal - 16.12.2024
 
 Initial release of qbic-pipelines/vcftomat, created with the [nf-core](https://nf-co.re/) template.
diff --git a/README.md b/README.md
@@ -16,9 +16,11 @@
 
 1. Indexes (g.)vcf files ([`tabix`](http://www.htslib.org/doc/tabix.html))
 2. Converts g.vcf files to vcf with `genotypegvcf` ([`GATK`](https://gatk.broadinstitute.org/hc/en-us))
-3. Merges all vcfs from the same sample with `bcftools/merge` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
-4. Converts the (merged) vcfs to a matrix using a custom R script written by @ellisdoro ([`R`](https://www.r-project.org/))
-5. Collects all reports into a MultiQC report ([`MultiQC`](http://multiqc.info/))
+3. Concatenates all vcfs that have the same id and the same label with `bcftools/concat` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
+4. Changes the sample name in the vcf file to the label with `bcftools/reheader` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)). This can be turned off by adding `--rename false` to the `nextflow run` command.
+5. Merges all vcfs from the same sample with `bcftools/merge` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
+6. Converts the (merged) vcfs to a matrix using a custom R script written by @ellisdoro ([`R`](https://www.r-project.org/))
+7. Collects all reports into a MultiQC report ([`MultiQC`](http://multiqc.info/))
 
 ![](./docs/images/vcftomat.excalidraw.png)
 
@@ -32,13 +34,14 @@ First, prepare a samplesheet with your input data that looks as follows:
 `samplesheet.csv`:
 
 ```csv
-sample,gvcf,vcf_path,vcf_index_path
-SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
-SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
-SAMPLE-2,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
+sample,label,gvcf,vcf_path,vcf_index_path
+SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
+SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
+SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
+SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
 ```
 
-Each row represents a VCF file coming from a sample. The `gvcf` column indicates whether the file is a g.vcf file or not. The `vcf_path` and `vcf_index_path` columns contain the path to the VCF file and its index, respectively.
+Each row represents a VCF file coming from a sample. The `label` column enables concatenation of vcfs (for example when the pipeline produces different vcfs for chrM and chrY). The `gvcf` column indicates whether the file is a g.vcf file or not. The `vcf_path` and `vcf_index_path` columns contain the path to the VCF file and its index, respectively.
 
 Now, you can run the pipeline using:
 

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,6 +1,6 @@
 report_comment: >
-  This report has been generated by the <a href="https://github.com/qbic-pipelines/vcftomat/releases/tag/1.0.0" target="_blank">qbic-pipelines/vcftomat</a>
-  analysis pipeline.
+  This report has been generated by the <a href="https://github.com/qbic-pipelines/vcftomat/releases/tag/1.1.0"
+  target="_blank">qbic-pipelines/vcftomat</a> analysis pipeline.
 report_section_order:
   "qbic-pipelines-vcftomat-methods-description":
     order: -1000

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,4 +1,5 @@
-sample,gvcf,vcf_path,vcf_index_path
-SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
-SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
-SAMPLE-2,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
+sample,label,gvcf,vcf_path,vcf_index_path
+SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
+SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
+SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
+SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
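
The grouping implied by the new `label` column can be sketched as follows, based only on the README text above: rows that share both `sample` and `label` are concatenated, and the per-label results of one `sample` are then merged. This is an illustration, not the pipeline's code (which is written in Nextflow); the function name `plan_groups` and the printed wording are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Example samplesheet copied from the diff above; the paths are placeholders.
SAMPLESHEET = """\
sample,label,gvcf,vcf_path,vcf_index_path
SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
"""


def plan_groups(samplesheet_text):
    """Return (concat_groups, merge_groups) for a samplesheet.

    concat_groups maps (sample, label) -> list of VCF paths.
    merge_groups maps sample -> sorted list of labels.
    """
    concat_groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        concat_groups[(row["sample"], row["label"])].append(row["vcf_path"])

    merge_groups = defaultdict(list)
    for sample, label in sorted(concat_groups):
        merge_groups[sample].append(label)
    return dict(concat_groups), dict(merge_groups)


concat_groups, merge_groups = plan_groups(SAMPLESHEET)

# Rows with the same (sample, label) pair form one concatenation group.
for (sample, label), paths in concat_groups.items():
    print(f"{sample} / {label}: {len(paths)} file(s)")

# All labels of one sample end up in the same merge group.
for sample, labels in merge_groups.items():
    print(f"{sample}: {len(labels)} label(s) -> {labels}")
```

In this example, the two `SAMPLE-2` rows share a label and form one concatenation group, while the two `SAMPLE-1` rows carry different labels and only meet at the merge step.
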
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -13,6 +13,12 @@
                 "errorMessage": "Sample name must be provided and cannot contain spaces",
                 "meta": ["id"]
             },
+            "label": {
+                "type": "string",
+                "pattern": "^\\S+$",
+                "errorMessage": "Label must be provided and cannot contain spaces",
+                "meta": ["label"]
+            },
             "gvcf": {
                 "type": "boolean",
                 "errorMessage": "",
@@ -40,6 +46,6 @@
                 "errorMessage": "Index of VCF file must have extension '.tbi'- Optional"
             }
         },
-        "required": ["sample", "gvcf", "vcf_path"]
+        "required": ["sample", "label", "gvcf", "vcf_path"]
     }
 }
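
To illustrate what the updated schema accepts, here is a small sketch that applies the `label` pattern and the new `required` list to a single row. The pattern, field names, and error message are copied from the schema above; the helper `check_row` is hypothetical and only mirrors the rule for illustration, since the pipeline itself validates against the JSON schema.

```python
import re

# Pattern and required fields copied from assets/schema_input.json.
LABEL_PATTERN = re.compile(r"^\S+$")
REQUIRED = ["sample", "label", "gvcf", "vcf_path"]


def check_row(row):
    """Return a list of problems found in one samplesheet row (a dict)."""
    problems = [
        f"missing required field: {key}"
        for key in REQUIRED
        if row.get(key) in (None, "")
    ]
    label = row.get("label")
    # fullmatch rejects any label that contains whitespace.
    if label and not LABEL_PATTERN.fullmatch(label):
        problems.append("Label must be provided and cannot contain spaces")
    return problems


good = {"sample": "SAMPLE-1", "label": "pipelineA-callerA",
        "gvcf": "false", "vcf_path": "path/to/vcf.gz"}
bad = {"sample": "SAMPLE-1", "label": "pipelineA callerA",
       "gvcf": "false", "vcf_path": "path/to/vcf.gz"}

print(check_row(good))
print(check_row(bad))
```
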
diff --git a/conf/modules.config b/conf/modules.config
@@ -23,13 +23,38 @@ process {
     }
 
     withName: 'GATK4_GENOTYPEGVCFS' {
-        ext.prefix = { "${input.baseName.tokenize('.')[0]}" }
+        ext.prefix = { "${meta.name}" }
+    }
+
+    withName: 'BCFTOOLS_CONCAT' {
+        memory     = 8.GB
+        ext.prefix = { "${meta.label}.concat" }
+        ext.args   = { " --allow-overlaps --output-type z --write-index=tbi" }
+        publishDir = [
+                mode: params.publish_dir_mode,
+                path: { "${params.outdir}/bcftools/concat/" },
+            ]
+    }
+
+    withName: 'BCFTOOLS_REHEADER' {
+        beforeScript = { "echo ${meta.label} > ${meta.label}.txt" }
+        ext.args     = { "--samples ${meta.label}.txt" }
+        ext.prefix   = { "${meta.label}.reheader" }
+        ext.args2    = { "--output-type z --write-index=tbi" }
+        publishDir = [
+                mode: params.publish_dir_mode,
+                path: { "${params.outdir}/bcftools/reheader/" },
+            ]
     }
 
     withName: 'BCFTOOLS_MERGE' {
-        memory = 8.GB
-        ext.args = { '--force-samples' }
-        ext.prefix = { "${meta.id}.merged" }
+        memory     = 8.GB
+        ext.args   = { "--force-samples --output-type z --write-index=tbi" }
+        ext.prefix = { "${meta.id}.merge" }
+        publishDir = [
+                mode: params.publish_dir_mode,
+                path: { "${params.outdir}/bcftools/merge/" },
+            ]
     }
 
     withName: 'MULTIQC' {

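One plausible way the removed `GATK4_GENOTYPEGVCFS` prefix could produce the filename collision named in the commit message is sketched below. The Python function imitates the old Groovy expression `input.baseName.tokenize('.')[0]`; the filenames are hypothetical and not taken from the repository, and whether this exact case triggered the fix is an assumption.

```python
from pathlib import PurePosixPath


def old_prefix(filename):
    """Mimic the removed expression input.baseName.tokenize('.')[0]."""
    # Groovy's baseName drops only the last extension.
    base_name = PurePosixPath(filename).stem
    # tokenize('.') splits on dots and discards empty tokens.
    tokens = [token for token in base_name.split(".") if token]
    return tokens[0]


# Hypothetical inputs: same leading token, different callers.
files = ["patient1.callerA.g.vcf.gz", "patient1.callerB.g.vcf.gz"]
prefixes = [old_prefix(name) for name in files]
print(prefixes)
print("collision:", len(set(prefixes)) < len(files))
```

Because only the first dot-separated token survives, two inputs that differ later in the name map to the same output prefix. Taking the prefix from sample metadata, as the new config does, avoids depending on the filename layout.
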
diff --git a/docs/images/vcftomat.excalidraw.png b/docs/images/vcftomat.excalidraw.png
diff --git a/docs/output.md b/docs/output.md
@@ -6,27 +6,43 @@ This document describes the output produced by the pipeline. Most of the plots a
 
 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
 
-<!-- TODO nf-core: Write this documentation describing your workflow's output -->
-
 ## Pipeline overview
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
 - [Tabix](#tabix) - Indexes (g.)vcf files
 - [GenotypeGVCFs](#genotypegvcfs) - Converts g.vcf files to vcf with GATK
+- [Concatenate VCFs](#concatenate-vcfs) - Concatenates all vcfs that have the same id and the same label with bcftools/concat
+- [Rename Samples](#rename-samples) - Changes the sample name in the vcf file to the label with bcftools/reheader
 - [Merge VCFs](#merge-vcfs) - Merges all vcfs from the same sample with bcftools/merge
 - [Convert to matrix](#convert-to-matrix) - Converts the (merged) vcfs to a matrix using a custom R script written by @ellisdoro
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
 ### Tabix
 
+Tabix generates index files with the `.tbi` extension for all (g.)vcf files that are passed to the pipeline without an index.
+
 ### GenotypeGVCFs
 
+The GATK GenotypeGVCFs module converts genomic VCF (gVCF) files into regular VCF files. The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not.
+
+### Concatenate VCFs
+
+Some variant calling pipelines return multiple (g)VCF files for one patient. `bcftools concat` is used to combine these files into a single VCF.
+
+### Rename Samples
+
+To enable comparison of the final CSV files, `bcftools reheader` can be used to rename the sample in each VCF from the generic name given by the variant caller to the custom label given in the samplesheet.
+
 ### Merge VCFs
 
+To enable comparison of different variant callers or variant calling pipelines, all VCFs that come from the same sample are merged based on the sample ID submitted by the user.
+
 ### Convert to matrix
 
+A custom R script is used to convert the final VCF to a CSV file that can be used for downstream analysis. The script was written by [Dorothy Ellis](https://github.com/ellisdoro).
+
 ### MultiQC
 
 <details markdown="1">

diff --git a/docs/usage.md b/docs/usage.md
@@ -19,15 +19,17 @@ You will need to create a samplesheet with information about the samples you wou
 The `sample` identifiers have to be the same when the vcfs originate from the same bam but were yielded with different callers. The pipeline will merge all vcfs from the same sample into one vcf file but is also able to handle if there is only one vcf file for a sample (merging will then be skipped).
 
 ```csv title="samplesheet.csv"
-sample,gvcf,vcf_path,vcf_index_path
-SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
-SAMPLE-1,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
-SAMPLE-2,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
+sample,label,gvcf,vcf_path,vcf_index_path
+SAMPLE-1,pipelineA-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
+SAMPLE-1,pipelineB-callerA,false,path/to/vcf.gz,path/to/.vcf.gz.tbi
+SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
+SAMPLE-2,pipelineB-callerB,true,path/to/g.vcf.gz,path/to/g.vcf.gz.tbi
 ```
 
 | Column           | Description                                                                                                                                                                                                  |
 | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `sample`         | Custom sample name. This entry will be identical for vcfs that originate from the same bam but were yielded with different callers. Spaces in sample names are automatically converted to underscores (`_`). |
+| `label`          | Label for the vcf file. This is used to concatenate vcfs with the same label.                                                                                                                                |
 | `gvcf`           | Boolean whether the supplied sample is a gvcf (true) or a normal vcf (false).                                                                                                                                |
 | `vcf_path`       | Full path to VCF file, should have the extension ".g.vcf.gz", ".vcf.gz", ".g.vcf" or ".vcf".                                                                                                                 |
 | `vcf_index_path` | Full path to index of (g)VCF file. Optional. Should have extension ".tbi".                                                                                                                                   |
@@ -39,7 +41,7 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
 The typical command for running the pipeline is as follows:
 
 ```bash
-nextflow run qbic-pipelines/vcftomat --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh38 -profile docker
+nextflow run qbic-pipelines/vcftomat --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh38 --rename true -profile docker
 ```
 
 This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
@@ -69,10 +71,9 @@ nextflow run qbic-pipelines/vcftomat -profile docker -params-file params.yaml
 with:
 
 ```yaml title="params.yaml"
-input: './samplesheet.csv'
-outdir: './results/'
-genome: 'GATK.GRCh38'
-<...>
+input: "./samplesheet.csv"
+outdir: "./results/"
+genome: "GATK.GRCh38"
 ```
 
 You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).

diff --git a/modules.json b/modules.json
@@ -5,11 +5,21 @@
         "https://github.com/nf-core/modules.git": {
             "modules": {
                 "nf-core": {
+                    "bcftools/concat": {
+                        "branch": "master",
+                        "git_sha": "d1e0ec7670fa77905a378627232566ce54c3c26d",
+                        "installed_by": ["modules"]
+                    },
                     "bcftools/merge": {
                         "branch": "master",
                         "git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
                         "installed_by": ["modules"]
                     },
+                    "bcftools/reheader": {
+                        "branch": "master",
+                        "git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
+                        "installed_by": ["modules"]
+                    },
                     "gatk4/genotypegvcfs": {
                         "branch": "master",
                         "git_sha": "1999eff2c530b2b185a25cc42117a1686f09b685",

diff --git a/modules/nf-core/bcftools/concat/environment.yml b/modules/nf-core/bcftools/concat/environment.yml
diff --git a/modules/nf-core/bcftools/concat/main.nf b/modules/nf-core/bcftools/concat/main.nf