Merge pull request #17 from mskcc-omics-workflows/feature/enhancements_v2

Enhancements for nucleovar v2
mskcc-omics-workflows · Nov 20, 2024 · 29df9db
2 parents 64c65b2 + 599e761
commit 29df9db

Showing 32 changed files with 949 additions and 381 deletions.
diff --git a/README.md b/README.md
@@ -3,83 +3,78 @@
 [![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
-[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
+
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
+
 [![Launch on Seqera Platform](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Seqera%20Platform-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/msk/nucleovar)
 
 ## Introduction
 
 **msk/nucleovar** is a bioinformatics pipeline that ...
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+The pipeline processes sample BAM files through three variant callers (Mutect v.1.1.5, VarDict, and GATK Mutect2). The output VCF files are normalized, sorted, and concatenated, then annotated and converted into a MAF file. That MAF file is tagged for the presence or absence of specific variant criteria, and the final output MAF file contains variants filtered by the criteria set by the ACCESS pipeline.
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
-
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+1. Read in the core samplesheet (containing case and control samples) and the auxiliary BAMs samplesheet.
+2. Run case and control samples through the variant callers (Mutect v.1.1.5, VarDict, and GATK Mutect2).
+3. Normalize, sort, concatenate, and annotate the output VCF files from each variant caller using the BCFtools suite.
+4. Convert the output VCF file into MAF format and annotate it using Genome Nexus (an option is provided to invoke the Perl VCF2MAF script as well).
+5. Tag the output MAF file using MSK ACCESS pipeline criteria (presence of hotspots and removal of specific variant annotations).
+6. Run the tagged MAF file through a traceback subworkflow, which tags the file with the presence of genotypes and performs specific tagging and concatenation.
+7. Filter the tagged output file based on the criteria set by the ACCESS_filters script.
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
-
 First, prepare a samplesheet with your input data that looks as follows:
 
-`samplesheet.csv`:
+`core_samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+patient_id,sample_id,type,maf,duplex_bam,duplex_bai,simplex_bam,simplex_bai
+PATIENT1,SAMPLE1,case,null,path/to/duplex.bam,path/to/duplex.bai,path/to/simplex.bam,path/to/simplex.bai
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
+Each row represents an individual case or control sample.
 
--->
+`aux_bams_samplesheet.csv`:
 
-Now, you can run the pipeline using:
+```csv
+sample_id,normal_path,duplex_path,simplex_path,type
+SAMPLE1,/path/to/normal.bam,path/to/duplex.bam,path/to/simplex.bam,curated
+```
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
+Each row represents an individual sample, which has either a standard BAM (for an unmatched or matched normal sample) or a simplex and duplex BAM pair (for a curated or plasma sample).
+
+Now, you can run the pipeline using:
 
 ```bash
-nextflow run msk/nucleovar \
+nextflow run msk/nucleovar/main.nf \
+   --input core_samplesheet.csv \
+   --aux_bams aux_bams_samplesheet.csv \
+   --rules_json rules.json \
    -profile <docker/singularity/.../institute> \
-   --input samplesheet.csv \
    --outdir <OUTDIR>
 ```
 
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
 > see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
 
-## Credits
+## Credits
 
-msk/nucleovar was originally written by @buehler.
+msk/nucleovar was originally written by @rnaidu and @buehlere.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
-
 ## Contributions and Support
 
 If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
 
 ## Citations
 
-<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
-<!-- If you use msk/nucleovar for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
-
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
-
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).
@@ -90,4 +85,9 @@ This pipeline uses code and infrastructure developed and maintained by the [nf-c
 >
 > # _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).
 
-# Page
+## Workflow Diagram
+
+![Legend](https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/docs/docs/images/legend.png)
+![Workflow Diagram Pt 1](https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/docs/docs/images/pt1.png)
+![Workflow Diagram Pt 2](https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/docs/docs/images/pt2.png)
+![Workflow Diagram Pt 3](https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/docs/docs/images/pt3.png)
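
Note on the README hunk above: the two samplesheet layouts it documents (`core_samplesheet.csv` and `aux_bams_samplesheet.csv`) can be sanity-checked before launching a run. The following is a minimal Python sketch, not part of this PR or of the pipeline. The column lists are copied from the README examples. The function name `check_samplesheet` and the set `ASSUMED_CORE_TYPES` are hypothetical, and the assumption that the core sheet's `type` column holds the literal strings `case` or `control` goes beyond the README, whose example only shows `case`.

```python
import csv
import io

# Column layouts copied from the README examples in this PR.
CORE_COLUMNS = ["patient_id", "sample_id", "type", "maf",
                "duplex_bam", "duplex_bai", "simplex_bam", "simplex_bai"]
AUX_COLUMNS = ["sample_id", "normal_path", "duplex_path", "simplex_path", "type"]

# Assumption: the core sheet's `type` column is either "case" or "control".
# Only "case" appears in the README example.
ASSUMED_CORE_TYPES = {"case", "control"}


def check_samplesheet(text, expected_columns, allowed_types=None):
    """Return a list of human-readable problems found in a CSV samplesheet."""
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    # The header must match the documented column order exactly.
    if reader.fieldnames != expected_columns:
        problems.append(
            f"header mismatch: expected {expected_columns}, got {reader.fieldnames}"
        )
        return problems
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        if not row["sample_id"]:
            problems.append(f"line {line_no}: empty sample_id")
        if allowed_types is not None and row["type"] not in allowed_types:
            problems.append(f"line {line_no}: unexpected type {row['type']!r}")
    return problems


# Usage with the example row from the README.
core = (
    "patient_id,sample_id,type,maf,duplex_bam,duplex_bai,simplex_bam,simplex_bai\n"
    "PATIENT1,SAMPLE1,case,null,path/to/duplex.bam,path/to/duplex.bai,"
    "path/to/simplex.bam,path/to/simplex.bai\n"
)
print(check_samplesheet(core, CORE_COLUMNS, ASSUMED_CORE_TYPES))  # prints []
```

The auxiliary sheet can be checked the same way with `AUX_COLUMNS`. No `allowed_types` set is passed for it here, because the README does not list the exact strings permitted in its `type` column beyond the `curated` example.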
diff --git a/conf/juno.config b/conf/juno.config
@@ -1,16 +1,17 @@
-    /*
-    * -------------------------------------------------
-    * Juno config
-    * -------------------------------------------------
-    * Pipeline parameters specific to running Tempo with LSF on the Juno cluster at MSKCC
-    * -------------------------------------------------
-    */
+/*
+* -------------------------------------------------
+* Juno config
+* -------------------------------------------------
+* Pipeline parameters specific to running Tempo with LSF on the Juno cluster at MSKCC
+* -------------------------------------------------
+*/
 
-    executor {
-        name = "lsf"
-    }
+executor {
+    name = "lsf"
+}
+
+process {
+    clusterOptions = ""
+    scratch=false
+}
 
-    process {
-        clusterOptions = ""
-        scratch=false
-    }
diff --git a/conf/stub.config b/conf/stub.config
@@ -29,7 +29,6 @@ params {
     target_bed = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/bed/test.bed"
     dict = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/genome.dict"
     sample_order_file = "https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/feature/call_variants/tests/resources/stub/stub_sample_order.txt"
-    header_file = "https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/feature/call_variants/tests/resources/stub/stub_mutect_annotate_concat_header.txt"
     blocklist = "https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/feature/call_variants/tests/resources/stub/stub_blocklist.txt"
     canonical_tx_ref = "https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/feature/call_variants/tests/resources/stub/stub_canonical_target_tx_ref.tsv"
     hotspots = "https://raw.githubusercontent.com/mskcc-omics-workflows/nucleovar/nucleovar_v2_fixes/tests/resources/stub/stub_hotspots.maf"

diff --git a/conf/test.config b/conf/test.config
@@ -28,7 +28,6 @@ params {
     canonical_bed = '/juno/cmo/access/production/resources/msk-access/v1.0/regions_of_interest/versions/v1.0/MSK-ACCESS-v1_0panelA_canonicaltargets_500buffer.bed'
     target_bed = '/juno/work/access/production/resources/msk-access/v1.0/regions_of_interest/versions/v1.0/MSK-ACCESS-v1_0panelA_canonicaltargets_500buffer.bed'
     dict = '/juno/cmo/access/production/resources/reference/current/Homo_sapiens_assembly19.dict'
-    header_file = "/juno/work/access/production/resources/nucleovar/mutect1_annotate_concat_header.txt"
     blocklist = "/juno/work/access/production/resources/nucleovar/access_blocklist.txt"
     canonical_tx_ref = "/juno/work/access/production/resources/nucleovar/canonical_target_tx_ref.tsv"
     hotspots = "/juno/work/access/production/resources/nucleovar/hotspots.maf"

diff --git a/docs/images/legend.png b/docs/images/legend.png
diff --git a/docs/images/pt1.png b/docs/images/pt1.png
diff --git a/docs/images/pt2.png b/docs/images/pt2.png
diff --git a/docs/images/pt3.png b/docs/images/pt3.png
diff --git a/docs/images/workflow_small.png b/docs/images/workflow_small.png