diff --git a/CHANGELOG.md b/CHANGELOG.md
index 78c9608c..6be5da28 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -56,6 +56,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#423](https://github.com/genomic-medicine-sweden/nallo/pull/423) - Updated metro map
- [#428](https://github.com/genomic-medicine-sweden/nallo/pull/428) - Changed from using bcftools to SVDB for SV merging
- [#431](https://github.com/genomic-medicine-sweden/nallo/pull/431) - Changed `CITATIONS.md` to `docs/CITATIONS.md`,
+- [#433](https://github.com/genomic-medicine-sweden/nallo/pull/433) - Updated docs and README.
### `Removed`
diff --git a/README.md b/README.md
index 4d0856da..d038d89d 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,6 @@
**genomic-medicine-sweden/nallo** is a bioinformatics analysis pipeline for long-reads from both PacBio and (targeted) ONT-data, focused on rare-disease. Heavily influenced by best-practice pipelines such as [nf-core/sarek](https://nf-co.re/sarek), [nf-core/raredisease](https://nf-co.re/raredisease), [nf-core/nanoseq](https://github.com/nf-core/nanoseq), [PacBio Human WGS Workflow](https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake), [epi2me-labs/wf-human-variation](https://github.com/epi2me-labs/wf-human-variation) and [brentp/rare-disease-wf](https://github.com/brentp/rare-disease-wf).
-## Overview
-
@@ -78,17 +76,7 @@ nextflow run genomic-medicine-sweden/nallo \
--outdir
```
-For more details and further functionality, please refer to the [usage documentation](https://github.com/genomic-medicine-sweden/nallo/blob/dev/docs/usage.md).
-
-> [!WARNING]
-> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
-> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
-
-To run in an offline environment, download the pipeline and singularity images using [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use):
-
-```
-nf-core download genomic-medicine-sweden/nallo
-```
+For more details and further functionality, please refer to the [documentation](http://genomic-medicine-sweden.github.io/nallo/).
## Credits
diff --git a/docs/index.md b/docs/index.md
index 1413a087..0b06c33e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -7,24 +7,22 @@ description: A bioinformatics analysis pipeline for long-reads from both PacBio
**genomic-medicine-sweden/nallo** is a bioinformatics analysis pipeline for long-reads from both PacBio and (targeted) ONT-data, focused on rare-disease. Heavily influenced by best-practice pipelines such as [nf-core/sarek](https://nf-co.re/sarek), [nf-core/raredisease](https://nf-co.re/raredisease), [nf-core/nanoseq](https://github.com/nf-core/nanoseq), [PacBio Human WGS Workflow](https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake), [epi2me-labs/wf-human-variation](https://github.com/epi2me-labs/wf-human-variation) and [brentp/rare-disease-wf](https://github.com/brentp/rare-disease-wf).
-## Overview
-
## Pipeline summary
-##### QC
+### QC
- Read QC with [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [cramino](https://github.com/wdecoster/cramino) and [mosdepth](https://github.com/brentp/mosdepth)
-##### Alignment & assembly
+### Alignment & assembly
- Align reads to reference with [minimap2](https://github.com/lh3/minimap2)
- Assemble (trio-binned) haploid genomes with [hifiasm](https://github.com/chhylp123/hifiasm) (HiFi only)
-##### Variant calling
+### Variant calling
- Call SNVs & joint genotyping with [deepvariant](https://github.com/google/deepvariant) and [GLNexus](https://github.com/dnanexus-rnd/GLnexus)
- Call SVs with [Severus](https://github.com/KolmogorovLab/Severus) or [Sniffles2](https://github.com/fritzsedlazeck/Sniffles)
@@ -33,25 +31,26 @@ description: A bioinformatics analysis pipeline for long-reads from both PacBio
- Call paralogous genes with [Paraphase](https://github.com/PacificBiosciences/paraphase)
- Call variants from assembly with [dipcall](https://github.com/lh3/dipcall) (HiFi only)
-##### Phasing and methylation
+### Phasing and methylation
- Phase and haplotag reads with [LongPhase](https://github.com/twolinin/longphase), [whatshap](https://github.com/whatshap/whatshap) or [HiPhase](https://github.com/PacificBiosciences/HiPhase)
- Create methylation pileups with [modkit](https://github.com/nanoporetech/modkit)
-##### Annotation
+### Annotation
- Annotate SNVs and INDELs with databases of choice, i.e. [gnomAD](https://gnomad.broadinstitute.org), [CADD](https://cadd.gs.washington.edu) etc. with [echtvar](https://github.com/brentp/echtvar) and [VEP](https://github.com/Ensembl/ensembl-vep)
- Annotate repeat expansions with [stranger](https://github.com/Clinical-Genomics/stranger)
- Annotate SVs with [SVDB](https://github.com/J35P312/SVDB) and [VEP](https://github.com/Ensembl/ensembl-vep)
-##### Ranking
+### Ranking
- Rank SNVs with [GENMOD](https://github.com/Clinical-Genomics/genmod)
## Usage
-> [!NOTE]
-> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
+!!! note
+
+    If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
Prepare a samplesheet with input data:
@@ -74,11 +73,9 @@ nextflow run genomic-medicine-sweden/nallo \
--outdir
```
-For more details and further functionality, please refer to the [usage documentation](https://github.com/genomic-medicine-sweden/nallo/blob/dev/docs/usage.md).
+!!! warning
-> [!WARNING]
-> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
-> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
+    Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
To run in an offline environment, download the pipeline and singularity images using [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use):
@@ -86,6 +83,8 @@ To run in an offline environment, download the pipeline and singularity images u
nf-core download genomic-medicine-sweden/nallo
```
+For more details and further functionality, please refer to the [usage documentation](usage.md).
+
## Credits
genomic-medicine-sweden/nallo was originally written by Felix Lenner.
@@ -94,7 +93,7 @@ We thank the following people for their extensive assistance in the development
## Contributions and Support
-If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
+If you would like to contribute to this pipeline, please see the [contributing guidelines](https://github.com/genomic-medicine-sweden/nallo/blob/dev/.github/CONTRIBUTING.md).
## Citations
diff --git a/docs/output.md b/docs/output.md
index a96aff02..9e09cfec 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -1,366 +1,238 @@
# genomic-medicine-sweden/nallo: Output
-## Table of contents
-
-- [genomic-medicine-sweden/nallo: Output](#genomic-medicine-swedennallo-output)
- - [Table of contents](#table-of-contents)
- - [Pipeline overview](#pipeline-overview)
- - [Alignment](#alignment)
- - [Assembly](#assembly)
- - [Assembly variant calling](#assembly-variant-calling)
- - [CNV calling](#cnv-calling)
- - [Methylation](#methylation)
- - [MultiQC](#multiqc)
- - [Paraphase](#paraphase)
- - [Phasing](#phasing)
- - [Pipeline information](#pipeline-information)
- - [QC](#qc)
- - [FastQC](#fastqc)
- - [Mosdepth](#mosdepth)
- - [Cramino](#cramino)
- - [Somalier](#somalier)
- - [Repeat calling](#repeat-calling)
- - [Repeat annotation](#repeat-annotation)
- - [SNVs](#snvs)
- - [Calling](#calling)
- - [Annotation](#annotation)
- - [Ranking](#ranking)
- - [Ranked Variants](#ranked-variants)
- - [SV Calling](#sv-calling)
- - [SV Annotation](#sv-annotation)
+## Aligned reads
-## Pipeline overview
+[Minimap2](https://github.com/lh3/minimap2) is used to map the reads to a reference genome. The aligned reads are sorted, (merged) and indexed using [samtools](https://github.com/samtools/samtools).
-The directories listed below will be created in the results directory after the pipeline has finished:
+| Path | Description |
+| --------------------------------------- | ----------------------------------- |
+| `aligned_reads/minimap2/{sample}/*.bam` | Alignment file in BAM format |
+| `aligned_reads/minimap2/{sample}/*.bai` | Index of the corresponding BAM file |
-- `aligned_reads`
-- `assembly_haplotypes`
-- `assembly_variant_calling`
-- `cnv_calling`
-- `databases`
-- `methylation`
-- `multiqc`
-- `paraphase`
-- `pedigree`
-- `phasing`
-- `pipeline_info`
-- `qc`
-- `repeat_calling`
-- `snvs`
-- `svs`
+If the pipeline is run with phasing, the aligned reads will be haplotagged using the active phasing tool.
-### Alignment
+| Path | Description |
+| ----------------------------------------------------------------- | ----------------------- |
+| `{outputdir}/aligned_reads/{sample}/{sample}_haplotagged.bam` | BAM file with haplotags |
+| `{outputdir}/aligned_reads/{sample}/{sample}_haplotagged.bam.bai` | Index of the BAM file |
-[minimap2](https://github.com/lh3/minimap2) is used to map the reads to a reference genome. The aligned reads are sorted, (merged) and indexed using [samtools](https://github.com/samtools/samtools).
+!!! note
-
-Output files from Alignment
+    Alignments without haplotags are only output if phasing is off.
-- `{outputdir}/aligned_reads/minimap2/{sample}/`
- - `*.bam`: Alignment file in bam format
- - `*.bai`: Index of the corresponding bam file
-
+## Assembly
-### Assembly
+[Hifiasm](https://github.com/chhylp123/hifiasm) is used to assemble genomes. The assembled haplotypes are then converted to FASTA files using [gfastats](https://github.com/vgl-hub/gfastats). A deconstructed version of [dipcall](https://github.com/lh3/dipcall) is used to map the assembled haplotypes back to the reference genome.
-[hifiasm](https://github.com/chhylp123/hifiasm) is used to assemble genomes. The assembled haplotypes are then comverted to fasta files using [gfastats](https://github.com/vgl-hub/gfastats).
+| Path | Description |
+| ------------------------------------------------------------ | ---------------------------------------------------- |
+| `assembly_haplotypes/gfastats/{sample}/*hap1.p_ctg.fasta.gz` | Assembled haplotype 1 |
+| `assembly_haplotypes/gfastats/{sample}/*hap2.p_ctg.fasta.gz` | Assembled haplotype 2 |
+| `assembly_haplotypes/gfastats/{sample}/*.assembly_summary` | Summary statistics |
+| `assembly_variant_calling/dipcall/{sample}/*hap1.bam` | Assembled haplotype 1 mapped to the reference genome |
+| `assembly_variant_calling/dipcall/{sample}/*hap1.bai` | Index of the corresponding BAM file for haplotype 1 |
+| `assembly_variant_calling/dipcall/{sample}/*hap2.bam` | Assembled haplotype 2 mapped to the reference genome |
+| `assembly_variant_calling/dipcall/{sample}/*hap2.bai` | Index of the corresponding BAM file for haplotype 2 |
-
-Output files from Assembly
+## Methylation pileups
-- `{outputdir}/assembly_haplotypes/gfastats/{sample}/`
- - `*hap1.p_ctg.fasta.gz`: Assembled haplotype 1
- - `*hap2.p_ctg.fasta.gz`: Assembled haplotype 2
- - `*.assembly_summary`: Summary statistics
-
+[Modkit](https://github.com/nanoporetech/modkit) is used to create methylation pileups, producing bedMethyl files for both haplotagged and ungrouped reads. Additionally, methylation information can be viewed in the BAM files, for example in IGV.
-### Assembly variant calling
+| Path | Description |
+| ----------------------------------------------------------------------------------- | --------------------------------------------------------- |
+| `methylation/modkit/pileup/phased/{sample}/*.modkit_pileup_phased_*.bed.gz` | bedMethyl file with summary counts from haplotagged reads |
+| `methylation/modkit/pileup/phased/{sample}/*.modkit_pileup_phased_ungrouped.bed.gz` | bedMethyl file for ungrouped reads |
+| `methylation/modkit/pileup/unphased/{sample}/*.modkit_pileup.bed.gz` | bedMethyl file with summary counts from all reads |
+| `methylation/modkit/pileup/unphased/{sample}/*.bed.gz.tbi` | Index of the corresponding bedMethyl file |
-A deconstructed version of [dipcall](https://github.com/lh3/dipcall) is used to call variants from the assembled haplotypes. They are also mapped back to the reference genome.
+## MultiQC
-
-Output files from Assembly variant calling
+[MultiQC](http://multiqc.info) generates an HTML report summarizing all samples' QC results and pipeline statistics.
-> Dipcall produces several files, a full expanation is available [here](https://github.com/lh3/dipcall).
+| Path | Description |
+| ----------------------------- | ----------------------------------------- |
+| `multiqc/multiqc_report.html` | HTML report summarizing QC results |
+| `multiqc/multiqc_data/` | Directory containing parsed statistics |
+| `multiqc/multiqc_plots/` | Directory containing static report images |
-- `{outputdir}/assembly_variant_calling/dipcall/{sample}/`
+## Pipeline information
- - `*hap1.bam`: Assembled haplotype 1 mapped to the reference genome
- - `*hap1.bai`: Index of the corresponding bam file.
- - `*hap2.bam`: Assembled haplotype 2 mapped to the reference genome
- - `*hap2.bai`: Index of the corresponding bam file.
+[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) generates reports for troubleshooting, performance, and traceability.
-
+| Path | Description |
+| --------------------------------------- | --------------------------------- |
+| `pipeline_info/execution_report.html` | Execution report |
+| `pipeline_info/execution_timeline.html` | Timeline report |
+| `pipeline_info/execution_trace.txt` | Execution trace |
+| `pipeline_info/pipeline_dag.dot` | Pipeline DAG in DOT format |
+| `pipeline_info/pipeline_report.html` | Pipeline report |
+| `pipeline_info/software_versions.yml` | Software versions used in the run |
-### CNV calling
+## Phasing
-[HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV) is used to call CNVs. It also produces copynumber, depth and MAF tracks loadable in IGV.
+[LongPhase](https://github.com/twolinin/longphase), [WhatsHap](https://whatshap.readthedocs.io/en/latest/), or [HiPhase](https://github.com/PacificBiosciences/HiPhase) are used for phasing.
-
-Output files from CNV calling
+| Path | Description |
+| ----------------------------------------------------------------- | ----------------------------- |
+| `{outputdir}/aligned_reads/{sample}/{sample}_haplotagged.bam` | BAM file with haplotags |
+| `{outputdir}/aligned_reads/{sample}/{sample}_haplotagged.bam.bai` | Index of the BAM file |
+| `{outputdir}/phased_variants/{sample}/*.vcf.gz` | VCF file with phased variants |
+| `{outputdir}/phased_variants/{sample}/*.vcf.gz.tbi` | Index of the VCF file |
+| `{outputdir}/qc/phasing_stats/{sample}/*.blocks.tsv` | Phase block file |
+| `{outputdir}/qc/phasing_stats/{sample}/*.stats.tsv` | Phasing statistics file |
-- `{outputdir}/cnv_calling/hificnv/{sample}/`
- - `*.copynum.bedgraph`: Copy number in bedgraph format
- - `*.depth.bw`: Depth track in BigWig format
- - `*.maf.bw`: Minor allele frequencies in BigWig format
- - `*.vcf.gz`: VCF file containing CNV variants
- - `*.vcf.gz.tbi`: Index of the corresponding VCF file
-
+## QC
-### Methylation
+[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [cramino](https://github.com/wdecoster/cramino), [mosdepth](https://github.com/brentp/mosdepth), and [somalier](https://github.com/brentp/somalier) are used for read quality control.
-[modkit](https://github.com/nanoporetech/modkit) is used to create methylation pileups. bedMethyl files are stored both one file with summary counts from reads per haplotag (e.g. HP1, HP2 and ungrouped) and one file with summary counts from all reads. The methylation is also stored in the BAM files and can be viewed directly in IGV.
+### FastQC
-
-Output files from Methylation
+[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides general quality metrics for sequenced reads, including information on quality score distribution, per-base sequence content (%A/T/G/C), adapter contamination, and overrepresented sequences. For more details, refer to the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
-- `{outputdir}/methylation/modkit/pileup/phased/{sample}/`
+| Path | Description |
+| ---------------------------------------------- | --------------------------------------------------------------- |
+| `{outputdir}/qc/fastqc/{sample}/*_fastqc.html` | FastQC report containing quality metrics |
+| `{outputdir}/qc/fastqc/{sample}/*_fastqc.zip` | Zip archive with the FastQC report, data files, and plot images |
- - `*.modkit_pileup_phased_*.bed.gz`: bedMethyl file containing summary counts from reads with haplotags, e.g. 1 or 2
- - `*.modkit_pileup_phased_ungrouped.bed.gz`: bedMethyl file containing summary counts for ungrouped reads
- - `*.bed.gz.tbi`: Index of the corresponding bedMethyl file
+### Mosdepth
-- `{outputdir}/methylation/modkit/pileup/unphased/{sample}/`
- - `*.modkit_pileup.bed.gz`: bedMethyl file containing summary counts from all reads
- - `*.bed.gz.tbi`: Index of the corresponding bedMethyl file
-
+[Mosdepth](https://github.com/brentp/mosdepth) is used to report quality control metrics such as coverage and GC content from alignment files.
-### MultiQC
+| Path | Description |
+| ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
+| `{outputdir}/qc/mosdepth/{sample}/*.mosdepth.global.dist.txt` | Cumulative distribution of bases covered for at least a given coverage value, across chromosomes and the whole genome |
+| `{outputdir}/qc/mosdepth/{sample}/*.mosdepth.region.dist.txt` | Cumulative distribution of bases covered for at least a given coverage value, across regions (if a BED file is used) |
+| `{outputdir}/qc/mosdepth/{sample}/*.mosdepth.summary.txt` | Mosdepth summary file |
+| `{outputdir}/qc/mosdepth/{sample}/*.regions.bed.gz` | Depth per region (if a BED file is used) |
+| `{outputdir}/qc/mosdepth/{sample}/*.regions.bed.gz.csi` | Index of the regions.bed.gz file |
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+### Cramino
-Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see .
+[cramino](https://github.com/wdecoster/cramino) is used to analyze both phased and unphased reads.
-
-Output files
+| Path | Description |
+| -------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
+| `{outputdir}/qc/cramino/phased/{sample}/*.arrow` | Read length and quality in [Apache Arrow](https://arrow.apache.org/docs/format/Columnar.html) format |
+| `{outputdir}/qc/cramino/phased/{sample}/*.txt` | Summary information in text format |
+| `{outputdir}/qc/cramino/unphased/{sample}/*.arrow` | Read length and quality in [Apache Arrow](https://arrow.apache.org/docs/format/Columnar.html) format |
+| `{outputdir}/qc/cramino/unphased/{sample}/*.txt` | Summary information in text format |
-- `{outputdir}/multiqc/`
- - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- - `multiqc_plots/`: directory containing static images from the report in various formats.
-
+### Somalier
-### Paraphase
+[somalier](https://github.com/brentp/somalier) checks relatedness and sex.
-[Paraphase](https://github.com/PacificBiosciences/paraphase) is used to call paralogous genes. For interpreting the output, see .
+| Path | Description |
+| ---------------------------------------------------------------- | ------------------------------------------- |
+| `{outputdir}/pedigree/{project}.ped` | PED file updated with somalier-inferred sex |
+| `{outputdir}/qc/somalier/relate/{project}/{project}.html` | HTML report |
+| `{outputdir}/qc/somalier/relate/{project}/{project}.pairs.tsv` | Information about sample pairs |
+| `{outputdir}/qc/somalier/relate/{project}/{project}.samples.tsv` | Information about individual samples |
-
-Output files from Paraphase
+## Variants
-- `{outputdir}/paraphase/{sample}/`
- - `*.bam`: BAM file with haplotypes grouped by HP and colored by YC
- - `*.bai`: Index of the corresponding bam file.
- - `*.json`: Output file summarizing haplotypes and variant calls
- - `{sample}_paraphase_vcfs/`:
- - `{sample}_{gene}_vcf`: VCF file per gene
- - `{sample}_{gene}_vcf.tbi`: Index of the corresponding VCF file
-
+### CNVs
-### Phasing
+[HiFiCNV](https://github.com/PacificBiosciences/HiFiCNV) is used to call CNVs, producing copy number, depth, and MAF tracks for IGV.
-[LongPhase](https://github.com/twolinin/longphase), [WhatsHap](https://whatshap.readthedocs.io/en/latest/) or [HiPhase](https://github.com/PacificBiosciences/HiPhase) are used to phase variants and haplotag reads.
+| Path | Description |
+| ------------------------------------------------- | ----------------------------------------- |
+| `cnv_calling/hificnv/{sample}/*.copynum.bedgraph` | Copy number in bedgraph format |
+| `cnv_calling/hificnv/{sample}/*.depth.bw` | Depth track in BigWig format |
+| `cnv_calling/hificnv/{sample}/*.maf.bw` | Minor allele frequencies in BigWig format |
+| `cnv_calling/hificnv/{sample}/*.vcf.gz` | VCF file containing CNV variants |
+| `cnv_calling/hificnv/{sample}/*.vcf.gz.tbi` | Index of the corresponding VCF file |
-
-Output files from phasing
+### Paralogous genes
-- `{outputdir}/aligned_reads/{sample}/`
- - `{sample}_haplotagged.bam`: BAM file with haplotags
- - `{sample}_haplotagged.bam.bai`: Index of the corresponding bam file
-- `{outputdir}/phased_variants/{sample}/`
- - `*.vcf.gz`: VCF file with phased variants
- - `*.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/qc/phasing_stats/{sample}/`
- - `*.blocks.tsv`: File with phase blocks
- - `*.stats.tsv`: File with phasing statistics
-
+[Paraphase](https://github.com/PacificBiosciences/paraphase) is used to call paralogous genes.
-### Pipeline information
+| Path | Description |
+| ----------------------------------------------------------- | --------------------------------------- |
+| `paraphase/{sample}/*.bam` | BAM file with haplotypes grouped by HP |
+| `paraphase/{sample}/*.bai` | Index of the BAM file |
+| `paraphase/{sample}/*.json` | Summary of haplotypes and variant calls |
+| `paraphase/{sample}/{sample}_paraphase_vcfs/{sample}_{gene}_vcf` | VCF file per gene |
+| `paraphase/{sample}/{sample}_paraphase_vcfs/{sample}_{gene}_vcf.tbi` | Index of the VCF file |
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
+### Repeats
-
-Output files
+[TRGT](https://github.com/PacificBiosciences/trgt) is used to call repeats:
-- `pipeline_info/`
- - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
- - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline.
- - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
- - Parameters used by the pipeline run: `params.json`.
+| Path | Description |
+| --------------------------------------------------------------------- | ----------------------------------------- |
+| `{outputdir}/repeat_calling/trgt/multi_sample/{project}/*.vcf.gz` | Merged VCF file for all samples |
+| `{outputdir}/repeat_calling/trgt/multi_sample/{project}/*.vcf.gz.tbi` | Index of the VCF file |
+| `{outputdir}/repeat_calling/trgt/single_sample/{sample}/*.vcf.gz` | VCF file with called repeats for a sample |
+| `{outputdir}/repeat_calling/trgt/single_sample/{sample}/*.vcf.gz.tbi` | Index of the VCF file |
+| `{outputdir}/repeat_calling/trgt/single_sample/{sample}/*.bam` | BAM file with sorted spanning reads |
+| `{outputdir}/repeat_calling/trgt/single_sample/{sample}/*.bai` | Index of the BAM file |
-
+[Stranger](https://github.com/Clinical-Genomics/stranger) is used to annotate them:
-### QC
-
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [cramino](https://github.com/wdecoster/cramino), [mosdepth](https://github.com/brentp/mosdepth) and [somalier](https://github.com/brentp/somalier) are used for read QC.
-
-##### FastQC
-
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
-
-
-Output files
-
-- `{outputdir}/qc/fastqc/{sample}/`
- - `*_fastqc.html`: FastQC report containing quality metrics.
- - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
-
-
-##### Mosdepth
-
-[Mosdepth](https://github.com/brentp/mosdepth) is used to report quality control metrics such as coverage, and GC content from alignment files.
-
-
-Output files from Mosdepth
-
-- `{outputdir}/qc/mosdepth/{sample}`
- - `*.mosdepth.global.dist.txt`: This file contains a cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value across each chromosome and the whole genome
- - `*.mosdepth.region.dist.txt`: This file contains a cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value across each region, is output if running the pipeline with a BED-file
- - `*.mosdepth.summary.txt`: Mosdepth ummary file
- - `*.regions.bed.gz`: Depth per region, is output if running the pipeline with a BED-file
- - `*.regions.bed.gz.csi`: Index of regions.bed.gz
-
-
-##### Cramino
-
-[cramino](https://github.com/wdecoster/cramino) is run on both phased and unphased reads.
-
-
-Output files from Cramino
-
-- `{outputdir}/qc/cramino/phased/{sample}`
- - `*.arrow`: Read length and quality in [Apache Arrow](https://arrow.apache.org/docs/format/Columnar.html) format
- - `*.txt`: Summary information in text format
-- `{outputdir}/qc/cramino/unphased/{sample}`
- - `*.arrow`: Read length and quality in [Apache Arrow](https://arrow.apache.org/docs/format/Columnar.html) format
- - `*.txt`: Summary information in text format
-
-
-##### Somalier
-
-[somalier](https://github.com/brentp/somalier) is used to check relatedness and sex.
-
-
-Output files from Somalier
-
-- `{outputdir}/predigree/{project}.ped`: A PED file with updated from somalier sex
-- `{outputdir}/qc/somalier/relate/{project}/`
- - `{project}.html`: HTML report
- - `{project}.pairs.tsv`: Output information in sample pairs
- - `{project}.samples.tsv`: Output information per sample
-
-
-### Repeat calling
-
-[TRGT](https://github.com/PacificBiosciences/trgt) is used to call repeats.
-
-
-Output files from TRGT
-
-- `{outputdir}/repeat_calling/trgt/multi_sample/{project}/`
- - `*.vcf.gz`: Merged VCF for all samples
- - `*.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/repeat_calling/trgt/single_sample/{sample}/`
- - `*.vcf.gz`: VCF with called repeats
- - `*.vcf.gz.tbi`: Index of the corresponding VCF file
- - `*.bam`: BAM file with sorted spanning reads
- - `*.bai`: Index of the corresponding bam file
-
-
-### Repeat annotation
-
-[Stranger](https://github.com/Clinical-Genomics/stranger) is used to annotate repeats.
-
-
-Output files from Stranger
-
-- `{outputdir}/repeat_annotation/stranger/{sample}`
- - `*.vcf.gz`: Annotated VCF
- - `*.vcf.gz.tbi`: Index of the corresponding VCF file
-
+| Path | Description |
+| -------------------------------------------------------------- | ------------------------------- |
+| `{outputdir}/repeat_annotation/stranger/{sample}/*.vcf.gz` | Annotated VCF file |
+| `{outputdir}/repeat_annotation/stranger/{sample}/*.vcf.gz.tbi` | Index of the annotated VCF file |
### SNVs
-#### Calling
-
-[DeepVariant](https://github.com/google/deepvariant) is used to call variants, [bcftools](https://samtools.github.io/bcftools/bcftools.html) and [GLnexus](https://github.com/dnanexus-rnd/GLnexus) are used to merge variants.
-
-
-Output files from SNV calling
-
-> [!NOTE]
-> Variants are only output without annotation and ranking if these subworkflows are turned off.
-
-- `{outputdir}/snvs/single_sample/{sample}/`
- - `{sample}_snv.vcf.gz`: VCF with called variants with alternative genotypes from a certain sample
- - `{sample}_snv.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/snvs/multi_sample/{project}/`
- - `{project}_snv.vcf.gz`: VCF with called variants from all samples
- - `{project}_snv.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/snvs/stats/single_sample/`
- - `*.stats.txt`: Variant statistics
-
-
-#### Annotation
+[DeepVariant](https://github.com/google/deepvariant) is used to call variants, while [bcftools](https://samtools.github.io/bcftools/bcftools.html) and [GLnexus](https://github.com/dnanexus-rnd/GLnexus) are used for merging variants.
-[echtvar](https://github.com/brentp/echtvar) and [VEP](https://www.ensembl.org/vep) are used to annotate SNVs. [CADD](https://cadd.gs.washington.edu/) is used to annotate INDELs with CADD scores.
+!!! note
-
-Output files from SNV Annotation
+    Variants without annotation and ranking are only output if these subworkflows are turned off.
-> [!NOTE]
-> Variants are only output without ranking if that subworkflows are turned off.
+| Path | Description |
+| ------------------------------------------------------ | --------------------------------------------------------------------------- |
+| `snvs/single_sample/{sample}/{sample}_snv.vcf.gz` | VCF file containing called variants with alternative genotypes for a sample |
+| `snvs/single_sample/{sample}/{sample}_snv.vcf.gz.tbi` | Index of the corresponding VCF file |
+| `snvs/multi_sample/{project}/{project}_snv.vcf.gz` | VCF file containing called variants for all samples |
+| `snvs/multi_sample/{project}/{project}_snv.vcf.gz.tbi` | Index of the corresponding VCF file |
+| `snvs/stats/single_sample/*.stats.txt` | Variant statistics |
-- `{outputdir}/databases/echtvar/encode/{project}/`
- - `*.zip`: Database with AF and AC for all samples run
-- `{outputdir}/snvs/single_sample/{sample}/`
- - `{sample}_snv_annotated.vcf.gz`: VCF with annotated variants with alternative genotypes from a certain sample
- - `{sample}_snv_annotated.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/snvs/multi_sample/{project}/`
- - `{project}_snv_annotated.vcf.gz`: VCF with annotated variants from all samples
- - `{project}_snv_annotated.vcf.gz.tbi`: Index of the corresponding VCF file
-
+[echtvar](https://github.com/brentp/echtvar) and [VEP](https://www.ensembl.org/vep) are used for annotating SNVs, while [CADD](https://cadd.gs.washington.edu/) is used to annotate INDELs with CADD scores.
-#### Ranking
+!!!note
-[GENMOD](https://github.com/Clinical-Genomics/genmod) is a simple to use command line tool for annotating and analyzing genomic variations in the VCF file format. GENMOD can annotate genetic patterns of inheritance in vcf files with single or multiple families of arbitrary size. Each variant will be assigned a predicted pathogenicity score. The score will be given both as a raw score and a normalized score with values between 0 and 1. The tags in the INFO field are `RankScore` and `RankScoreNormalized`. The score can be configured to fit your annotations and preferences by modifying the score config file.
+    Variants are only output without ranking if that subworkflow is turned off.
-
-Output files from SNV ranking
+| Path | Description |
+| ---------------------------------------------------------------- | ------------------------------------------------------------------------------ |
+| `databases/echtvar/encode/{project}/*.zip` | Database with allele frequency (AF) and allele count (AC) for all samples |
+| `snvs/single_sample/{sample}/{sample}_snv_annotated.vcf.gz` | VCF file containing annotated variants with alternative genotypes for a sample |
+| `snvs/single_sample/{sample}/{sample}_snv_annotated.vcf.gz.tbi` | Index of the annotated VCF file |
+| `snvs/multi_sample/{project}/{project}_snv_annotated.vcf.gz` | VCF file containing annotated variants for all samples |
+| `snvs/multi_sample/{project}/{project}_snv_annotated.vcf.gz.tbi` | Index of the annotated VCF file |
-- `{outputdir}/snvs/single_sample/{sample}/`
- - `{sample}_snv_annotated_ranked.vcf.gz`: VCF with annotated and ranked variants with alternative genotypes from a certain sample
- - `{sample}_snv_annotated_ranked.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/snvs/multi_sample/{project}/`
- - `{project}_snv_annotated_ranked.vcf.gz`: VCF with annotated and ranked variants from all samples
- - `{project}_snv_annotated_ranked.vcf.gz.tbi`: Index of the corresponding VCF file
-
+[GENMOD](https://github.com/Clinical-Genomics/genmod) is used to rank the annotated SNVs and INDELs.
-### SV Calling
+| Path | Description |
+| ----------------------------------------------------------------------- | ----------------------------------------------------------- |
+| `snvs/single_sample/{sample}/{sample}_snv_annotated_ranked.vcf.gz` | VCF file with annotated and ranked variants for a sample |
+| `snvs/single_sample/{sample}/{sample}_snv_annotated_ranked.vcf.gz.tbi` | Index of the ranked VCF file |
+| `snvs/multi_sample/{project}/{project}_snv_annotated_ranked.vcf.gz` | VCF file with annotated and ranked variants for all samples |
+| `snvs/multi_sample/{project}/{project}_snv_annotated_ranked.vcf.gz.tbi` | Index of the ranked VCF file |
-[Severus](https://github.com/KolmogorovLab/Severus) or [Sniffles](https://github.com/fritzsedlazeck/Sniffles) is used to call structural variants, after wich [SVDB](https://github.com/J35P312/SVDB) is used to merge variants, within and between samples.
+### SVs
-
-Output files from SV Calling
+[Severus](https://github.com/KolmogorovLab/Severus) or [Sniffles](https://github.com/fritzsedlazeck/Sniffles) is used to call structural variants, and [SVDB](https://github.com/J35P312/SVDB) is used to merge variants within and between samples.
-- `{outputdir}/svs/multi_sample/{project}`
- - `{project}_svs.vcf.gz`: VCF file with SVDB merged variants
- - `{project}_svs.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/svs/single_sample/{sample}`
- - `*.vcf.gz`: VCF with SVDB merged variants, divided per sample
- - `*.vcf.gz.tbi`: Index of the corresponding VCF file
-
+!!!note
-### SV Annotation
+ Variants are only output without annotation if that subworkflow is turned off.
-[SVDB](https://github.com/J35P312/SVDB) and [VEP](https://www.ensembl.org/vep) are used to annotate SVs.
+| Path | Description |
+| ----------------------------------------------------- | ------------------------------------------------------------ |
+| `svs/multi_sample/{project}/{project}_svs.vcf.gz` | VCF file with merged structural variants for all samples |
+| `svs/multi_sample/{project}/{project}_svs.vcf.gz.tbi` | Index of the merged VCF file |
+| `svs/single_sample/{sample}/*.vcf.gz` | VCF file with merged structural variants for a single sample |
+| `svs/single_sample/{sample}/*.vcf.gz.tbi` | Index of the VCF file |
-
-Output files from SV Annotation
+[SVDB](https://github.com/J35P312/SVDB) and [VEP](https://www.ensembl.org/vep) are used to annotate structural variants.
-- `{outputdir}/svs/multi_sample/{project}`
- - `{project}_svs_annotated.vcf.gz`: VCF file with annotated merged variants
- - `{project}_svs_annotated.vcf.gz.tbi`: Index of the corresponding VCF file
-- `{outputdir}/svs/single_sample/{sample}`
- - `*.vcf_annotated.gz`: VCF with annotated variants per sample
- - `*.vcf_annotated.gz.tbi`: Index of the corresponding VCF file
-
+| Path | Description |
+| --------------------------------------------------------------- | ------------------------------------------------------------------ |
+| `svs/multi_sample/{project}/{project}_svs_annotated.vcf.gz` | VCF file with annotated merged structural variants for all samples |
+| `svs/multi_sample/{project}/{project}_svs_annotated.vcf.gz.tbi` | Index of the annotated VCF file |
+| `svs/single_sample/{sample}/*.vcf_annotated.gz` | VCF file with annotated structural variants for a single sample |
+| `svs/single_sample/{sample}/*.vcf_annotated.gz.tbi` | Index of the annotated VCF file |
diff --git a/docs/usage.md b/docs/usage.md
index dadf1238..6eea24c9 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -1,14 +1,13 @@
# genomic-medicine-sweden/nallo: Usage
-## Introduction
-
-genomic-medicine-sweden/nallo is a bioinformatics analysis pipeline to analyse long-read data.
-
## Prerequisites
1. Install Nextflow (>=24.04.2) using the instructions [here.](https://nextflow.io/docs/latest/getstarted.html#installation)
2. Install one of the following technologies for full pipeline reproducibility: Docker, Singularity, Podman, Shifter or Charliecloud.
- > Almost all nf-core pipelines give you the option to use conda as well. However, some tools used in genomic-medicine-sweden/nallo do not have a conda package so we do not support conda at the moment.
+
+!!!warning
+
+ Almost all nf-core pipelines give you the option to use conda as well. However, some tools used in genomic-medicine-sweden/nallo do not have a conda package so we do not support conda at the moment.
## Getting started
@@ -22,8 +21,10 @@ nextflow run genomic-medicine-sweden/nallo \
--outdir
```
-> Check [nf-core/configs](https://github.com/nf-core/configs/tree/master/conf) to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile test,` in your command. This enables the appropriate package manager and sets the appropriate execution settings for your machine.
-> NB: The order of profiles is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
+!!!note
+
+ Check [nf-core/configs](https://github.com/nf-core/configs/tree/master/conf) to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use `-profile test,` in your command. This enables the appropriate package manager and sets the appropriate execution settings for your machine.
+ NB: The order of profiles is important! They are loaded in sequence, so later profiles can overwrite earlier profiles.
Running the command creates the following files in your working directory
@@ -34,8 +35,9 @@ work # Directory containing the Nextflow working files
# Other Nextflow hidden files, like history of pipeline logs.
```
-> [!NOTE]
-> The default cpu and memory configurations used in nallo are written keeping the test profile (and dataset, which is tiny) in mind. You should override these values in configs to get it to work on larger datasets. Check the section `custom-configuration` below to know more about how to configure resources for your platform.
+!!!note
+
+ The default cpu and memory configurations used in nallo are written keeping the test profile (and dataset, which is tiny) in mind. You should override these values in configs to get it to work on larger datasets. Check the section `custom-configuration` below to know more about how to configure resources for your platform.
### Updating the pipeline
@@ -78,21 +80,24 @@ testrun,HG003,/path/to/HG003.bam,FAM,0,0,2,1
An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
-## Preset
+## Presets
+
+This pipeline comes with three different presets that should be set with the `--preset` parameter: `revio` (default), `pacbio` or `ONT_R10`.
-This pipeline comes with three different presets that should be set with the `--preset` parameter
+!!!note "Effect of preset on subworkflows"
-- `revio` (default)
-- `pacbio`
-- `ONT_R10`
+ The selected preset will turn off subworkflows:
-`--skip_assembly_wf` and `--skip_repeat_wf` will be set to true for `ONT_R10` and `--skip_methylation_wf` will be set to true for `pacbio`, meaning these subworkflows are not run.
+ - `--skip_assembly_wf` and `--skip_repeat_wf` will be set to `true` for `ONT_R10`
+ - `--skip_methylation_wf` will be set to `true` for `pacbio`
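
The preset-to-subworkflow mapping described in the note can be summarised as a small lookup. This is only an illustrative sketch: `preset_skip_flags` is a hypothetical helper that is not part of the pipeline, and the pipeline applies these settings itself when `--preset` is given.

```bash
# Hypothetical helper (not part of nallo): print the skip flags that a
# given --preset implies, as described in the note above.
preset_skip_flags() {
  case "$1" in
    revio)   echo "" ;;                                     # default, nothing skipped
    pacbio)  echo "--skip_methylation_wf" ;;
    ONT_R10) echo "--skip_assembly_wf --skip_repeat_wf" ;;
    *)       echo "unknown preset: $1" >&2; return 1 ;;    # not one of the three presets
  esac
}
```

For example, `preset_skip_flags ONT_R10` lists the two subworkflows that will not run for ONT data, which can be a useful sanity check before deciding which reference files to prepare.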
## Subworkflows
As indicated above, this pipeline is divided into multiple subworkflows, each with its own input requirements and outputs. By default, all subworkflows are active, and thus all mandatory input files are required.
-The only parameter mandatory for all subworkflows is the `--input` and `--outdir` parameters, all other parameters are determined by the active subworkflows. If you would run `nextflow run genomic-medicine-sweden/nallo -profile docker --outdir results --input samplesheet.csv`
+The only parameters mandatory for all subworkflows are `--input` and `--outdir`; all other parameters are determined by the active subworkflows.
+
+For example, if you run `nextflow run genomic-medicine-sweden/nallo -profile docker --outdir results --input samplesheet.csv`, the pipeline will try to guide you through which files are required:
```
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -104,9 +109,11 @@ The only parameter mandatory for all subworkflows is the `--input` and `--outdir
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
-The pipeline will try to guide you through which files are required, but a thorough description is provided below.
+A thorough description of the required files is provided below.
+
+Additionally, if you want to skip a subworkflow, you also need to explicitly skip all subworkflows that rely on it.
-Additionally, if you want to skip a subworkflow, you will need to explicitly state to skip all subworklow that relies on it. For example, `nextflow run genomic-medicine-sweden/nallo -profile docker --outdir results --input samplesheet.csv --skip_mapping_wf` will tell you
+For example, `nextflow run genomic-medicine-sweden/nallo -profile docker --outdir results --input samplesheet.csv --skip_mapping_wf` will tell you
```
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -121,7 +128,7 @@ Because almost all other subworkflows relies on the mapping subworkflow.
As described above, the files required depend on the active subworkflows. All parameters are listed [here](parameters.md), but the most useful parameters needed to run the pipeline are described in more detail below.
-### Mapping (`--skip_mapping_wf`)
+### Mapping
The majority of subworkflows depend on the mapping (alignment) subworkflow which requires `--fasta` and `--somalier_sites`.
@@ -130,11 +137,15 @@ The majority of subworkflows depend on the mapping (alignment) subworkflow which
| `fasta` | Reference genome, either gzipped or uncompressed FASTA (e.g. [GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use)) |
| `somalier_sites` | A VCF of known polymorphic sites (e.g. [sites.hg38.vcf.gz](https://github.com/brentp/somalier/files/3412456/sites.hg38.vcf.gz)), from which sex will be inferred if possible. |
-### QC (`--skip_qc`)
+Turned off with `--skip_mapping_wf`.
+
+### QC
This subworkflow depends on the mapping subworkflow, but requires no additional files.
-### Assembly (`--skip_assembly_wf`)
+Turned off with `--skip_qc`.
+
+### Assembly
This subworkflow contains both genome assembly and assembly variant calling. The assembly variant calling needs the sex of each sample; for samples with unknown sex this is inferred from aligned reads, so the subworkflow depends on the mapping subworkflow.
@@ -144,17 +155,22 @@ It requires a BED file with PAR regions.
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| `par_regions` | A BED file with PAR regions (e.g. [GRCh38_PAR.bed](https://storage.googleapis.com/deepvariant/case-study-testdata/GRCh38_PAR.bed)) |
-> [!NOTE]
-> Make sure chrY PAR is hard masked in reference genome you are using.
+!!!warning
+
+    Make sure chrY PAR is hard masked in the reference genome you are using.
+
+Turned off with `--skip_assembly_wf`.
-### Call paralogs (`--skip_call_paralogs`)
+### Call paralogs
This subworkflow depends on the mapping subworkflow, but requires no additional files.
-> [!NOTE]
-> Only GRCh38 is supported.
+!!!warning
+
+    Only GRCh38 is supported.
-### Short variant calling (`--skip_short_variant_calling`)
+Turned off with `--skip_call_paralogs`.
+
+### Short variant calling
This subworkflow depends on the mapping subworkflow, and requires the same PAR regions file as the assembly workflow.
@@ -162,7 +178,9 @@ This subworkflow depends on the mapping subworkflow, and required the same PAR r
| ------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `par_regions` | A BED file with PAR regions (e.g. [GRCh38_PAR.bed](https://storage.googleapis.com/deepvariant/case-study-testdata/GRCh38_PAR.bed)) |
-### CNV calling (`--skip_cnv_calling`)
+Turned off with `--skip_short_variant_calling`.
+
+### CNV calling
This subworkflow depends on the mapping and short variant calling subworkflows, and requires the following additional files:
@@ -172,15 +190,21 @@ This subworkflow depends on the mapping and short variant calling subworkflows,
| `hificnv_xx` | expected XX copy number regions for your reference genome (e.g. [expected_cn.hg38.XX.bed](https://github.com/PacificBiosciences/HiFiCNV/raw/main/data/expected_cn/expected_cn.hg38.XX.bed)) |
| `hificnv_exclude` | BED file specifying regions to exclude (e.g. [cnv.excluded_regions.hg38.bed.gz](https://github.com/PacificBiosciences/HiFiCNV/raw/main/data/excluded_regions/cnv.excluded_regions.hg38.bed.gz)) |
-### Phasing (`--skip_phasing_wf`)
+Turned off with `--skip_cnv_calling`.
+
+### Phasing
This subworkflow phases variants and haplotags aligned BAM files, and as such relies on the mapping and short variant calling subworkflows, but requires no additional files.
-### Methylation (`--skip_methylation_wf`)
+Turned off with `--skip_phasing_wf`.
+
+### Methylation
This subworkflow relies on mapping, short variant calling and phasing subworkflows, but requires no additional files.
-### Repeat calling (`--skip_repeat_calling`)
+Turned off with `--skip_methylation_wf`.
+
+### Repeat calling
This subworkflow requires haplotagged BAM files, and as such relies on the mapping, short variant calling and phasing subworkflows, and requires the following additional files:
@@ -188,7 +212,9 @@ This subworkflow requires haplotagged BAM files, and such relies on the mapping,
| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `trgt_repeats` | a BED file with tandem repeats matching your reference genome (e.g. [pathogenic_repeats.hg38.bed](https://github.com/PacificBiosciences/trgt/raw/main/repeats/pathogenic_repeats.hg38.bed)) |
-### Repeat annotation (`--skip_repeat_annotation`)
+Turned off with `--skip_repeat_calling`.
+
+### Repeat annotation
This subworkflow relies on the mapping, short variant calling, phasing and repeat calling subworkflows, and requires the following additional files:
@@ -196,7 +222,9 @@ This subworkflow relies on the mapping, short variant calling, phasing and repea
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `variant_catalog` | a variant catalog matching your reference (e.g. [variant_catalog_grch38.json](https://github.com/Clinical-Genomics/stranger/raw/main/stranger/resources/variant_catalog_grch38.json)) |
-### SNV annotation (`--skip_snv_annotation`)
+Turned off with `--skip_repeat_annotation`.
+
+### SNV annotation
This subworkflow relies on the mapping and short variant calling, and requires the following additional files:
@@ -228,21 +256,36 @@ gnomad,/path/to/gnomad.v3.1.2.echtvar.popmax.v2.zip
cadd,/path/to/cadd.v1.6.hg38.zip
```
-> [!WARNING]
-> Generating an echtvar database from a VCF-file is a fairly straightforward process described on the [echtvar GitHub](https://github.com/brentp/echtvar). However, the pre-made `gnomad.v3.1.2.echtvar.v2.zip` provided by them results in malformed INFO lines that are not compatible with genmod (run in the subsequent ranking subworkflow).
->
-> For a very small test database that only overlaps the coordinates of the pipeline test data set, you could use [`cadd.v1.6.hg38.test_data.zip`](https://github.com/genomic-medicine-sweden/test-datasets/raw/refs/heads/nallo/reference/cadd.v1.6.hg38.test_data.zip) to get started.
+!!!warning
+
+ Generating an echtvar database from a VCF-file is a fairly straightforward process described on the [echtvar GitHub](https://github.com/brentp/echtvar). However, the pre-made `gnomad.v3.1.2.echtvar.v2.zip` provided by them results in malformed INFO lines that are not compatible with genmod (run in the subsequent ranking subworkflow).
+
+ For a very small test database that only overlaps the coordinates of the pipeline test data set, you could use [`cadd.v1.6.hg38.test_data.zip`](https://github.com/genomic-medicine-sweden/test-datasets/raw/refs/heads/nallo/reference/cadd.v1.6.hg38.test_data.zip) to get started.
-> [!NOTE]
-> Optionally, to calcuate CADD scores for small indels, supply a path to a folder containing cadd annotations with `--cadd_resources` and prescored indels with `--cadd_prescored`. Equivalent of the `data/annotations/` and `data/prescored/` folders described [here](https://github.com/kircherlab/CADD-scripts/#manual-installation). CADD scores for SNVs can be annotated through echvtvar and `--snp_db`.
+!!!tip
-### SV annotation (`--skip_sv_annotation`)
+    Optionally, to calculate CADD scores for small indels, supply a path to a folder containing CADD annotations with `--cadd_resources` and prescored indels with `--cadd_prescored`. These are equivalent to the `data/annotations/` and `data/prescored/` folders described [here](https://github.com/kircherlab/CADD-scripts/#manual-installation). CADD scores for SNVs can be annotated through echtvar and `--snp_db`.
+
+Turned off with `--skip_snv_annotation`.
+
+### Rank SNVs and INDELs
+
+This subworkflow ranks SNVs and INDELs. It relies on the mapping, short variant calling and SNV annotation subworkflows, and requires the following additional files:
+
+| Parameter | Description |
+| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `score_config_snv` | Used by GENMOD when ranking variants. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/rank_model_snv.ini). |
+| `reduced_penetrance` | A list of loci that show [reduced penetrance](https://medlineplus.gov/genetics/understanding/inheritance/penetranceexpressivity/) in people. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/reduced_penetrance.tsv) |
+
+Turned off with `--skip_rank_variants`.
+
+### SV annotation
This subworkflow relies on the mapping subworkflow, and requires the following additional files:
-| Parameter | Description |
-| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `svdb_dbs` 1 | Csv file with databases used for structural variant annotation in vcf format. Help
Path to comma-separated file containing information about the databases used for structural variant annotation. |
+| Parameter | Description |
+| ----------------------- | ----------------------------------------------------------------------------- |
+| `svdb_dbs` 1 | Csv file with databases used for structural variant annotation in vcf format. |
1 Example file for input with `--svdb_dbs`:
@@ -253,22 +296,15 @@ https://github.com/genomic-medicine-sweden/test-datasets/raw/b9ff54b59cdd39df5b6
These databases could for example come from [CoLoRSdb](https://zenodo.org/records/13145123).
-### Rank variants (`--skip_rank_variants`)
-
-This subworkflow ranks SNVs, and relies on the mapping, short variant calling and SNV annotation subworkflows, and requires the following additional files:
-
-| Parameter | Description |
-| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `score_config_snv` | Used by GENMOD when ranking variants. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/rank_model_snv.ini). |
-| `reduced_penetrance` | A list of loci that show [reduced penetrance](https://medlineplus.gov/genetics/understanding/inheritance/penetranceexpressivity/) in people. Sample file [here](https://github.com/nf-core/test-datasets/blob/raredisease/reference/reduced_penetrance.tsv) |
+Turned off with `--skip_sv_annotation`.
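
Since skipping a subworkflow also requires skipping everything that relies on it, the dependencies listed in the sections above can be collected into a small lookup. The following is a rough sketch only: `deps_of` and `flags_to_skip` are hypothetical helpers, not part of the pipeline, and the dependency lists are transcribed from the descriptions in this document. The pipeline's own error message remains the authoritative source.

```bash
# Hypothetical helpers (not part of nallo). Subworkflows are identified by
# their skip flag without the leading "--".
ALL_SUBWORKFLOWS="skip_mapping_wf skip_qc skip_assembly_wf skip_call_paralogs \
skip_short_variant_calling skip_cnv_calling skip_phasing_wf skip_methylation_wf \
skip_repeat_calling skip_repeat_annotation skip_snv_annotation skip_rank_variants \
skip_sv_annotation"

# Print the subworkflows that a given subworkflow relies on,
# as stated in the sections above.
deps_of() {
  case "$1" in
    skip_qc|skip_assembly_wf|skip_call_paralogs|skip_short_variant_calling|skip_sv_annotation)
      echo "skip_mapping_wf" ;;
    skip_cnv_calling|skip_phasing_wf|skip_snv_annotation)
      echo "skip_mapping_wf skip_short_variant_calling" ;;
    skip_methylation_wf|skip_repeat_calling)
      echo "skip_mapping_wf skip_short_variant_calling skip_phasing_wf" ;;
    skip_repeat_annotation)
      echo "skip_mapping_wf skip_short_variant_calling skip_phasing_wf skip_repeat_calling" ;;
    skip_rank_variants)
      echo "skip_mapping_wf skip_short_variant_calling skip_snv_annotation" ;;
  esac
}

# Print the given skip flag followed by the skip flags of every
# subworkflow that relies on it.
flags_to_skip() {
  local target="$1" wf dep
  printf '%s' "--$target"
  for wf in $ALL_SUBWORKFLOWS; do
    for dep in $(deps_of "$wf"); do
      if [ "$dep" = "$target" ]; then
        printf ' %s' "--$wf"
        break
      fi
    done
  done
  printf '\n'
}
```

For instance, `flags_to_skip skip_phasing_wf` lists the phasing flag together with the methylation, repeat calling and repeat annotation flags, since those three subworkflows rely on phasing.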
-### Other highlighted parameters
+## Other highlighted parameters
- Limit SNV calling to regions in BED file (`--bed`).
- By default SNV-calling is split into 13 parallel processes, this speeds up the variant calling significantly. Limit this by setting `--parallel_snv` to a different number.
- By default the pipeline does not perform parallel alignment, but this can be changed by setting `--parallel_alignments` to split the alignment into multiple processes. This comes with some additional overhead, but speeds up the alignment significantly.
-### Reproducibility
+## Reproducibility
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
@@ -278,13 +314,15 @@ This version number will be logged in reports when you run the pipeline, so that
To further assist in reproducibility, you can share and re-use [parameter files](#running-the-pipeline) to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
-> [!TIP]
-> If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles.
+!!!tip
+
+    If you wish to share such a profile (for example as supplementary material for an academic publication), make sure NOT to include cluster-specific paths to files or institution-specific profiles.
## Core Nextflow arguments
-> [!NOTE]
-> These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
+!!!note
+
+ These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
### `-profile`
@@ -377,21 +415,31 @@ NXF_OPTS='-Xms1g -Xmx4g'
## Running the pipeline without internet access
-The pipeline and container images can be downloaded using [nf-core tools](https://nf-co.re/docs/usage/offline). For running offline, you of course have to make all the reference data available locally, and specify `--fasta`, etc., see [above](#reference-files-and-parameters).
+### Download pipeline and containers
-Contrary to the paragraph about [Nextflow](https://nf-co.re/docs/usage/offline#nextflow) on the page linked above, it is not possible to use the "-all" packaged version of Nextflow for this pipeline. The online version of Nextflow is necessary to support the necessary nextflow plugins. Download instead the file called just `nextflow`. Nextflow will download its dependencies when it is run. Additionally, you need to download the nf-validation plugin explicitly:
+The pipeline and container images can be downloaded using `nf-core download`, e.g.:
+```bash
+nf-core download genomic-medicine-sweden/nallo -r 0.3.2
```
-./nextflow plugin install nf-validation
-```
-Now you can transfer the `nextflow` binary as well as its directory `$HOME/.nextflow` to the system without Internet access, and use it there. It is necessary to use an explicit version of `nf-validation` offline, or Nextflow will check for the most recent version online. Find the version of nf-validation you downloaded in `$HOME/.nextflow/plugins`, then specify this version for `nf-validation` in your configuration file:
+### Download references
+
+When running offline, you will have to make all the reference data available locally. The test profile will not be able to fetch data automatically.
+
+### Download plugins
+
+Follow [this section](https://nf-co.re/docs/usage/offline#nextflow) of the nf-core docs to download Nextflow plugins on a computer connected to the internet and transfer them to the offline environment.
+
+It is necessary to use an explicit version of `nf-schema` offline, or Nextflow will check for the most recent version online.
+
+Find the version of `nf-schema` you downloaded in `$HOME/.nextflow/plugins`, then specify this version for `nf-schema` in your configuration file:
```
plugins {
// Set the plugin version explicitly, otherwise nextflow will look for the newest version online.
- id 'nf-validation@1.1.3'
+ id 'nf-schema@2.1.1'
}
```
-This should go in your Nextflow confgiguration file, specified with `-c ` when running the pipeline.
+This should go in your Nextflow configuration file, specified with `-c ` when running the pipeline.
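
Looking up the downloaded plugin version and writing the matching `plugins` block can be scripted. The sketch below assumes that each plugin is stored in a directory named `<plugin>-<version>` inside the plugins directory. `pin_plugin` is a hypothetical helper, not something shipped with Nextflow or the pipeline.

```bash
# Hypothetical helper: print a Nextflow `plugins` block that pins a plugin to
# the version found in a plugins directory (default: $HOME/.nextflow/plugins).
# Assumes plugin directories are named <plugin>-<version>.
pin_plugin() {
  local plugin="$1" dir="${2:-$HOME/.nextflow/plugins}" path version
  # If several versions are present, take the last one in lexical order.
  path=$(ls -d "$dir/$plugin"-* 2>/dev/null | tail -n 1)
  if [ -z "$path" ]; then
    echo "no $plugin found in $dir" >&2
    return 1
  fi
  # Strip everything up to and including "<plugin>-" to leave the version.
  version="${path##*/$plugin-}"
  printf "plugins {\n    id '%s@%s'\n}\n" "$plugin" "$version"
}
```

Running `pin_plugin nf-schema > offline.config` on the machine where the plugin was downloaded would produce a file that can be passed with `-c` (the name `offline.config` is only an example).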