Annotate SVs #408

Merged
merged 8 commits into from
Oct 14, 2024
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -37,6 +37,7 @@ jobs:
- "SHORT_VARIANT_CALLING"
- "SNV_ANNOTATION"
- "CALL_SVS"
- "ANNOTATE_SVS"
profile:
- "docker"

1 change: 1 addition & 0 deletions .nf-core.yml
@@ -10,6 +10,7 @@ lint:
- .github/workflows/awstest.yml
- .github/workflows/awsfulltest.yml
- conf/modules.config
- conf/igenomes_ignored.config
files_unchanged:
- CODE_OF_CONDUCT.md
- assets/nf-core-nallo_logo_light.png
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -20,6 +20,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#388](https://github.com/genomic-medicine-sweden/nallo/pull/388) - Added single-sample tbi output to the short variant calling subworkflow
- [#393](https://github.com/genomic-medicine-sweden/nallo/pull/393) - Added a new `--minimap2_read_mapping_preset` parameter
- [#403](https://github.com/genomic-medicine-sweden/nallo/pull/403) - Added `FOUND_IN=hificnv` tags to CNV calling output
- [#408](https://github.com/genomic-medicine-sweden/nallo/pull/408) - Added a new subworkflow to annotate SVs
- [#417](https://github.com/genomic-medicine-sweden/nallo/pull/417) - Added `FOUND_IN=deepvariant` tags to SNV calling output

### `Changed`
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -110,6 +110,10 @@

> Nilsson D, Magnusson M. moonso/stranger v0.7.1. Published online February 18, 2021. doi:10.5281/ZENODO.4548873
- [SVDB](https://github.com/J35P312/SVDB)

> Eisfeldt et al., 2017.
- [Tabix](https://academic.oup.com/bioinformatics/article/27/5/718/262743)

> Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718-719. doi:10.1093/bioinformatics/btq671
1 change: 1 addition & 0 deletions README.md
@@ -46,6 +46,7 @@

- Annotate SNVs and INDELs with databases of choice, e.g. [gnomAD](https://gnomad.broadinstitute.org) and [CADD](https://cadd.gs.washington.edu), using [echtvar](https://github.com/brentp/echtvar) and [VEP](https://github.com/Ensembl/ensembl-vep)
- Annotate repeat expansions with [stranger](https://github.com/Clinical-Genomics/stranger)
- Annotate SVs with [SVDB](https://github.com/J35P312/SVDB) and [VEP](https://github.com/Ensembl/ensembl-vep)

##### Ranking

40 changes: 40 additions & 0 deletions assets/svdb_query_vcf_schema.json
@@ -0,0 +1,40 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/nf-core/raredisease/master/assets/mobile_element_references_schema.json",
"title": "Schema for SVDB query - VCF",
"description": "Schema for the SVDB query database input, VCF version",
"type": "array",
"items": {
"type": "object",
"properties": {
"filename": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.vcf(\\.gz)?$",
"errorMessage": "Path to query database cannot contain spaces and must be a vcf file"
},
"in_freq_info_key": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "In frequency key cannot contain spaces"
},
"in_allele_count_info_key": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "In allele count key cannot contain spaces"
},
"out_freq_info_key": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Out frequency key must be provided and cannot contain spaces"
},
"out_allele_count_info_key": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Out allele count key must be provided and cannot contain spaces"
}
},
"required": ["filename", "out_freq_info_key", "out_allele_count_info_key"]
}
}
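
As a rough illustration of what the schema above asks of each row, the following Python sketch applies comparable checks. It is not part of the PR: `validate_row` and the constants are hypothetical names, the filename rule here requires a `.vcf` or `.vcf.gz` suffix, and the schema's `exists` check on the file path is not reproduced.

```python
import re

# Hypothetical constants mirroring the rules in the schema above.
NO_SPACE = re.compile(r"^\S+$")
VCF_PATH = re.compile(r"^\S+\.vcf(\.gz)?$")
REQUIRED = ("filename", "out_freq_info_key", "out_allele_count_info_key")
OPTIONAL = ("in_freq_info_key", "in_allele_count_info_key")


def validate_row(row):
    """Return a list of human-readable problems for one database row."""
    problems = []
    # Required fields must be present and non-empty.
    for key in REQUIRED:
        if not row.get(key):
            problems.append(f"missing required field: {key}")
    # Any value that is present must satisfy its pattern.
    for key in REQUIRED + OPTIONAL:
        value = row.get(key)
        if not value:
            continue
        pattern = VCF_PATH if key == "filename" else NO_SPACE
        if not pattern.match(value):
            problems.append(f"invalid value for {key}: {value!r}")
    return problems
```

An empty list means the row passes these checks; the pipeline itself relies on its own schema validation rather than on anything like this helper.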
57 changes: 57 additions & 0 deletions conf/modules/annotate_svs.config
@@ -0,0 +1,57 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Config file for defining DSL2 per module options and publishing paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Available keys to override module options:
ext.args = Additional arguments appended to command in module.
ext.args2 = Second set of arguments appended to command in module (multi-tool modules).
ext.args3 = Third set of arguments appended to command in module (multi-tool modules).
ext.prefix = File name prefix for output files.
----------------------------------------------------------------------------------------
*/

process {

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Annotate SVs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

withName: '.*:ANNOTATE_SVS:.*' {
publishDir = [
enabled: false,
]
}

withName: '.*ANNOTATE_SVS:ENSEMBLVEP_SV' {
ext.args = { [
"${params.extra_vep_options}",
"--dir_plugins .",
'--plugin pLI,pLI_values.txt',
'--appris --biotype --buffer_size 100 --canonical --cache --ccds',
'--compress_output bgzip --distance 5000 --domains',
'--exclude_predicted --force_overwrite --format vcf',
'--hgvs --humdiv --max_sv_size 248387328',
'--no_progress --numbers --per_gene --polyphen p',
'--protein --offline --sift p --regulatory',
'--symbol --tsl --uniprot --vcf',
'--no_stats'
].join(' ') }
ext.prefix = { "${meta.id}_svs_annotated" }
publishDir = [
path: { "${params.outdir}/svs/multi_sample/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: '.*ANNOTATE_SVS:TABIX_ENSEMBLVEP_SV' {
publishDir = [
path: { "${params.outdir}/svs/multi_sample/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

}
7 changes: 1 addition & 6 deletions conf/modules/call_svs.config
@@ -57,7 +57,7 @@ process {
publishDir = [
path: { "${params.outdir}/svs/multi_sample/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
saveAs: { filename -> filename.equals('versions.yml') || !params.skip_sv_annotation ? null : filename }
]
}

@@ -67,10 +67,5 @@ process {
'--output-type z',
'--write-index=tbi'
].join(' ')
publishDir = [
path: { "${params.outdir}/svs/single_sample/${meta.id}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
}
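
The updated `saveAs` closure in the first hunk above combines `||` with a ternary. A minimal Python sketch of the decision it encodes, assuming Groovy parses `a || b ? x : y` as `(a || b) ? x : y` and that a `null` result means the file is not published (`publish_name` is a hypothetical name used only here):

```python
from typing import Optional


def publish_name(filename: str, skip_sv_annotation: bool) -> Optional[str]:
    """Mirror of the closure: None stands in for Groovy's null (not published)."""
    # versions.yml is never published; neither is the unannotated VCF
    # when SV annotation runs, since an annotated file is published instead.
    if filename == "versions.yml" or not skip_sv_annotation:
        return None
    return filename
```

In other words, this multi-sample output is only published from the calling step when `--skip_sv_annotation` is set.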
26 changes: 25 additions & 1 deletion conf/modules/general.config
@@ -57,7 +57,7 @@ process {
]
}

withName: '.*:NALLO:BCFTOOLS_PLUGINSPLIT' {
withName: '.*:NALLO:BCFTOOLS_PLUGINSPLIT_SNVS' {
ext.args = [
'-i \'GT="alt"\'',
'--output-type z',
@@ -81,6 +81,30 @@ process {
]
}

withName: '.*:NALLO:BCFTOOLS_PLUGINSPLIT_SVS' {
ext.args = [
'-i \'GT="alt"\'',
'--output-type z',
'--write-index=tbi'
].join(' ')
publishDir = [
path: { "${params.outdir}/svs/single_sample/" },
mode: params.publish_dir_mode,
// Can't use prefix as it would come from the original file
saveAs: { filename ->
if (filename.equals('versions.yml')) {
null
} else {
def matcher = filename =~ /(.+)(\.vcf\.gz(?:\.tbi)?)$/
def sample = matcher[0][1]
def extension = matcher[0][2]
def annotated = params.skip_sv_annotation ? "" : "_annotated"
"${sample}/${sample}_svs${annotated}${extension}"
}
}
]
}

withName: '.*:NALLO:SAMPLESHEET_PED' {
publishDir = [
enabled: false
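
The `saveAs` closure added for `BCFTOOLS_PLUGINSPLIT_SVS` above renames each per-sample file at publish time. A minimal Python sketch of the same mapping, for illustration only: `single_sample_sv_path` and the sample name `HG002` are hypothetical, and the sketch raises `ValueError` on an unexpected file name where the Groovy closure would fail on `matcher[0]`.

```python
import re
from typing import Optional

# Same pattern as the closure: sample name, then .vcf.gz with optional .tbi.
_VCF = re.compile(r"(.+)(\.vcf\.gz(?:\.tbi)?)$")


def single_sample_sv_path(filename: str, skip_sv_annotation: bool) -> Optional[str]:
    """Return the path relative to svs/single_sample/, or None to skip publishing."""
    if filename == "versions.yml":
        return None
    match = _VCF.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    sample, extension = match.group(1), match.group(2)
    # The suffix records whether the published VCF went through annotation.
    annotated = "" if skip_sv_annotation else "_annotated"
    return f"{sample}/{sample}_svs{annotated}{extension}"
```

For example, `single_sample_sv_path("HG002.vcf.gz", False)` gives `HG002/HG002_svs_annotated.vcf.gz`, and with annotation skipped the index `HG002.vcf.gz.tbi` maps to `HG002/HG002_svs.vcf.gz.tbi`.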
2 changes: 1 addition & 1 deletion conf/modules/snv_annotation.config
@@ -36,7 +36,7 @@ process {
].join(' ')
}

withName: '.*:SNV_ANNOTATION:ENSEMBLVEP_VEP' {
withName: '.*:SNV_ANNOTATION:ENSEMBLVEP_SNV' {
ext.prefix = { "${meta.id}_vep" }
ext.args = { [
"${params.extra_vep_options}",
22 changes: 12 additions & 10 deletions conf/test.config
@@ -10,14 +10,6 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile'
config_profile_description = 'Minimal test dataset to check pipeline function'
@@ -26,7 +18,7 @@ params {
modules_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/'

// Base directory for genomic-medicine-sweden/nallo test data
pipelines_testdata_base_path = 'https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/1f4e062926fc10f70a38e917e5771edb333e89bf/'
pipelines_testdata_base_path = 'https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/7be7114cb132be8cae9343f225bcf42ec11ecc1b/'

// References
fasta = params.pipelines_testdata_base_path + 'reference/hg38.test.fa.gz'
@@ -41,6 +33,7 @@ params {
vep_cache = params.pipelines_testdata_base_path + 'reference/vep_cache_test_data.tar.gz'
vep_plugin_files = params.pipelines_testdata_base_path + 'reference/vep_plugin_files.csv'
snp_db = params.pipelines_testdata_base_path + 'testdata/snp_dbs.csv'
svdb_dbs = params.pipelines_testdata_base_path + 'testdata/svdb_dbs.csv'
reduced_penetrance = params.pipelines_testdata_base_path + 'reference/reduced_penetrance.tsv'
score_config_snv = params.pipelines_testdata_base_path + 'reference/rank_model_snv.ini'
variant_consequences_snv = params.pipelines_testdata_base_path + 'reference/variant_consequences_v2.txt'
@@ -59,7 +52,7 @@ params {

// Impose same minimum Nextflow version as in nextflow.config
manifest {
nextflowVersion = '!>=23.04.0'
nextflowVersion = '!>=24.04.2'
}

// Disable all Nextflow reporting options
@@ -69,16 +62,25 @@ trace { enabled = false }
dag { enabled = false }

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
withLabel: 'process_high' {
cpus = 4
memory = '15.GB'
}
withLabel: 'process_medium' {
cpus = 2
memory = '7.GB'
}
withLabel: 'process_low' {
cpus = 1
memory = '3.GB'
}
withLabel: 'process_single' {
cpus = 1
memory = '3.GB'
}
}
16 changes: 16 additions & 0 deletions docs/output.md
@@ -27,6 +27,7 @@
- [Ranking](#ranking)
- [Ranked Variants](#ranked-variants)
- [SV Calling](#sv-calling)
- [SV Annotation](#sv-annotation)

## Pipeline overview

@@ -348,3 +349,18 @@ Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQ
- `*.vcf.gz`: VCF with variants per sample
- `*.vcf.gz.tbi`: Index of the corresponding VCF file
</details>

### SV Annotation

[SVDB](https://github.com/J35P312/SVDB) and [VEP](https://www.ensembl.org/vep) are used to annotate SVs.

<details markdown="1">
<summary>Output files from SV Annotation</summary>

- `{outputdir}/svs/multi_sample/{project}`
- `{project}_svs_annotated.vcf.gz`: VCF file with annotated merged variants
- `{project}_svs_annotated.vcf.gz.tbi`: Index of the corresponding VCF file
- `{outputdir}/svs/single_sample/{sample}`
- `{sample}_svs_annotated.vcf.gz`: VCF with annotated variants per sample
- `{sample}_svs_annotated.vcf.gz.tbi`: Index of the corresponding VCF file
</details>
4 changes: 3 additions & 1 deletion docs/parameters.md
@@ -17,6 +17,7 @@ Allows skipping certain parts of the pipeline
| `skip_repeat_annotation` | Skip tandem repeat annotation | `boolean` | False | | |
| `skip_phasing_wf` | Skip phasing of variants and haplotagging of reads | `boolean` | False | | |
| `skip_snv_annotation` | Skip short variant annotation | `boolean` | False | | |
| `skip_sv_annotation` | Skip structural variant annotation | `boolean` | False | | |
| `skip_cnv_calling` | Skip CNV calling | `boolean` | False | | |
| `skip_call_paralogs` | Skip the calling of specific paralogous genes | `boolean` | False | | |
| `skip_rank_variants` | Skip ranking of short variants | `boolean` | False | | |
@@ -37,6 +38,7 @@ Define where the pipeline should find input data and save output data.
| `tandem_repeats` | A tandem repeat BED file for sniffles | `string` | | | |
| `trgt_repeats` | A BED file with repeats to be genotyped with TRGT | `string` | | | |
| `snp_db` | A csv file with echtvar databases to annotate SNVs with | `string` | | | |
| `svdb_dbs` | Databases used for structural variant annotation in vcf format. <details><summary>Help</summary><small>Path to comma-separated file containing information about the databases used for structural variant annotation.</small></details>| `string` | | | |
| `variant_catalog` | A variant catalog json-file for stranger | `string` | | | |
| `variant_consequences_snv` | File containing list of SO terms listed in the order of severity from most severe to least severe for annotating genomic SNVs. For more information check https://ensembl.org/info/genome/variation/prediction/predicted_data.html | `string` | | | |
| `vep_cache` | A path to the VEP cache location | `string` | | | |
@@ -47,7 +49,7 @@ Define where the pipeline should find input data and save output data.
| `reduced_penetrance` | A file with gene ids that have reduced penetrance. For use with genmod. | `string` | | | |
| `score_config_snv` | A SNV rank model config file for genmod. | `string` | | | |
| `somalier_sites` | A VCF of known polymorphic sites for somalier | `string` | | | |
| `pipelines_testdata_base_path` | Base URL or local path to location of pipeline test dataset files | `string` | https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/1f4e062926fc10f70a38e917e5771edb333e89bf/ | | True |
| `pipelines_testdata_base_path` | Base URL or local path to location of pipeline test dataset files | `string` | https://raw.githubusercontent.com/genomic-medicine-sweden/test-datasets/7be7114cb132be8cae9343f225bcf42ec11ecc1b/ | | True |

## Reference genome options

19 changes: 18 additions & 1 deletion docs/usage.md
@@ -236,9 +236,26 @@ cadd,/path/to/cadd.v1.6.hg38.zip
> [!NOTE]
> Optionally, to calculate CADD scores for small indels, supply a path to a folder containing CADD annotations with `--cadd_resources` and prescored indels with `--cadd_prescored`. These are equivalent to the `data/annotations/` and `data/prescored/` folders described [here](https://github.com/kircherlab/CADD-scripts/#manual-installation). CADD scores for SNVs can be annotated through echtvar and `--snp_db`.

### SV annotation (`--skip_sv_annotation`)

This subworkflow relies on the mapping subworkflow, and requires the following additional files:

| Parameter | Description |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `svdb_dbs` <sup>1</sup> | CSV file with databases used for structural variant annotation in VCF format. <details><summary>Help</summary><small>Path to comma-separated file containing information about the databases used for structural variant annotation.</small></details> |

<sup>1</sup> Example file for input with `--svdb_dbs`:

```
filename,in_freq_info_key,in_allele_count_info_key,out_freq_info_key,out_allele_count_info_key
https://github.com/genomic-medicine-sweden/test-datasets/raw/b9ff54b59cdd39df5b6e278a30b08d94075a644c/reference/colorsdb.test_data.vcf.gz,AF,AC,colorsdb_af,colorsdb_ac
```

These databases could for example come from [CoLoRSdb](https://zenodo.org/records/13145123).
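
To make the layout of the `--svdb_dbs` file concrete, here is a small Python sketch that reads a file with the header shown above and groups the values by column. It is only an illustration: the database path in `EXAMPLE` is a placeholder, `read_svdb_dbs` and `collect_columns` are hypothetical helpers, and how the pipeline itself consumes the file is not shown in this diff.

```python
import csv
import io

# Placeholder content using the same header as the example above.
EXAMPLE = (
    "filename,in_freq_info_key,in_allele_count_info_key,"
    "out_freq_info_key,out_allele_count_info_key\n"
    "/path/to/colorsdb.vcf.gz,AF,AC,colorsdb_af,colorsdb_ac\n"
)


def read_svdb_dbs(handle):
    """Parse the CSV into one dict per database, keyed by column name."""
    return [dict(row) for row in csv.DictReader(handle)]


def collect_columns(rows):
    """Group values by column, keeping the databases in file order."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


rows = read_svdb_dbs(io.StringIO(EXAMPLE))
columns = collect_columns(rows)
```

Each row describes one database: the `in_*` keys name the INFO fields read from the database VCF, and the `out_*` keys name the INFO fields written to the annotated output.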

### Rank variants (`--skip_rank_variants`)

This subworkflow relies on the mapping, short variant calling and SNV annotation subworkflows, and requires the following additional files:
This subworkflow ranks SNVs. It relies on the mapping, short variant calling and SNV annotation subworkflows, and requires the following additional files:

| Parameter | Description |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
5 changes: 5 additions & 0 deletions modules.json
@@ -244,6 +244,11 @@
"git_sha": "4806239588f35d27a95b187b4000d80e15152022",
"installed_by": ["modules"]
},
"svdb/query": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"tabix/bgzip": {
"branch": "master",
"git_sha": "b20be35facfc5acdc1259f132ed79339d79e989f",
5 changes: 5 additions & 0 deletions modules/nf-core/svdb/query/environment.yml