Merge pull request #137 from molgenis/feat/new_features
Feat/new features + VEP update
SietsmaRJ authored Oct 4, 2022
2 parents f72baa4 + a8f46b1 commit 3d020b9
Showing 9 changed files with 42 additions and 11 deletions.
47 changes: 38 additions & 9 deletions README.md
@@ -11,12 +11,34 @@ below.
 CAPICE can be used as online service at http://molgenis.org/capice

 ## Requirements
-* VEP v105
+The list below is a complete list. Depending on whether GRCh37 and/or GRCh38 is used and whether all
+mentioned features are used, some items in the list below can be skipped.
+
+* VEP v107
+* Including VEP cache (which needs to be unarchived!):
+* [homo_sapiens_refseq_vep_107_GRCh37](http://ftp.ensembl.org/pub/release-107/variation/indexed_vep_cache/homo_sapiens_refseq_vep_107_GRCh37.tar.gz)
+* [homo_sapiens_refseq_vep_107_GRCh38](http://ftp.ensembl.org/pub/release-107/variation/indexed_vep_cache/homo_sapiens_refseq_vep_107_GRCh38.tar.gz)
 * Including plugin(s):
-* [SpliceAI](https://m.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#spliceai)
+* [Grantham](https://github.com/molgenis/vip/blob/master/resources/vep/plugins/Grantham.pm)
+* [SpliceAI](https://github.com/molgenis/vip/blob/master/resources/vep/plugins/SpliceAI.pm)
+* Including additional data (GRCh37) [available here](https://download.molgeniscloud.org/downloads/vip/resources/GRCh37/):
+* `gnomad.total.r2.1.1.sites.stripped.vcf.gz`
+* `gnomad.total.r2.1.1.sites.stripped.vcf.gz.csi`
+* `hg19.100way.phyloP100way.bw`
+* `spliceai_scores.masked.indel.hg19.vcf.gz`
+* `spliceai_scores.masked.indel.hg19.vcf.gz.tbi`
+* `spliceai_scores.masked.snv.hg19.vcf.gz`
+* `spliceai_scores.masked.snv.hg19.vcf.gz.tbi`
+* Including additional data (GRCh38) [available here](https://download.molgeniscloud.org/downloads/vip/resources/GRCh38/):
+* `gnomad.genomes.v3.1.2.sites.stripped.vcf.gz`
+* `gnomad.genomes.v3.1.2.sites.stripped.vcf.gz.csi`
+* `hg38.phyloP100way.bw`
+* `spliceai_scores.masked.indel.hg38.vcf.gz`
+* `spliceai_scores.masked.indel.hg38.vcf.gz.tbi`
+* `spliceai_scores.masked.snv.hg38.vcf.gz`
+* `spliceai_scores.masked.snv.hg38.vcf.gz.tbi`
 * BCF tools v1.14-1
 * Python >=3.8
-* [Stripped gnomAD allele frequency counts](https://download.molgeniscloud.org/downloads/vip/resources/GRCh37/gnomad.total.r2.1.1.sites.stripped.vcf.gz) with [indexing file](https://download.molgeniscloud.org/downloads/vip/resources/GRCh37/gnomad.total.r2.1.1.sites.stripped.vcf.gz.csi) (for GRCh38: [data](https://download.molgeniscloud.org/downloads/vip/resources/GRCh38/gnomad.genomes.v3.1.2.sites.stripped.vcf.gz) + [indexing file](https://download.molgeniscloud.org/downloads/vip/resources/GRCh38/gnomad.genomes.v3.1.2.sites.stripped.vcf.gz.csi))

 ## Install
 The CAPICE software is also provided in this repository for running CAPICE in your own environment. The following
@@ -67,13 +89,20 @@ In order to score your variants through CAPICE, you have to annotate your varian
 command:

 ```commandline
-vep --input_file <path to your input file> --format vcf --output_file <path to your output file> --vcf --compress_output gzip --force_overwrite
---sift s --polyphen s --numbers --symbol --shift_3prime 1 --allele_number --refseq --total_length
---no_stats --offline --cache --dir_cache </path/to/cache/105> --species "homo_sapiens" --assembly <GRCh37 or GRCh38> --fork 4
---dont_skip --allow_non_variant --use_given_ref --exclude_predicted --flag_pick_allele
---plugin SpliceAI,snv=<path/to/raw_scores_snv.vcf.gz>,indel=</path/to/raw_scores_indel.vcf.gz> --dir_plugins <path to your VEP plugin directory>
---custom </path/to/stripped/gnomad.vcf.gz>,gnomAD,vcf,exact,0,AF,HN
+vep --input_file <path to your input file> --format vcf --output_file <path to your output file> \
+--vcf --compress_output gzip --sift s --polyphen s --numbers --symbol \
+--shift_3prime 1 --allele_number --refseq --total_length --no_stats --offline --cache \
+--dir_cache </path/to/cache/107> --species "homo_sapiens" --assembly <GRCh37 or GRCh38> \
+--fork <n_threads> --dont_skip --allow_non_variant --use_given_ref --exclude_predicted \
+--flag_pick_allele --plugin Grantham \
+--plugin SpliceAI,snv=<path/to/spliceai_scores.masked.snv.vcf.gz>,indel=</path/to/spliceai_scores.masked.indel.vcf.gz> \
+--custom "<path/to/gnomad.total.sites.stripped.vcf.gz>,gnomAD,vcf,exact,0,AF,HN" \
+--custom "<path/to/phyloP100way.bw>,phyloP,bigwig,exact,0" \
+--dir_plugins <path to your VEP plugin directory>
 ```
+**IMPORTANT: Ensure the right files are used based on GRCH37 or GRCH38!!!**
+
+Note: Certain arguments might not be needed if training/predicting without using all possible features offered by CAPICE.

 Then you have to convert the VEP output to TSV using our own BCFTools script:
 `./scripts/convert_vep_vcf_to_tsv_capice.sh -i </path/to/vep_output.vcf.gz> -o </path/to/capice_input.tsv.gz>`
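
The updated command pairs every resource file with one assembly, which is where the warning about GRCh37 versus GRCh38 applies. As a minimal sketch (not part of the CAPICE repository), the Python helper below assembles the same argument list from the file names listed under Requirements, so the assembly is chosen once and the matching files follow from it. The names `RESOURCES` and `build_vep_command`, and the assumption that all resource files sit in a single directory, are hypothetical; the VEP flags themselves are copied from the new README command.

```python
import shlex
from pathlib import Path

# File names as listed in the README requirements, grouped per assembly.
# Keeping them in one directory per assembly is an assumption of this sketch.
RESOURCES = {
    "GRCh37": {
        "gnomad": "gnomad.total.r2.1.1.sites.stripped.vcf.gz",
        "phylop": "hg19.100way.phyloP100way.bw",
        "spliceai_snv": "spliceai_scores.masked.snv.hg19.vcf.gz",
        "spliceai_indel": "spliceai_scores.masked.indel.hg19.vcf.gz",
    },
    "GRCh38": {
        "gnomad": "gnomad.genomes.v3.1.2.sites.stripped.vcf.gz",
        "phylop": "hg38.phyloP100way.bw",
        "spliceai_snv": "spliceai_scores.masked.snv.hg38.vcf.gz",
        "spliceai_indel": "spliceai_scores.masked.indel.hg38.vcf.gz",
    },
}


def build_vep_command(input_vcf, output_vcf, assembly, resource_dir,
                      cache_dir, plugin_dir, threads=4):
    """Return the VEP argument list for one assembly (hypothetical helper)."""
    if assembly not in RESOURCES:
        raise ValueError(f"assembly must be one of {sorted(RESOURCES)}, got {assembly!r}")
    # Resolve every resource file name against the chosen directory.
    files = {key: str(Path(resource_dir) / name)
             for key, name in RESOURCES[assembly].items()}
    return [
        "vep", "--input_file", str(input_vcf), "--format", "vcf",
        "--output_file", str(output_vcf), "--vcf", "--compress_output", "gzip",
        "--sift", "s", "--polyphen", "s", "--numbers", "--symbol",
        "--shift_3prime", "1", "--allele_number", "--refseq", "--total_length",
        "--no_stats", "--offline", "--cache", "--dir_cache", str(cache_dir),
        "--species", "homo_sapiens", "--assembly", assembly,
        "--fork", str(threads), "--dont_skip", "--allow_non_variant",
        "--use_given_ref", "--exclude_predicted", "--flag_pick_allele",
        "--plugin", "Grantham",
        "--plugin",
        f"SpliceAI,snv={files['spliceai_snv']},indel={files['spliceai_indel']}",
        "--custom", f"{files['gnomad']},gnomAD,vcf,exact,0,AF,HN",
        "--custom", f"{files['phylop']},phyloP,bigwig,exact,0",
        "--dir_plugins", str(plugin_dir),
    ]


if __name__ == "__main__":
    # All paths below are placeholders.
    command = build_vep_command("input.vcf.gz", "output.vcf.gz", "GRCh38",
                                "resources/GRCh38", "vep_cache", "vep_plugins")
    print(shlex.join(command))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting concerns; `shlex.join` (Python 3.8+) is used only to print a copyable command line.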
Binary file modified resources/predict_input.tsv.gz
Binary file not shown.
4 changes: 3 additions & 1 deletion resources/train_features.json
@@ -58,5 +58,7 @@
 "SpliceAI_pred_DS_DG": null,
 "SpliceAI_pred_DS_DL": null,
 "Type": null,
-"Length": null
+"Length": null,
+"Grantham": null,
+"phyloP": null
 }
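
This hunk adds `Grantham` and `phyloP` to the feature list. The sketch below shows one way to confirm that a given features file already contains the new entries. It assumes the file is a flat JSON object whose keys are feature names, which matches the lines shown here, where every value is `null`. `load_feature_names` and `missing_features` are hypothetical helpers, not CAPICE functions.

```python
import json


def load_feature_names(json_text):
    """Return the feature names (the object keys) from a features JSON string."""
    features = json.loads(json_text)
    if not isinstance(features, dict):
        raise ValueError("expected a JSON object mapping feature names to null")
    return list(features)


def missing_features(json_text, required):
    """Return the names in `required` that are absent from the features JSON."""
    present = set(load_feature_names(json_text))
    return [name for name in required if name not in present]


# Fragment mirroring the tail of the updated file; the real file has more keys.
EXAMPLE = """{
    "SpliceAI_pred_DS_DG": null,
    "SpliceAI_pred_DS_DL": null,
    "Type": null,
    "Length": null,
    "Grantham": null,
    "phyloP": null
}"""

print(missing_features(EXAMPLE, ["Grantham", "phyloP"]))  # prints []
```

To check an actual file, read it first, for example with `Path("resources/train_features.json").read_text()`, and pass the text to `missing_features`.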
Binary file modified resources/train_input.tsv.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion src/molgenis/capice/__init__.py
@@ -1 +1 @@
-__version__ = '4.0.0-rc2'
+__version__ = '4.0.0-rc3'
Binary file modified tests/resources/breakends_vep.tsv.gz
Binary file not shown.
Binary file modified tests/resources/edge_cases_vep.tsv.gz
Binary file not shown.
Binary file modified tests/resources/symbolic_alleles_vep.tsv.gz
Binary file not shown.
Binary file modified tests/resources/xgb_booster_poc.pickle.dat
Binary file not shown.
