Merge pull request #176 from molgenis/feat/noGnomad
Remove Gnomad_HN and drop GRCh37 from the docs
dennishendriksen authored Feb 7, 2024
2 parents 63441b5 + 2f6c447 commit e65e3d2
Showing 18 changed files with 22 additions and 38 deletions.
5 changes: 3 additions & 2 deletions .travis.yml
@@ -1,8 +1,9 @@
os: linux
dist: focal
language: python
dist: jammy
language: java
python:
- '3.10'
jdk: openjdk17
cache: pip
branches:
only:
23 changes: 4 additions & 19 deletions README.md
@@ -14,24 +14,12 @@ below.
CAPICE can be used as an online service at http://molgenis.org/capice

## Requirements
The list below is a complete list. Depending on whether GRCh37 and/or GRCh38 is used and whether all
mentioned features are used, some items in the list below can be skipped.

* VEP v107
* Including VEP cache (which needs to be unarchived!):
* [homo_sapiens_refseq_vep_107_GRCh37](http://ftp.ensembl.org/pub/release-107/variation/indexed_vep_cache/homo_sapiens_refseq_vep_107_GRCh37.tar.gz)
* [homo_sapiens_refseq_vep_107_GRCh38](http://ftp.ensembl.org/pub/release-107/variation/indexed_vep_cache/homo_sapiens_refseq_vep_107_GRCh38.tar.gz)
* Including plugin(s):
* [Grantham](https://github.com/molgenis/vip/blob/master/resources/vep/plugins/Grantham.pm)
* [SpliceAI](https://github.com/molgenis/vip/blob/master/resources/vep/plugins/SpliceAI.pm)
* Including additional data (GRCh37) [available here](https://download.molgeniscloud.org/downloads/vip/resources/GRCh37/):
* `gnomad.total.r2.1.1.sites.stripped.vcf.gz`
* `gnomad.total.r2.1.1.sites.stripped.vcf.gz.csi`
* `hg19.100way.phyloP100way.bw`
* `spliceai_scores.masked.indel.hg19.vcf.gz`
* `spliceai_scores.masked.indel.hg19.vcf.gz.tbi`
* `spliceai_scores.masked.snv.hg19.vcf.gz`
* `spliceai_scores.masked.snv.hg19.vcf.gz.tbi`
* Including additional data (GRCh38) [available here](https://download.molgeniscloud.org/downloads/vip/resources/GRCh38/):
* `gnomad.genomes.v3.1.2.sites.stripped.vcf.gz`
* `gnomad.genomes.v3.1.2.sites.stripped.vcf.gz.csi`
@@ -95,15 +83,14 @@ command:
vep --input_file <path to your input file> --format vcf --output_file <path to your output file> \
--vcf --compress_output gzip --sift s --polyphen s --numbers --symbol \
--shift_3prime 1 --allele_number --refseq --total_length --no_stats --offline --cache \
--dir_cache </path/to/cache/107> --species "homo_sapiens" --assembly <GRCh37 or GRCh38> \
--dir_cache </path/to/cache/107> --species "homo_sapiens" --assembly <GRCh38> \
--fork <n_threads> --dont_skip --allow_non_variant --use_given_ref --exclude_predicted \
--flag_pick_allele --plugin Grantham \
--plugin SpliceAI,snv=<path/to/spliceai_scores.masked.snv.vcf.gz>,indel=</path/to/spliceai_scores.masked.indel.vcf.gz> \
--custom "<path/to/gnomad.total.sites.stripped.vcf.gz>,gnomAD,vcf,exact,0,AF,HN" \
--custom "<path/to/phyloP100way.bw>,phyloP,bigwig,exact,0" \
--custom "<path/to/gnomad.total.sites.stripped.vcf.gz>,gnomAD,vcf,exact,0,AF" \
--custom "<path/to/hg38.phyloP100way.bed.gz>,phyloP,bed,exact,0" \
--dir_plugins <path to your VEP plugin directory>
```
**IMPORTANT: Ensure the right files are used based on GRCH37 or GRCH38!!!**

Note: Certain arguments might not be needed if training/predicting without using all possible features offered by CAPICE.
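
As an illustration of the updated command, the following minimal Python sketch assembles the same GRCh38-only argument list so it can be reused from a script. The function name `build_vep_command` and all file paths are hypothetical placeholders; only the VEP flags are taken from the command above.

```python
import shlex


def build_vep_command(input_vcf, output_vcf, cache_dir, plugin_dir,
                      spliceai_snv, spliceai_indel, gnomad_vcf, phylop_bed,
                      n_threads=4):
    """Return the VEP argument list for a GRCh38 run (hypothetical helper)."""
    return [
        "vep",
        "--input_file", input_vcf, "--format", "vcf",
        "--output_file", output_vcf,
        "--vcf", "--compress_output", "gzip",
        "--sift", "s", "--polyphen", "s", "--numbers", "--symbol",
        "--shift_3prime", "1", "--allele_number", "--refseq",
        "--total_length", "--no_stats", "--offline", "--cache",
        "--dir_cache", cache_dir,
        "--species", "homo_sapiens", "--assembly", "GRCh38",
        "--fork", str(n_threads),
        "--dont_skip", "--allow_non_variant", "--use_given_ref",
        "--exclude_predicted", "--flag_pick_allele",
        "--plugin", "Grantham",
        "--plugin", f"SpliceAI,snv={spliceai_snv},indel={spliceai_indel}",
        # Only AF is requested from the gnomAD file; HN is no longer used.
        "--custom", f"{gnomad_vcf},gnomAD,vcf,exact,0,AF",
        "--custom", f"{phylop_bed},phyloP,bed,exact,0",
        "--dir_plugins", plugin_dir,
    ]


# All paths below are placeholders, not real locations.
command = build_vep_command(
    input_vcf="input.vcf.gz",
    output_vcf="output.vcf.gz",
    cache_dir="/path/to/cache/107",
    plugin_dir="/path/to/plugins",
    spliceai_snv="/path/to/spliceai_scores.masked.snv.vcf.gz",
    spliceai_indel="/path/to/spliceai_scores.masked.indel.vcf.gz",
    gnomad_vcf="/path/to/gnomad.total.sites.stripped.vcf.gz",
    phylop_bed="/path/to/hg38.phyloP100way.bed.gz",
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.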

@@ -333,19 +320,17 @@ To convert a `.pickle.dat` model to `.ubj`/`.json`, one can do the following (us
- `sys.argv[1]`: the input `.pickle.dat` file
- `sys.argv[2]`: the output `.json` file
4. Adjust `CAPICE_version` within the new model file to that of the new release (major version should match!).
It is advisable however to ensure that the filename still contains the version it was trained on (f.e. `v4.0.0_grch37_v5.0.0-compatibility.json`) to ensure no confusion will exist on which version it was actually trained.
It is advisable however to ensure that the filename still contains the version it was trained on (e.g. `v4.0.0_grch38_v5.0.0-compatibility.json`) to ensure no confusion will exist on which version it was actually trained.

Do note that we recommend using a model trained on a specific major version instead, as other breaking changes might be present as well!

## Data sources
### GnomAD
The gnomAD files can be generated through the following scripts (which also download the gnomAD files):
- GRCH37: https://github.com/molgenis/vip/blob/main/utils/create_gnomad_GRCh37.sh
- GRCh38: https://github.com/molgenis/vip/blob/main/utils/create_gnomad_GRCh38.sh

### PhyloP
PhyloP resources can be downloaded from:
- GRCh37: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phyloP100way/
- GRCh38: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/

### SpliceAI
Binary file modified resources/predict_input.tsv.gz
Binary file not shown.
Binary file modified resources/predict_input_raw.vcf.gz
Binary file not shown.
22 changes: 11 additions & 11 deletions resources/test_input.vcf
@@ -1,13 +1,13 @@
##fileformat=VCFv4.0
##reference=GRCh37/hg19
##reference=GRCh38
#CHROM POS ID REF ALT
12 69747417 . C A
17 41231346 . G T
2 122288533 . C A
11 118382645 . G T
5 235382 . G A
2 48026421 . T C
5 90073785 . C T
1 63114155 . T C
2 179431764 . G A
9 131250286 . G A
chr12 69747417 . C A
chr17 41231346 . G T
chr2 122288533 . C A
chr11 118382645 . G T
chr5 235382 . G A
chr2 48026421 . T C
chr5 90073785 . C T
chr1 63114155 . T C
chr2 179431764 . G A
chr9 131250286 . G A
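
The contig renaming in this file can be sketched in Python as below. This is a minimal illustration, not the script used for the commit: the helper name `add_chr_prefix` is hypothetical, records are assumed to be tab-delimited, and special cases such as the mitochondrial contig are not handled. It only renames contigs and does not convert coordinates between assemblies.

```python
def add_chr_prefix(vcf_lines):
    """Yield VCF lines with 'chr' prepended to un-prefixed contig names."""
    for line in vcf_lines:
        # Header lines and blank lines pass through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom
        yield chrom + sep + rest


records = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "12\t69747417\t.\tC\tA\n",
    "chr17\t41231346\t.\tG\tT\n",
]
renamed = list(add_chr_prefix(records))
print("".join(renamed), end="")
```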
3 changes: 1 addition & 2 deletions resources/train_features.json
@@ -17,6 +17,5 @@
"SpliceAI_pred_DS_DG": null,
"SpliceAI_pred_DS_DL": null,
"Grantham": null,
"phyloP": null,
"gnomAD_HN": null
"phyloP": null
}
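
Removing a feature from a train-features JSON file like this one could be done with a short Python sketch such as the following. The helper `drop_feature` is hypothetical, and the three-key input is a shortened stand-in for the real file.

```python
import json


def drop_feature(features_json, feature):
    """Return the features JSON text with one feature key removed."""
    features = json.loads(features_json)
    # pop with a default so a missing key is not an error
    features.pop(feature, None)
    return json.dumps(features, indent=2)


# Shortened stand-in for resources/train_features.json.
original = '{"Grantham": null, "phyloP": null, "gnomAD_HN": null}'
updated = drop_feature(original, "gnomAD_HN")
print(updated)
```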
Binary file removed resources/train_input.tsv.gz
Binary file not shown.
Binary file modified resources/train_input_raw.vcf.gz
Binary file not shown.
Binary file added resources/train_test.tsv.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion src/molgenis/capice/__init__.py
@@ -1 +1 @@
__version__ = '5.1.1'
__version__ = '5.1.2'
5 changes: 2 additions & 3 deletions tests/capice/test_main_train.py
@@ -30,7 +30,7 @@ def tearDown(self):

def setUp(self):
print('Performing test:')
train_file = os.path.join(_project_root_directory, 'resources', 'train_input.tsv.gz')
train_file = os.path.join(_project_root_directory, 'resources', 'train_test.tsv.gz')
impute_json = os.path.join(_project_root_directory,
'resources',
'train_features.json')
@@ -106,8 +106,7 @@ def test_integration_reset_train_features(self):
'is_NMD_transcript_variant', 'is_feature_elongation', 'is_feature_truncation',
'SpliceAI_pred_DP_AG', 'SpliceAI_pred_DP_AL', 'SpliceAI_pred_DP_DG',
'SpliceAI_pred_DP_DL', 'SpliceAI_pred_DS_AG', 'SpliceAI_pred_DS_AL',
'SpliceAI_pred_DS_DG', 'SpliceAI_pred_DS_DL', 'Type', 'Length', 'Grantham', 'phyloP',
'gnomAD_HN'
'SpliceAI_pred_DS_DG', 'SpliceAI_pred_DS_DL', 'Type', 'Length', 'Grantham', 'phyloP'
]
self.assertSetEqual(set(observed), set(expected))

Binary file modified tests/resources/breakends.vcf.gz
100755 → 100644
Binary file not shown.
Binary file modified tests/resources/breakends_vep.tsv.gz
Binary file not shown.
Binary file modified tests/resources/edge_cases.vcf.gz
100755 → 100644
Binary file not shown.
Binary file modified tests/resources/edge_cases_vep.tsv.gz
Binary file not shown.
Binary file modified tests/resources/symbolic_alleles.vcf.gz
100755 → 100644
Binary file not shown.
Binary file modified tests/resources/symbolic_alleles_vep.tsv.gz
Binary file not shown.
Binary file added tests/resources/xgb_booster_poc.ubj
Binary file not shown.
