Merge remote-tracking branch 'origin/main' into update-pretrained-models
meyerkm committed May 21, 2024
2 parents 0111c2b + 82cc76f commit 28efec3
Showing 28 changed files with 1,031 additions and 607 deletions.
68 changes: 5 additions & 63 deletions README.md
@@ -4,75 +4,17 @@ Rare variant association testing using deep learning and data-driven burden scores

[![Documentation Status](https://readthedocs.org/projects/deeprvat/badge/?version=latest)](https://deeprvat.readthedocs.io/en/latest/?badge=latest)

## Installation

1. Clone this repository:
```
git clone git@github.com:PMBio/deeprvat.git
```
1. Change directory to the repository: `cd deeprvat`
1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda`.

*Note: [the current DeepRVAT environment does not support CUDA when installed with conda](https://github.com/PMBio/deeprvat/issues/16); install with mamba for CUDA support.*
```
mamba env create -n deeprvat -f deeprvat_env.yaml
```
1. Activate the environment: `mamba activate deeprvat`
1. Install the `deeprvat` package: `pip install -e .`
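
Put together, a typical install session (assuming SSH access to GitHub for the clone) looks like:
```
git clone git@github.com:PMBio/deeprvat.git
cd deeprvat
mamba env create -n deeprvat -f deeprvat_env.yaml
mamba activate deeprvat
pip install -e .
```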
## Installation and usage

If you don't want to install the GPU-related requirements, use the `deeprvat_env_no_gpu.yaml` environment instead.
```
mamba env create -n deeprvat -f deeprvat_env_no_gpu.yaml
```
Please consult our [documentation](https://deeprvat.readthedocs.io/en/latest/) for installation and usage instructions.


## Basic usage
## Citation

### Customize pipelines
If you use this package, please cite:

Before running any of the snakefiles, you may want to adjust the number of threads used by different steps in the pipeline. To do this, modify the `threads:` property of a given rule.

If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles.
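
As a sketch, such an adjustment could look like the following (the rule name, file names, and values are illustrative, not taken from the actual snakefiles):
```
rule example_step:
    threads: 8                      # number of cores the step may use
    resources:
        mem_mb = 16000              # read by cluster profiles when submitting jobs
    input:
        "data/input.parquet"
    output:
        "results/output.parquet"
    shell:
        "some_command --threads {threads} {input} {output}"   # placeholder command
```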


### Run the preprocessing pipeline on VCF files

Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html)


### Annotate variants

Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html)



### Try the full training and association testing pipeline on some example data

```
mkdir example
cd example
ln -s [path_to_deeprvat]/example/* .
snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile
```

Replace `[path_to_deeprvat]` with the path to your clone of the repository.

Note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed.


### Run the association testing pipeline with pretrained models

```
mkdir example
cd example
ln -s [path_to_deeprvat]/example/* .
ln -s [path_to_deeprvat]/pretrained_models
snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile
```

Replace `[path_to_deeprvat]` with the path to your clone of the repository.

Again, note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed.
Clarke, Holtkamp et al., “Integration of Variant Annotations Using Deep Set Networks Boosts Rare Variant Association Genetics.” bioRxiv. https://dx.doi.org/10.1101/2023.07.12.548506


## Credits
30 changes: 30 additions & 0 deletions deeprvat/annotations/annotations.py
@@ -1898,6 +1898,36 @@ def process_vep(
return vep_file


@cli.command()
@click.argument("anno_df_in", type=click.Path(exists=True))
@click.argument("anno_df_out", type=click.Path())
def compute_plof(anno_df_in, anno_df_out):
    """
    Computes and adds an is_plof column flagging putative loss-of-function (pLoF) variants.

    Parameters:
    - anno_df_in (str): File path of the annotation file to read in
    - anno_df_out (str): File path of the output file

    Returns:
        None

    Example: deeprvat_annotations compute_plof annotations.parquet annotations_plof.parquet
    """
    anno_df = pd.read_parquet(anno_df_in)

    # VEP consequence flags that define putative loss-of-function (pLoF) variants
    PLOF_COLS = [
        "Consequence_stop_gained",
        "Consequence_frameshift_variant",
        "Consequence_stop_lost",
        "Consequence_start_lost",
        "Consequence_splice_acceptor_variant",
        "Consequence_splice_donor_variant",
    ]

    # A variant is pLoF if any of its consequence flags is set; store as 0/1
    anno_df["is_plof"] = anno_df[PLOF_COLS].eq(1).any(axis=1).astype(int)
    anno_df.to_parquet(anno_df_out)
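
To illustrate the flag logic on a hypothetical two-variant frame (not part of the pipeline itself):
```
import pandas as pd

# First variant: stop gained (pLoF); second variant: no pLoF consequence
toy = pd.DataFrame({
    "Consequence_stop_gained": [1, 0],
    "Consequence_frameshift_variant": [0, 0],
    "Consequence_stop_lost": [0, 0],
    "Consequence_start_lost": [0, 0],
    "Consequence_splice_acceptor_variant": [0, 0],
    "Consequence_splice_donor_variant": [0, 0],
})

# Row-wise: is any consequence flag equal to 1?
print(toy.eq(1).any(axis=1).astype(int).tolist())  # [1, 0]
```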


@cli.command()
@click.argument("filenames", type=str)
@click.argument("out_file", type=click.Path())
1 change: 1 addition & 0 deletions deeprvat/deeprvat/evaluate.py
@@ -189,6 +189,7 @@ def get_pvals(results, method_mapping=None, phenotype_mapping={}):
"gene",
"experiment_group",
"Discovery type",
"beta",
"pval",
"-log10pval",
"pval_corrected",
576 changes: 259 additions & 317 deletions docs/_static/annotations_rulegraph.svg
(SVG image diff not displayed.)
4 changes: 2 additions & 2 deletions docs/annotations.md
@@ -1,6 +1,6 @@
# DeepRVAT Annotation pipeline

This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computet using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)
This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)

![dag](_static/annotations_rulegraph.svg)

@@ -30,7 +30,7 @@ BCFtools as well as HTSlib should be installed on the machine,

should be installed for running the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
Download paths:
- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
- [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz)
17 changes: 17 additions & 0 deletions docs/cluster.md
@@ -0,0 +1,17 @@
# Cluster execution

## Pipeline resource requirements

For cluster execution, resource requirements are expected under `resources:` in all rules. All pipelines come with suggested resource requirements, but they may need to be adjusted for your data or cluster.


## Cluster execution

If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles). We have tested execution on LSF. If you run into issues running on other clusters, please [let us know](https://github.com/PMBio/deeprvat/issues).
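
For example, with a profile installed as `~/.config/snakemake/lsf` (the profile name here is illustrative), a pipeline could be launched like:
```
snakemake --profile lsf -j 100 \
    --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile
```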


## Execution on GPU vs. CPU

Two steps in the pipelines use a GPU by default: training (rule `train` from [train.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/training/train.snakefile)) and burden computation (rule `compute_burdens` from [burdens.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/association_testing/burdens.snakefile)). To run on CPU on a computing cluster, you may need to remove the line `gpus = 1` from the `resources:` of those rules.

Bear in mind that this will make burden computation substantially slower, though it remains feasible for most datasets. Training without a GPU is not practical on large datasets such as UK Biobank.
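
A simplified sketch of the relevant part of such a rule (directives and values abbreviated; see the actual snakefiles for the full rules):
```
rule compute_burdens:
    resources:
        mem_mb = 32000,     # illustrative value
        gpus = 1            # remove this line to run the step on CPU
    # input/output/script directives omitted for brevity
```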