Merge remote-tracking branch 'origin/main' into update-pretrained-models
meyerkm committed May 21, 2024
2 parents 0111c2b + 82cc76f commit 28efec3
Showing 28 changed files with 1,031 additions and 607 deletions.
68 changes: 5 additions & 63 deletions README.md
@@ -4,75 +4,17 @@ Rare variant association testing using deep learning and data-driven burden scores

[![Documentation Status](https://readthedocs.org/projects/deeprvat/badge/?version=latest)](https://deeprvat.readthedocs.io/en/latest/?badge=latest)

## Installation

1. Clone this repository:
```
git clone git@github.com:PMBio/deeprvat.git
```
1. Change directory to the repository: `cd deeprvat`
1. Install the conda environment. We recommend using [mamba](https://mamba.readthedocs.io/en/latest/index.html), though you may also replace `mamba` with `conda`.

*Note: [the current DeepRVAT environment does not support CUDA when installed with conda](https://github.com/PMBio/deeprvat/issues/16); install with mamba for CUDA support.*
```
mamba env create -n deeprvat -f deeprvat_env.yaml
```
1. Activate the environment: `mamba activate deeprvat`
1. Install the `deeprvat` package: `pip install -e .`
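
Put together, a typical install session (assuming SSH access to GitHub for the clone) looks like:
```
git clone git@github.com:PMBio/deeprvat.git
cd deeprvat
mamba env create -n deeprvat -f deeprvat_env.yaml
mamba activate deeprvat
pip install -e .
```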
## Installation and usage

If you don't want to install the GPU-related requirements, use the `deeprvat_env_no_gpu.yaml` environment instead.
```
mamba env create -n deeprvat -f deeprvat_env_no_gpu.yaml
```
Please consult our [documentation](https://deeprvat.readthedocs.io/en/latest/) for installation and usage instructions.


## Basic usage
## Citation

### Customize pipelines
If you use this package, please cite:

Before running any of the snakefiles, you may want to adjust the number of threads used by different steps in the pipeline. To do this, modify the `threads:` property of a given rule.

If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles) and may need to add `resources:` directives to the snakefiles.
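
As a sketch, such an adjustment could look like the following (the rule name, file names, and values are illustrative, not taken from the actual snakefiles):
```
rule example_step:
    threads: 8                      # number of cores the step may use
    resources:
        mem_mb = 16000              # read by cluster profiles when submitting jobs
    input:
        "data/input.parquet"
    output:
        "results/output.parquet"
    shell:
        "some_command --threads {threads} {input} {output}"   # placeholder command
```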


### Run the preprocessing pipeline on VCF files

Instructions [here](https://deeprvat.readthedocs.io/en/latest/preprocessing.html)


### Annotate variants

Instructions [here](https://deeprvat.readthedocs.io/en/latest/annotations.html)



### Try the full training and association testing pipeline on some example data

```
mkdir example
cd example
ln -s [path_to_deeprvat]/example/* .
snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile
```

Replace `[path_to_deeprvat]` with the path to your clone of the repository.

Note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed.


### Run the association testing pipeline with pretrained models

```
mkdir example
cd example
ln -s [path_to_deeprvat]/example/* .
ln -s [path_to_deeprvat]/pretrained_models
snakemake -j 1 --snakefile [path_to_deeprvat]/pipelines/association_testing_pretrained.snakefile
```

Replace `[path_to_deeprvat]` with the path to your clone of the repository.

Again, note that the example data is randomly generated, and so is only suited for testing whether the `deeprvat` package has been correctly installed.
Clarke, Holtkamp et al., “Integration of Variant Annotations Using Deep Set Networks Boosts Rare Variant Association Genetics.” bioRxiv. https://dx.doi.org/10.1101/2023.07.12.548506


## Credits
30 changes: 30 additions & 0 deletions deeprvat/annotations/annotations.py
@@ -1898,6 +1898,36 @@ def process_vep(
return vep_file


@cli.command()
@click.argument("anno_df_in", type=click.Path(exists=True))
@click.argument("anno_df_out", type=click.Path())
def compute_plof(anno_df_in, anno_df_out):
    """
    Computes and adds an is_plof column flagging putative loss-of-function (pLoF) variants.

    Parameters:
    - anno_df_in (str): File path of the annotation file to read in
    - anno_df_out (str): File path of the output file

    Returns:
        None

    Example: deeprvat_annotations compute_plof annotations.parquet annotations_plof.parquet
    """
    anno_df = pd.read_parquet(anno_df_in)

    # VEP consequence flags that define putative loss-of-function (pLoF) variants
    PLOF_COLS = [
        "Consequence_stop_gained",
        "Consequence_frameshift_variant",
        "Consequence_stop_lost",
        "Consequence_start_lost",
        "Consequence_splice_acceptor_variant",
        "Consequence_splice_donor_variant",
    ]

    # A variant is pLoF if any of its consequence flags is set; store as 0/1
    anno_df["is_plof"] = anno_df[PLOF_COLS].eq(1).any(axis=1).astype(int)
    anno_df.to_parquet(anno_df_out)
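
To illustrate the flag logic on a hypothetical two-variant frame (not part of the pipeline itself):
```
import pandas as pd

# First variant: stop gained (pLoF); second variant: no pLoF consequence
toy = pd.DataFrame({
    "Consequence_stop_gained": [1, 0],
    "Consequence_frameshift_variant": [0, 0],
    "Consequence_stop_lost": [0, 0],
    "Consequence_start_lost": [0, 0],
    "Consequence_splice_acceptor_variant": [0, 0],
    "Consequence_splice_donor_variant": [0, 0],
})

# Row-wise: is any consequence flag equal to 1?
print(toy.eq(1).any(axis=1).astype(int).tolist())  # [1, 0]
```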


@cli.command()
@click.argument("filenames", type=str)
@click.argument("out_file", type=click.Path())
1 change: 1 addition & 0 deletions deeprvat/deeprvat/evaluate.py
@@ -189,6 +189,7 @@ def get_pvals(results, method_mapping=None, phenotype_mapping={}):
"gene",
"experiment_group",
"Discovery type",
"beta",
"pval",
"-log10pval",
"pval_corrected",
576 changes: 259 additions & 317 deletions docs/_static/annotations_rulegraph.svg
(SVG image diff not displayed.)
4 changes: 2 additions & 2 deletions docs/annotations.md
@@ -1,6 +1,6 @@
# DeepRVAT Annotation pipeline

This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computet using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)
This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target), abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target)

![dag](_static/annotations_rulegraph.svg)

@@ -30,7 +30,7 @@ BCFtools as well as HTSlib should be installed on the machine,

should be installed for running the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
Download paths:
- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
- [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz)
17 changes: 17 additions & 0 deletions docs/cluster.md
@@ -0,0 +1,17 @@
# Cluster execution

## Pipeline resource requirements

For cluster execution, resource requirements are expected under `resources:` in all rules. All pipelines come with suggested resource requirements, but they may need to be adjusted for your data or cluster.


## Cluster execution

If you are running on a computing cluster, you will need a [profile](https://github.com/snakemake-profiles). We have tested execution on LSF. If you run into issues running on other clusters, please [let us know](https://github.com/PMBio/deeprvat/issues).
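
For example, with a profile installed as `~/.config/snakemake/lsf` (the profile name here is illustrative), a pipeline could be launched like:
```
snakemake --profile lsf -j 100 \
    --snakefile [path_to_deeprvat]/pipelines/training_association_testing.snakefile
```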


## Execution on GPU vs. CPU

Two steps in the pipelines use a GPU by default: training (rule `train` from [train.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/training/train.snakefile)) and burden computation (rule `compute_burdens` from [burdens.snakefile](https://github.com/PMBio/deeprvat/blob/main/pipelines/association_testing/burdens.snakefile)). To run on CPU on a computing cluster, you may need to remove the line `gpus = 1` from the `resources:` of those rules.

Bear in mind that this will make burden computation substantially slower, though it remains feasible for most datasets. Training without a GPU is not practical on large datasets such as UK Biobank.
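
A simplified sketch of the relevant part of such a rule (directives and values abbreviated; see the actual snakefiles for the full rules):
```
rule compute_burdens:
    resources:
        mem_mb = 32000,     # illustrative value
        gpus = 1            # remove this line to run the step on CPU
    # input/output/script directives omitted for brevity
```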