diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..128e07e --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,22 @@ +name: Documentation +on: + push: + branches: + - main + - '**' # matches every branch +permissions: + contents: write +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - uses: actions/setup-python@v4 + with: + python-version: 3.x + - uses: actions/cache@v2 + with: + key: ${{ github.ref }} + path: .cache + - run: pip install mkdocs-material + - run: mkdocs gh-deploy --force diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..1269488 --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +data diff --git a/README.md b/README.md index 37759de..8ba02fb 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,49 @@ - + +
+ +
+
+ +# RIVET +### SARS-CoV-2 RecombInation ViEwer and Tracker + +[license-badge]: https://img.shields.io/badge/License-MIT-yellow.svg +[license-link]: https://github.com/TurakhiaLab/rivet/blob/main/LICENSE +[![License][license-badge]][license-link] + +
+
+ View RIVET's Latest Detected Recombinants +
+
+ +
+ RIVET is a software pipeline and visual web platform to perform SARS-CoV-2 recombination inference using RIPPLES and organize the relevant information in order to greatly accelerate the process of identifying and tracking SARS-CoV-2 recombinants. + +

+ + Overview + + | + + Documentation + + | + + Getting Started + +

+
+
+
-# RIVET: SARS-CoV-2 RecombInation ViEwer and Tracker ## Table of Contents - [Overview](#overview) -- [RIVET Frontend](#rivet_frontend) - - [Viewing your own recombinants with RIVET on a local HTTP server](#rivet_local) - - [Example](#rivet_example) -- [RIVET Backend](#rivet_backend) - - [RIVET Backend Setup](#rivet_backend_setup) - - [RIVET Backend Pipeline Results](#rivet_backend_results) +- [RIVET SARS-CoV-2 Web Interface](#web) +- [Use RIVET Locally](#local) - [Citing RIVET](#cite_rivet) - ## Overview
@@ -22,138 +53,28 @@ RIVET is a program designed to aid in SARS-CoV-2 recombination analysis and cons 1. [Backend](#rivet_backend): RIVET's backend pipeline uses [RIPPLES](https://www.nature.com/articles/s41586-022-05189-9) for recombination detection in a [mutation-annotated tree](https://usher-wiki.readthedocs.io/en/latest/UShER.html) and has a subsequent automated filtration pipeline to flag potential false-positives resulting from bioinformatic, contamination or other sequencing errors. Next, the recombination results are ranked and additional results/metadata files are generated by the RIVET backend pipeline that can be loaded by the RIVET frontend. 2. [Frontend](#rivet_frontend): The RIVET frontend is an interactive, web-browser interface for online visualization, tracking, and analysis of recombination detection results. -We routinely run RIVET's backend pipeline on the [SARS-CoV-2 global MAT](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) that is publicly shared by UCSC and make these results available for visualization on https://rivet.ucsd.edu/. - -## RIVET Frontend - -### Viewing your own recombinants with RIVET on a local HTTP server - -RIVET can also be run locally to visualize SNVs of potential recombinant sequences, with the recombinant-informative sites highlighted. The following two files are minimally required to run RIVET locally:
- -- `results.txt`: a tab-separated file with one recombinant sequence per row, that must contain your recombinant node id, donor node id and acceptor node id as the first three column entires, as seen below. Additional columns after the first three can be added to this results file, which will be rendered in the UI table.
+We routinely run RIVET's backend pipeline on the latest [SARS-CoV-2 global MAT](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) that is publicly shared by UCSC and make these results available for analysis and visualization at [https://rivet.ucsd.edu/](https://rivet.ucsd.edu/). - | Recombinant Node ID | Donor Node ID | Acceptor Node ID | - | ------------------------- | ----------- | ---------------- | - | node_1156861 | node_1155169 | node_1167556 | - | node_1067629 | node_1021823 | node_1156861 |
-- `VCF` file containing SNVs for all trio nodes. Each node (recomb/donor/acceptor) in the first three columns of every row in your results tsv file, should be included in this VCF file. Please note, when constructing your `VCF` file, that currently RIVET only supports viewing SNVs, and not indels or SVs. Please see the following workflow to [create a VCF](docs/create_vcf.md) for uploading to RIVET locally. +## RIVET SARS-CoV-2 Web Interface +To support ongoing SARS-CoV-2 recombinant lineage designation and genomic surveillance efforts, we provide a web interface ([https://rivet.ucsd.edu/](https://rivet.ucsd.edu/)) to summarize the results from running the `RIVET` backend on the latest SARS-CoV-2 mutation-annotated tree. The `RIVET` web interface provides a suite of analysis and visualization tools to support rapid interpretation of detected recombinants, and provides integration with several tools such as `UShER`, `Taxonium` and `Nextstrain/Auspice`. -
- -- `config.yaml`: A config file is provided for running RIVET locally. **The default `environment` field is set to `local` and should not be changed for running RIVET as a local HTTP server.** . Additional fields in the config file are provided to the user to customize various elements of the SNV plot coloring, such as the color of nucleotide bases or the color of the highlighted recombinant-informative sites in the visual. Please feel free to change these colorings according to your own preference. +We currently plan to support weekly updates of the `RIVET` web interface with the goal of helping to support and accelerate the laborious process of SARS-CoV-2 recombinant lineage designation. -### Example -All the RIVET dependencies have been added to Conda environment setup, that can be found in the `install` directory. -Run the following commands to activate the `rivet` Conda environment. -``` -conda env create -f install/rivet_env.yml -conda activate rivet -```
-Example data files are provided under the `example` directory. -Run the following command and past the URL to your local web-browser to see the RIVET UI locally. -``` -python3 rivet-frontend.py -v example/trios_example.vcf -r example/final_recombinants_example.txt -c config.yaml -``` -Type the following help command to see these the options and their descriptions: -``` -python3 rivet-frontend.py --help -``` - - -## RIVET Backend - -The RIVET backend uses [RIPPLES](https://www.nature.com/articles/s41586-022-05189-9) for recombination detection. For more information on the RIPPLES algorithm and the automated filtration piepeline, please see: [Pandemic-Scale Phylogenomics Reveals The SARS-CoV-2 Recombination Landscape](https://doi.org/10.1038/s41586-022-05189-9) - -### RIVET Backend Setup - -Please see the following docs for setting up an account with Google Cloud Platform: [GCP Setup Docs](docs/gcp_setup.md) - -For ease of use, the entire RIVET backend pipeline, including recombinant ranking, is contained within a pre-built public docker image. - - -To launch a Docker shell, run the following two commands. -- Note: Put your GCP service account key file (obtained following the docs linked above) in the corresponding location as the command below or update the location in the command below: -``` -KEY=~/.config/gcloud/ -docker run -it -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/ -v ${KEY}:/tmp/keys/:ro mrkylesmith/ripples_pipeline:latest -``` - -This will drop you into Docker shell where you can launch a RIVET job on GCP.
- -Copy the config template into the current directory to customize for your current RIVET job. -``` -cp template/ripples.yaml . -``` -Add all RIVET runtime parameters and GCP machine configurations to your ripples.yaml file. - -```yaml -# GCP credentials -bucket_id: -project_id: -# Path to key file -# If inside Docker shell, make sure key_file matches /tmp/keys/ -key_file: - -# GCP machine and Storage Bucket config -# Number of GCP machines to use (data automatically partitioned/parallelized) -instances: -# Don't change boot disk size -boot_disk_size: 40 -machine_type: n2d-highcpu-32 - -# Format job_name/logging (follow this format of top_folder_name/logging) -# Format job_name/results (follow this format of top_folder_name/results) -# will be a unique folder name in your GCP Storage bucket -logging: -results: - -# Ripples parameters config [REQURIED] -version: ripples-fast -# Name of mutation-annotated tree -mat: public-date.all.masked.pb -# Naming for Newick tree that will be generated by pipeline -newick: tree.nwk -# Name of metadata file, corresponding to input MAT -metadata: public-date.metadata.tsv.gz -# Date in format: 2023-01-31 (year-month-day) -date: -# SARS-CoV-2 reference genome placed in top level folder of GCP Storage Bucket (keep same name: reference.fa) -reference: reference.fa - -# Ripples parameters -# Minimum number of leaves that a node should have to be considered for recombination -num_descendants: 5 - -``` - -**Note:** The following files with the same naming from the above config need to be placed in the top level directory of a GCP Storage Bucket (`bucket_id`) ahead of time: -- `mat`: Obtain a [SARS-CoV-2 global MAT](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) -- `metadata`: Obtain [metadata](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) matching the input MAT -- `raw sequence files:` Downloadable at the following links, for a given `$TREE_DATE`: - - 
`https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.$TREE_DATE/genbank.fa.xz` - - `https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/cogUk.$TREE_DATE/cog_all.fasta.xz` +**For more information on how to navigate the `RIVET` web interface, please see our documentation page here: [Web Interface Walkthrough](https://turakhialab.github.io/rivet/start/features)**
-After the `ripples.yaml` config has been completely filled out, to launch the RIVET job on GCP instances simply run the following command: -``` -python3 run.py -``` -### RIVET Backend Results -The pipeline will create a local results directory, based on the name given for the `results` field in `ripples.yaml` +## Use RIVET Locally +The `RIVET` backend and frontend components can also be installed and used locally to infer putative recombinants in your sequences and visualize the results locally in your browser. -The pipeline will automatically output the following four files within your local `results` directory: - -- `final_recombinants.txt`: a txt file containing the detected recombinants, with the recombinant node id, donor node id and acceptor node id as the first three columns in the file. The rest of the columns contain information about each detected recombinant, including clade/lineage assignments, 3SEQ M,N,K and p-values, a representative descendant (containing the fewest additional mutations wrt the recombinant node), recombinant ranking scores, and other information to be displayed by the RIVET frontend. -- `trios.vcf.gz`: VCF file containing the SNVs of each trio (recombinant and its parents) node. -- `sample_descedants.txt.gz`: a TSV file containing a mapping from each trio node id, to a set of descendant samples. -- `.taxonium.jsonl.gz`: a jsonl file used by RIVET frontend to display the recombinant node trios within the context of the global phylogeny, powered by Taxonium and Treenome. +**For more information on this workflow, please see our documentation page available here: [Use RIVET Locally](https://turakhialab.github.io/rivet/installation/installation)** +
## Citing RIVET Please cite the following papers if you found this website helpful in your research: diff --git a/docs/contributing/contributing.md b/docs/contributing/contributing.md new file mode 100644 index 0000000..645e41e --- /dev/null +++ b/docs/contributing/contributing.md @@ -0,0 +1,3 @@ +# Contributing + +## Documentation diff --git a/docs/gcp_setup.md b/docs/gcp_setup.md index ea58df8..9ba6c69 100644 --- a/docs/gcp_setup.md +++ b/docs/gcp_setup.md @@ -1,48 +1,39 @@ -## Setup your Google Cloud Platform Account +# Setup Google Cloud Platform Account ___ -1. **Setup Cloud Console:** - - If needed, please follow these instructions to open Cloud Console, create a project storage bucket (`bucket_id`) and project ID (`project_id`): [Installation and Setup](https://cloud.google.com/deployment-manager/docs/step-by-step-guide/installation-and-setup) +## Setup Cloud Console: +!!! info + Follow these instructions to open Cloud Console, create a project storage bucket (`bucket_id`) and project ID (`project_id`): [Installation and Setup](https://cloud.google.com/deployment-manager/docs/step-by-step-guide/installation-and-setup) -
-2. **Enabling APIs** +## Enabling GCP APIs - - Click the following link and under the section titled `Before you begin`, go to step 4 (assuming you already have GCP account and have signed in) and click `Enable API`. [Cloud Life Sciences](https://cloud.google.com/life-sciences/docs/how-tos/getting-started)
+ Click the following link and under the section titled `Before you begin`, go to step 4 (assuming you already have GCP account and have signed in) and click `Enable API`. [Cloud Life Sciences](https://cloud.google.com/life-sciences/docs/how-tos/getting-started)
- - Also make sure the following APIs are enabled as well: - - **Compute Engine API** - - **Cloud Logging API** - - You can enable them at the following link, under the section `Enabling an API` and clicking: [Go to APIs & Services](https://cloud.google.com/endpoints/docs/openapi/enable-api) which will take you to your Google Cloud console `APIs & Services` page if you are signed in. If you click `+Enable APIS and SERVICES` you will be able to search for and enable these APIs. +Also make sure the following APIs are enabled as well:
-
+* Compute Engine API
+* Cloud Logging API
-3. **Add a service account**: - - Click the Navagation Menu side bar on the GCP Console and go to `IAM & Admin` -> `Service Accounts`. Click `+Create Service Account`. +You can enable them at the following link, under the section `Enabling an API` and clicking: [Go to APIs & Services](https://cloud.google.com/endpoints/docs/openapi/enable-api) which will take you to your Google Cloud console `APIs & Services` page if you are signed in. If you click `+Enable APIS and SERVICES` you will be able to search for and enable these APIs.
-4. **Create and Download Keys (JSON)** - - Once you have created a service account, you need to add keys to this serivce account. - Click the Navagation Menu side bar on the web console and go to `IAM & Admin` -> `Service Accounts` and click on the active service account you just created from the previous step. - - - Click the `Keys` tab and `ADD KEY` and `Create new key`. Select `JSON` key type. A new `.json` file will automatically be downloaded from your browser. - - - Move this downloaded `.json` file to the following location (or edit the command below for the location of your choice): +## Add a service account: +Click the Navigation Menu side bar on the GCP Console and go to `IAM & Admin` -> `Service Accounts`. Click `+Create Service Account`. - ``` - ~/.config/gcloud/.json - ``` +
- - Then run the following command in your terminal to set the environment variable path to the location where you just placed your downloaded `.json` file. +## Create and Download Keys (JSON) +Once you have created a service account, you need to add keys to this service account. +
-
+Click the Navigation Menu side bar on the web console and go to `IAM & Admin` -> `Service Accounts` and click on the active service account you just created from the previous step. - **IMPORTANT NOTE: I would recommend you keep the generated name of the `` file you downloaded, and make sure the naming of all the `` match in the two commands below, and in your `ripples.yaml` configuration file under the `key_file` field, that you will setup once you enter the Docker shell.** +Click the `Keys` tab and `ADD KEY` and `Create new key`. Select `JSON` key type. A new `.json` file will automatically be downloaded from your browser. +Move this downloaded `.json` file to the following location (or edit the command below for the location of your choice): -**Then you will run the following two commands to enter the Docker shell:** -``` -KEY=~/.config/gcloud/ -docker run -it -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/ -v ${KEY}:/tmp/keys/:ro mrkylesmith/ripples_pipeline_dev:latest ``` +~/.config/gcloud/.json +``` \ No newline at end of file diff --git a/docs/images/rivet-icon.png b/docs/images/rivet-icon.png new file mode 100755 index 0000000..9ccc658 Binary files /dev/null and b/docs/images/rivet-icon.png differ diff --git a/docs/images/rivet_backend_diagram.jpg b/docs/images/rivet_backend_diagram.jpg new file mode 100644 index 0000000..6d71f11 Binary files /dev/null and b/docs/images/rivet_backend_diagram.jpg differ diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..11613b5 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,724 @@ + +# Welcome to the RIVET Wiki + +RIVET is a software pipeline and visual web platform to perform SARS-CoV-2 recombination inference using RIPPLES and organize the relevant information in order to greatly accelerate the process of identifying and tracking SARS-CoV-2 recombinants. + +
+
+ +## RIVET Architecture + +RIVET is a program designed to aid in SARS-CoV-2 recombination analysis and consists of backend and frontend components: + +
+1. [Backend](installation/upload.md): RIVET's backend pipeline uses [RIPPLES](https://www.nature.com/articles/s41586-022-05189-9) for recombination detection in a [mutation-annotated tree](https://usher-wiki.readthedocs.io/en/latest/UShER.html) and has a subsequent automated filtration pipeline to flag potential false-positives resulting from bioinformatic, contamination or other sequencing errors. Next, the recombination results are ranked and additional results/metadata files are generated by the RIVET backend pipeline that can be loaded by the RIVET frontend. + +
+2. [Frontend](installation/analyze.md): The RIVET frontend is an interactive, web-browser interface for online visualization, tracking, and analysis of recombination detection results. +
+ +![](images/rivet_backend_diagram.jpg) + +
+
+ +## Web Interface Walkthrough + +
+ +### Selecting Recombinant of Interest +Each row in the results table represents an inferred recombinant. You can **horizontally scroll** to the right to view more columns in the table, and **click** a row to select the recombinant you are interested in visualizing. + + + +For detailed information on each column of the results table, please see the [RIVET Results Table](start/table.md) page. + +### Results Table Next and Previous Buttons +Use the `next` and `previous` buttons shown below to skip to the next recombinant result (next row) and SNV visualization in the table. + + + +!!! tip + You can also use the arrow keys instead of the `next` and `previous` buttons. Use the right arrow key :arrow_forward: and left arrow key :arrow_backward: to skip to `next` and `previous` results respectively. + + +### Sort by Column +The results can be sorted by any column, by **clicking on the column title**, shown below: + +
+ +!!! note + By default, the results are ranked by the `Recombinant Ranking Score`. + + +### Search Table +The table can be searched, and the results shown will be filtered down based on the given query. For example, if you want to search for all recombinant results with `XBB` lineage classification, just type `XBB` into the search bar. + +### Search by Sample ID +A user can search for recombinant ancestry in specific samples by using the search by sample identifier feature. Click the toggle button to its active state, and then enter the sample identifier into the search bar. When the `Search by sample` toggle is active, normal table search will be disabled and all search queries should be sample identifiers. + +!!! note + Once you have entered the sample identifier into the search bar, it may take a few seconds for the table results to refresh with the results of your query. + + + + +### SNV plot +When a user clicks on a row to select a recombinant of interest, the visualization shown below will be rendered. + + + +The above visualization shows all of the single-nucleotide variant (SNV) sites in the recombinant sequence and its two parents (donor/acceptor), with respect to the reference sequence. The recombinant-informative sites are highlighted in orange where the recombinant matches the donor, and blue where the recombinant matches the acceptor. The gene region annotations are shown below the trio sequences in the bottom track. + + +### Query Descendants +For a selected recombinant ancestor node of interest, you might want to query which samples are descendants of this inferred recombinant. Simply **click** the `Recombinant` label to the left of the track to view up to 10,000 sample descendants of that particular recombinant, as seen in the screenshots below. + +You can also click the `Donor` and `Acceptor` labels to query the samples that are descendants of those particular parental nodes. + + + +
+ +The side panel will display up to 10,000 sample descendants by default, and you can **click** the `Download Descendants` button to download a `.txt` file containing all sample descendants for the selected trio node (one per line). + + + +
+ +### Taxonium View +View trio sequences (recombinant/donor/acceptor) in Taxonium/Treenome Global Phylogeny. +!!! note + + The Taxonium view feature is currently only available for public tree results. +
+ + + +
+ +The `Recombinant/Donor/Acceptor` nodes are circled in the global tree. Click the magnify button shown in the image below to zoom into the particular node of interest. + + +
+ +### View UShER Subtree +This feature will take you to the [UCSC UShER](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace) tool, where you can view the tree using [UShER](https://github.com/yatisht/usher). This feature will automatically sample 10 descendants from the recombinant node in order to view the subtree. + + + +!!! warning + + This feature will open a new tab to `UShER` and may take a few minutes to load in the new tab. + +Once finished loading, you will see the following page, where you can view the subtree by clicking `view downsampled global tree in Nextstrain`.
+ + + +
+ + + +
+ +### Recombinant Detailed Overview +To view even more detailed information about a particular recombinant of interest, click the `More Info` button in the `Overview` section. + + +
+ +**This will display the following information:** + +* Current Recombinant Lineage +* Recombinant Origin Date (as inferred by [Chronumental](https://doi.org/10.1101/2021.10.27.465994)) +* Recombinant parental lineages +* Number of sequences descendant from this recombinant +* Earliest descendant sequence +* Most recent descendant sequence +* Countries where descendant sequences have been detected +* Quality Control Checks not passing (otherwise PASS if all QC checks pass) + + + +!!! question + + If there is additional information you would like to know for a particular recombinant of interest, please make this suggestion through a [GitHub Issue](https://github.com/TurakhiaLab/rivet/issues) in our repository. + +
+ +### View Amino Acid Sites +This option shows the amino acid mutations matched with their corresponding nucleotide mutation positions. This feature uses `matUtils summary --translate`, which is built automatically into the `RIVET` backend pipeline. In short, `matUtils` provides a method to compute the correct amino acid translations at each node in the tree, which `RIVET` uses to obtain the amino acid mutations for a given recombinant ancestor node. + +For more information on this method, please see the following [matUtils documentation](https://usher-wiki.readthedocs.io/en/latest/tutorials.html#example-amino-acid-translation-workflow). + + + +
+All coding amino acid translations are annotated above each corresponding SNV position (if any). + + + +
+
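The highlighting rule described in the SNV plot section above (orange where the recombinant matches the donor, blue where it matches the acceptor) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the `RIVET` frontend code: the function name `informative_site_color` and the choice to return `None` for non-informative sites are assumptions made for this example.

```python
from typing import Optional


def informative_site_color(recombinant: str, donor: str, acceptor: str) -> Optional[str]:
    """Return the highlight colour for one SNV site, or None if the site is not informative.

    Hypothetical helper: the colour names follow the documentation text,
    everything else is an assumption for illustration.
    """
    if donor == acceptor:
        # Both parents carry the same allele, so the site cannot tell them apart.
        return None
    if recombinant == donor:
        return "orange"  # recombinant matches only the donor
    if recombinant == acceptor:
        return "blue"  # recombinant matches only the acceptor
    return None  # recombinant matches neither parent


# Example trio alleles at three sites: (recombinant, donor, acceptor)
for alleles in [("A", "A", "G"), ("T", "C", "T"), ("G", "G", "G")]:
    print(alleles, informative_site_color(*alleles))
```

Sites where both parents share the allele are left unhighlighted here, which matches the idea that only sites distinguishing the two parents are recombinant-informative.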
+ +## RIVET Results Table +Each of the sections below describes the columns of RIVET's results table of inferred recombinant ancestors. + +**Recombinant Node ID**
+ +* UShER assigned node id for inferred recombinant node + +**Donor Node ID**
+ +* UShER assigned node id for donor (recombinant parental node) + +**Acceptor Node ID**
+ +* UShER assigned node id for acceptor (recombinant parental node) + +**Breakpoint 1 Interval**
+ +* RIPPLES inferred breakpoint interval 1 + +**Breakpoint 2 Interval**
+ +* RIPPLES inferred breakpoint interval 2 + +!!! info + + For more information on the `RIPPLES` algorithm, please see: [Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape](https://www.nature.com/articles/s41586-022-05189-9) + + +**Recombinant Clade**
+ +* Recombinant clade classification as assigned by `Nextstrain` + +**Recombinant Lineage**
+ +* Recombinant lineage designation as assigned by `Pangolin` + +**Donor Clade**
+ +* Donor clade classification as assigned by `Nextstrain` + +**Donor Lineage**
+ +* Donor lineage designation as assigned by `Pangolin` + +**Acceptor Clade**
+ +* Acceptor clade classification as assigned by `Nextstrain` + +**Acceptor Lineage**
+ +* Acceptor lineage designation as assigned by `Pangolin` + +**Chronumental-inferred origin date**
+ +* Inferred first emergence of recombinant ancestor sequence using the [Chronumental](https://github.com/theosanderson/chronumental) method, which runs automatically as part of the `RIVET` pipeline. In short, `Chronumental` is an accurate and scalable time-tree estimation method that uses stochastic gradient descent to estimate lengths of time for tree branches under a probabilistic model. For more information on this method, please see the [Chronumental](https://doi.org/10.1101/2021.10.27.465994) paper. + +**Recombinant Ranking Score**
+ +* The ranking score represents a **growth score** that we compute for each inferred recombinant, which is designed to help prioritize recently emerging recombinants and recombinants with many descendant circulating sequences. +* By default, we order the main `RIVET` results table by maximum ranking score, which attempts to prioritize highest concern recombinants of interest at the top of the list.
+ + The recombinant **growth metric**, *G(R)*, for a recombinant node *R* with a set of descendants *S* is defined below: + +$$ G(R) = 2^{-m(R)} \cdot \sum_{s \in S} 2^{-m(s)} $$ + +* In the equation above, *m(R)* and *m(s)* correspond to the number of months (30-day intervals) elapsed since the recombinant node *R* was inferred to have originated and since its descendant sequence *s* was sampled, respectively. The growth score, *G(R)*, is computed for each detected recombinant *R*, and the final recombinant list is ranked by descending growth score. + +**Representative Descendant**
+ +* This selected sample is a descendant with the fewest additional mutations as compared to its recombinant ancestor. + + +**Informative Site Sequence**
+ +* The informative site sequence is a binary string of `A` and `B` for each trio sequence, where an `A` is assigned if the recombinant node allele at the site matches only the donor node allele at that site, or a `B` if the recombinant matches only the acceptor. + + +**3SEQ (M, N, K)**
+ +* 3SEQ M, N, K values used to check individual p-values in a pre-generated 3SEQ p-value table. + + +**3SEQ P-Value**
+ +!!! info + For more information on the `3SEQ` method and its use in `RIPPLES`, please see [Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm](https://academic.oup.com/mbe/article/35/1/247/4318635) and the Supplementary Section of [Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape](https://www.nature.com/articles/s41586-022-05189-9#MOESM1) + + +**Original Parsimony Score**
+ +* The original parsimony placement score on the global phylogeny. + +**Parsimony Score Improvement**
+ +* Highest parsimony score improvement relative to original parsimony score. + + +**Quality Control (QC) Flags**
+ +* This column represents quality control (QC) or filtration checks that were flagged, meaning that this inferred recombinant is not high-confidence and could represent a false-positive recombinant resulting from bioinformatic, contamination or other sequencing errors. + + +!!! info + + For a detailed description of each quality control and filtration check performed in `RIVET's` backend pipeline, see the [Quality Control and Filtration Checks](start/filtration.md) page. + +**Common sources of false positive errors in `RIVET's` pipeline include, but are not limited to:** + +* Contamination, sequencing, or assembly errors in the recombinant or parent sequences +* Missing sequences resulting in artificially long branches in the UCSC public tree +* Misalignments or phylogenetic inconsistencies + + +**Common sources of false negative errors in `RIVET's` pipeline include, but are not limited to:** + +* Too few recombination-informative sites in the recombinant +* More than two breakpoints are required to explain the recombinant +* Too few descendants of the recombinant or its parent in the UCSC public tree + + +**"Click to View" Taxonium**
+ +* When clicked, this button will open a separate tab launching the [Taxonium](https://taxonium.org/) browser in order to view the particular recombinant trio (recombinant/donor/acceptor) in the context of the global phylogeny. +In short, [Taxonium](https://elifesciences.org/articles/82392) is a visualization tool for exploring large trees. + +
+
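The growth score from the **Recombinant Ranking Score** entry above can be worked through with a small Python sketch. It follows the formula as written in the table description; the function names, the use of calendar dates as input, and the decision to count only whole 30-day intervals are assumptions made for this example and may differ from the `RIVET` backend.

```python
from datetime import date
from typing import Iterable

DAYS_PER_MONTH = 30  # the text defines a month as a 30-day interval


def months_elapsed(event: date, today: date) -> int:
    """Whole 30-day intervals between `event` and `today` (assumption: floored, never negative)."""
    return max((today - event).days, 0) // DAYS_PER_MONTH


def growth_score(origin: date, sample_dates: Iterable[date], today: date) -> float:
    """G(R) = 2^{-m(R)} * sum over descendants s of 2^{-m(s)}."""
    recency_of_origin = 2.0 ** -months_elapsed(origin, today)
    descendant_weight = sum(2.0 ** -months_elapsed(d, today) for d in sample_dates)
    return recency_of_origin * descendant_weight


# Hypothetical recombinant that originated 60 days ago with two sampled descendants.
today = date(2023, 7, 1)
score = growth_score(date(2023, 5, 2), [date(2023, 6, 1), date(2023, 7, 1)], today)
print(score)  # 0.25 * (0.5 + 1.0) = 0.375
```

Both factors shrink by half for every additional month, so a recombinant scores highly only when it emerged recently and has many recently sampled descendants.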
+ +## Quality Control and Filtration Checks +
+ +**3SeqP02**
+ +* P-value from 3-seq > 0.2. + +**russPval005**
+ +* False-discovery rate (FDR) of the parsimony improvement > 0.05. (See [Supplementary Text S3 of RIPPLES](https://www.nature.com/articles/s41586-022-05189-9#MOESM1) for details of the null model.) + +**Alt**
+ +* "Alternate": Other recombination trios with the same recombination node have more parsimony improvement, fewer possible breakpoint intervals, or better P-values. + +**cluster**
+ +* All recombination informative mutations occur within a span of 20 nucleotides. + +**redundant**
+ +* More than two of the recombination node, donor node, and acceptor node appear in those of another trio. + +**Informative_sites_clump**
+ +* More than 5 recombination-informative mutations in a 20-nucleotide span. + +**Suspicious_mutation_clump**
+ +* More than 6 mutations (or 3 near indels) in a 20-nucleotide span on any of the donor node, the acceptor node or the recombination node. + +**Too_many_mutations_near_INDELs**
+ +* Too many mutations on 100-nt spans near indels or a string of Ns. + +
+
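Several of the checks above count mutations inside a fixed-width window, for example `Informative_sites_clump` (more than 5 recombination-informative mutations in a 20-nucleotide span). A minimal sketch of such a window check is shown below. It is not the `RIVET` filtration code: the function name `has_clump` and the exact window convention (a span of 20 covers positions `p` through `p + 19`) are assumptions for illustration.

```python
from typing import Sequence


def has_clump(positions: Sequence[int], max_sites: int = 5, span: int = 20) -> bool:
    """Return True if more than `max_sites` positions fall within any `span`-nucleotide window."""
    ordered = sorted(positions)
    left = 0
    for right, pos in enumerate(ordered):
        # Shrink the window from the left until it covers fewer than `span` nucleotides.
        while pos - ordered[left] >= span:
            left += 1
        if right - left + 1 > max_sites:
            return True
    return False


# Six informative sites within 20 nt are flagged; spreading them out clears the flag.
print(has_clump([100, 102, 105, 110, 115, 119]))  # True
print(has_clump([100, 102, 105, 110, 115, 120]))  # False
```

The same helper could be reused with `max_sites=6` for a check shaped like `Suspicious_mutation_clump`, although the special handling near indels described above is not modelled here.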
+ +## Using RIVET for Other Pathogens +
+ +Below are two examples of using `RIVET's` backend pipeline to infer and visualize recombinants of other pathogens beyond SARS-CoV-2. + +!!! warning + Currently, `RIVET's` backend QC/filtration pipeline is specific to SARS-CoV-2 and will not run when using the `RIVET` backend for other pathogens. + + +### Human Respiratory Syncytial Virus (HRSV) Subgroup A + +Below is the SNV visualization resulting from inferring a putative recombinant in an `RSV` mutation-annotated tree (MAT). + + + + +Since the SNV plot for RSV includes many sites, only the region up to around position 1000 is shown in the image above. +**Please click the download button below to view the entire `RSV` SNV plot as an SVG image.** + + + + + +
+ + +### Monkeypox Virus + +Edit the following fields in the `config.yaml` file: + +Change the GenBank file from the default SARS-CoV-2 file to the corresponding GenBank file for your pathogen of interest, Monkeypox virus in this case. + +```yaml +# Pathogen Ref Seq GenBank file +ref_seq: monkeypox.gb +``` + +!!! Warning + Make sure the `environment` field is set to `local`. + +```yaml +environment: local +port: 2000 +``` +If desired, you can change the local `port` at which `RIVET` will host the local HTTP server in your browser. + +
+ +Now run the following command and RIVET will automatically open your browser to launch the frontend results table and SNV visualization. +``` +python3 rivet-frontend.py -r recombination_mpxv.2023-07-01.tsv -v mpxv.2023-07-01.vcf -c config.yaml +``` + +Below is the SNV plot we get for one of the monkeypox virus inferred recombinants. + + + +
+ +!!! check + For pathogens with larger genomes than SARS-CoV-2, you may want to change the step interval of genomic coordinate tick marks. This can be done by changing the `tick_step` field in the `RIVET` frontend `config.yaml` file. + +
+
+ +## Use RIVET Locally + + +### Installing RIVET Backend using Docker +!!! Install + Install `Docker` on your machine first. + +For ease of use, the entire `RIVET` backend pipeline, including recombinant ranking, is contained within a pre-built public Docker image. + +### Running RIVET Backend Locally On Your Machine +A `RIVET` backend job can be run locally on your machine. +To launch a Docker shell, run the following command. +``` +docker run -it mrkylesmith/ripples_pipeline:latest +``` +This will run an interactive `Docker` shell with the necessary `RIVET` environment. + +
+ +Type the following command to ensure your `RIVET` backend environment is configured correctly, and then proceed to the next steps for running a `RIVET` backend job: [Inferring Recombinants Using the RIVET Backend](installation/upload.md) + +``` +python3 rivet-backend.py --help +``` + +
+
+ +### Running RIVET Backend On Google Cloud +We also provide the built-in option of running a parallelized `RIVET` job across a user-specified number of Google Cloud Platform (GCP) machines. + +!!! setup + If you would like to use GCP, please see the following docs for setting up an account with Google Cloud Platform: [GCP Setup Docs](gcp_setup.md) + +!!! important + Put your GCP service account key file (obtained following the docs linked above) at the location used in the command below, or update the location in the command to match: + +To launch a Docker shell using GCP, run the following two commands, providing your GCP authentication key file. + +``` +KEY=~/.config/gcloud/ +docker run -it -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/ -v ${KEY}:/tmp/keys/:ro mrkylesmith/ripples_pipeline:latest +``` + +
+
+ +### Install RIVET Frontend Locally On Your Machine + +### Clone RIVET Repo Locally +``` +git clone https://github.com/TurakhiaLab/rivet.git +cd rivet +``` + +### Conda Install +!!! Install + Install `Conda` on your machine first. + +All `RIVET` frontend dependencies are listed in a Conda environment file, which can be found in the `install` directory. + +
+ +Run the following commands to create and activate the `rivet` Conda environment. +``` +conda env create -f install/rivet_env.yml +conda activate rivet +``` +
+ +Type the following command to ensure your `RIVET` frontend environment is configured correctly, and then proceed to the next steps for using the `RIVET` frontend: [Visualizing Your Results Using the RIVET Frontend](installation/analyze.md) + +``` +python3 rivet-frontend.py --help +``` + +
+
+ +## Inferring Recombinants Using the RIVET Backend + +Infer recombinant ancestry in your own SARS-CoV-2 sequences using `RIVET's` backend. + +!!! Installation + Make sure `RIVET` is installed on your local machine before proceeding. + +
+ +### RIVET Backend + +The `RIVET` backend uses [RIPPLES](https://www.nature.com/articles/s41586-022-05189-9) for SARS-CoV-2 recombination detection. For more information on the `RIPPLES` algorithm please see: [Pandemic-Scale Phylogenomics Reveals The SARS-CoV-2 Recombination Landscape](https://doi.org/10.1038/s41586-022-05189-9) + + + +### RIVET Backend Architecture +A. **RIPPLES Job Orchestrator** +When running a `RIVET` job on Google Cloud Platform (GCP), `RIVET` calculates the number of long branches in the input mutation-annotated tree and partitions them across `n` GCP instances, where `n` is a parameter specified by the user. This stage of the pipeline is responsible for setting up and launching these parallel jobs, as well as monitoring their progress as they run. This stage of the pipeline also initiates a [Chronumental](https://github.com/theosanderson/chronumental) job, to run concurrently as a subprocess on the local machine, which is explained in the following part B. + + +B. **Infer MAT nodes ancestral dates** +In order to infer the emergence of detected ancestral recombinant nodes of interest for ranking and epidemiological prioritization, `RIVET` builds a time-tree using the [Chronumental](https://www.biorxiv.org/content/10.1101/2021.10.27.465994v1) method. This method uses the sample dates provided in the sequence metadata file to build a probabilistic +model for length of time across branches in the tree and is able to infer the dates of all internal nodes in the tree. `RIVET` uses these dates for internal nodes that we label as recombinants. + +C. **Multi-node GCP Workflow** +When running a `RIVET` job on GCP, the `RIPPLES` recombinant search and subsequent filtration pipeline utilizes multi-node parallelism. The degree of speedup depends on how many GCP instances the user decides to allocate towards the job, since the `MAT` long branches to search will be automatically partitioned across the given `n` machines. 
On each instance, once a putative list of recombinant nodes is obtained, the pipeline on that machine begins quality control and filtration checks to flag false-positive recombinants. + +D. **Post-filtration Aggregator and Ranking** +This is the last stage of the pipeline and it occurs on your local machine, for both on-premise and GCP `RIVET` workflows. Once the recombination search and filtration steps of the pipeline have concluded on **all** instances and the local `Chronumental` job has finished, the filtered recombinant results for each partition of long branches are aggregated locally and the post-filtration stage of the pipeline can begin. During this last step, the final list of recombinants is ranked according to a [growth metric](start/table.md#recombinant-ranking-score) and additional information on each recombinant is gathered, such as clade/lineage information, descendant samples, parsimony scores, quality control/filtration information, and more. For a full list of all information reported about each putative recombinant, please see our documentation about the [RIVET Results Table](start/table.md) + + +### RIVET Backend Input + +!!! warning + All input files should be placed in the current directory where you will launch your `RIVET` workflow. + + **If using GCP:** The following input files, named exactly as you specify in the `config.yaml` file below, need to be placed in your GCP Storage Bucket (`bucket_id`) before launching the remote `RIVET` job. + +1. `UShER Mutation-Annotated Tree (MAT)`: Updated daily and can be obtained here: [SARS-CoV-2 global MAT](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) + - `RIVET` performs recombination search using `RIPPLES` over an UShER mutation-annotated tree (`MAT`). Any samples you wish to search for recombinant ancestry must first be added to the `MAT` using [UShER](https://github.com/yatisht/usher/tree/master). +
+ +2. `Sequence Metadata`: Also updated daily to match the sequences in the corresponding MAT and can be obtained here: [metadata](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) + - The sequence metadata is a `TSV` file containing information about each sample in the `MAT`, including its name, date sequenced, country sequenced, and clade/lineage information. This information is used throughout the `RIVET` backend pipeline, for inferring the recombinant ancestor emergence date for example. +
+3. `Sequence Files (FASTA):` Downloadable at the following links, for a given `$TREE_DATE` (e.g. 2022-07-04) + * `https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.$TREE_DATE/genbank.fa.xz` + * `https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/cogUk.$TREE_DATE/cog_all.fasta.xz` +
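The two sequence-file links above differ only in the `$TREE_DATE` component, so they can be generated from a date string. A minimal sketch, assuming the URL layout shown above stays unchanged; the helper name `sequence_urls` is hypothetical and not part of `RIVET`:

```python
from datetime import date

BASE_URL = "https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo"


def sequence_urls(tree_date):
    """Build the GenBank and COG-UK FASTA URLs for a year-month-day tree date.

    Raises ValueError if `tree_date` is not a valid calendar date.
    """
    day = date.fromisoformat(tree_date).isoformat()
    return [
        f"{BASE_URL}/ncbi.{day}/genbank.fa.xz",
        f"{BASE_URL}/cogUk.{day}/cog_all.fasta.xz",
    ]
```

The resulting URLs can then be passed to `wget` or any other downloader.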
+ +!!! info + During the `RIVET` backend quality control and filtration pipeline these sample sequence files are aligned to the SARS-CoV-2 reference and the `RIPPLES` inferred recombination-informative sites are inspected for bioinformatic and sequencing error quality issues to flag false-positive recombinants. + +!!! example + To download the SARS-CoV-2 `GenBank` sequences for `2022-07-04`: + ``` + wget https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.2022-07-04/genbank.fa.xz + ``` + + + +### Launch RIVET Job + + +The `RIVET` backend is set up to run **locally on your own machine** or on **Google Cloud Platform (GCP)**, and for ease of use is configured entirely through the `config.yaml` file. + +!!! Setup + If you would like to run your RIVET backend job on Google Cloud Platform, please see the following documentation for setting up an account: [GCP Setup Docs](gcp_setup.md) + +Copy the config file from `template/config.yaml` into the current directory and fill out the fields. More information on each field can be found below. +``` +cp template/config.yaml . +``` + +```yaml +# GCP Credentials [LEAVE EMPTY FOR LOCAL JOB] +bucket_id: +project_id: +key_file: /tmp/keys/ + +# GCP Machine and Storage Bucket Config [LEAVE EMPTY FOR LOCAL JOB] +instances: +boot_disk_size: 50 +machine_type: + +# Ripples Parameters Config [REQUIRED] +version: ripples-fast +mat: +newick: +metadata: +date: +# Local results output directory, or name of folder on GCP storage bucket +results: +reference: reference.fa + +# Additional Parameters +num_descendants: 5 +public_tree: True +verbose: False +# Default to all available threads if left empty +threads: +docker_image: mrkylesmith/ripples_pipeline:latest +generate_taxonium: False +``` + +Fill out the configuration file with the settings for your `RIVET` job. If a field is already filled in, you will likely not want to change that parameter value. + +!!! 
info Configuration File + For more information on each field in the `config.yaml` file please see the following page: [RIVET Backend Configuration](installation/config.md) + + +### RIVET Backend Outputs + +The pipeline will create a local results directory, based on the name given for the `results` field in `config.yaml` + +The pipeline will automatically output the following four files within your local `results` directory (and in `GCP` bucket if running remote job): + +1. `final_recombinants_.txt`: a `TSV` file containing the detected recombinants, with the recombinant node id, donor node id and acceptor node id as the first three columns in the file. The rest of the columns contain information about each detected recombinant, including clade/lineage assignments, 3SEQ M,N,K and p-values, a representative descendant (containing the fewest additional mutations with respect to the recombinant node), recombinant ranking scores, and other information to be displayed by the RIVET frontend. For more information on this file, please see the [RIVET Results Table](start/table.md) page. + +
+ +2. `trios.vcf`: VCF file containing the SNVs of each trio (recombinant and its parents) node. + +
+ +3. `sample_descedants.txt.xz`: a `TSV` file where each row contains a mapping from each trio node id (one node id per row), to a set of descendant samples corresponding to that internal node id. + +
+ +4. `.taxonium.jsonl.gz`: a jsonl file used by RIVET frontend to display the recombinant node trios within the context of the global phylogeny, powered by Taxonium and Treenome. + +
+ +!!! note + Currently the `Taxonium` view is only provided using public trees provided at: [https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) + +
+
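Since the first three columns of the final recombinants file hold the recombinant, donor, and acceptor node IDs, the trios can be pulled out with a few lines of Python. This is a minimal sketch: the helper name `read_trios` is hypothetical, and whether the file begins with a header row is an assumption exposed through the `has_header` parameter.

```python
import csv


def read_trios(path, has_header=True):
    """Return (recombinant, donor, acceptor) node-ID tuples from a TSV file."""
    trios = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if len(row) < 3:
                raise ValueError(f"expected at least 3 columns, got {len(row)}")
            trios.append(tuple(row[:3]))
    return trios
```

Any additional columns are ignored here, which matches the idea that only the first three columns identify a trio.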
+ +## Visualizing Your Results Using the RIVET Frontend +
+ +!!! install + Make sure you have installed the `RIVET` frontend on your machine before proceeding. + +!!! note + If you are using the `RIVET` frontend to visualize recombinants for pathogens other than SARS-CoV-2, please see the **Using RIVET for Other Pathogens** page. + +### Configuration + +`RIVET's` frontend settings can be configured using the provided YAML file, `config.yaml`. + +```yaml +# Configuration file for RIVET +### Color Schema Options #### +# Base coloring +a: '#cc0000' +g: '#cc7722' +c: '#57026f' +t: '#338333' +base_matching_reference: '#dadada' +reference_track: '#333333' + +# Recombinant-Informative Coloring for polygons/position column labels +recomb_match_acceptor: '#2879C0' +recomb_match_donor: '#F9521E' +non_informative_site: '#dadada' + +# Breakpoint Intervals +breakpoint_intervals: '#800000' + +# Genomic Coordinate Track (default all genomic regions are same color) +genomic_regions: '#33333' +# Step for tick-marks on genomic coordinate track +tick_step: 1000 + +# Pathogen +ref_seq: NC_045512.gb + +### Taxonium Tree View Options ### +date: 2023-01-31 +bucket_name: public_trees + +# Keep environment as "local" +environment: local +# If running locally, port to use +port: 2000 +``` + +!!! warning + When running `RIVET` locally, don't change the `environment` field. Also, it won't be necessary to change the `date` field or `bucket_name` field. + + +Run the following command to launch the `RIVET` frontend in your local browser. + +!!! example + Try the following example using example SARS-CoV-2 recombinants provided in the `example/` directory. + +```python +python3 rivet-frontend.py -r example/final_recombinants_example.txt -v example/trios_example.vcf -c config.yaml +``` +
+ +### Required Inputs +`-r, RECOMBINANT_RESULTS`: Input text file containing inferred recombinant nodes. The first three tab-separated columns of this file must contain (1) the recombinant node ID, (2) the donor node ID, and (3) the acceptor node ID. Note that donor and acceptor denote the two parental nodes of the inferred recombinant. + +**Expected format:** + +| Recombinant Node ID | Donor Node ID | Acceptor Node ID | +| ------------------------- | ----------- | ---------------- | +| node_1156861 | node_1155169 | node_1167556 | +| node_1067629 | node_1021823 | node_1156861 | + +Additional columns can be provided optionally and will be included in the rendered results table, but are **not required**. + + +!!! note + The `RIVET` backend will automatically generate the necessary input files above. Follow the steps listed on the [Inferring Recombinants Using the RIVET Backend](installation/upload.md) page. However, the `RIVET` frontend can also be used independently of the backend; just ensure that the input files adhere to the expected formatting. + +
+ +`-v, VCF`: An input VCF containing single-nucleotide variants (SNVs) of all recombinant/donor/acceptor trio nodes present in the input `RECOMBINANT_RESULTS` file. + +!!! note + RIVET only supports viewing single-nucleotide variants (SNVs), and not indels or SVs. Please see the following workflow to [create a VCF](create_vcf.md) for uploading to RIVET locally. + +
+ +`-c, CONFIG`: The `config.yaml` file, shown at the top of this page, and provided in the repository. \ No newline at end of file diff --git a/docs/installation/analyze.md b/docs/installation/analyze.md new file mode 100644 index 0000000..ab26a3b --- /dev/null +++ b/docs/installation/analyze.md @@ -0,0 +1,87 @@ +# Visualizing Your Results Using the RIVET Frontend + +!!! install + Make sure you have installed the `RIVET` frontend on your machine before proceeding. The installation steps can be found here: [Install RIVET On Your Machine](installation.md#conda-install) + +!!! note + If you are using the `RIVET` frontend to visualize recombinants for pathogens other than SARS-CoV-2, please see the [Using RIVET for Other Pathogens](../start/future.md) page. + +## Configuration + +`RIVET's` frontend settings can be configured using the provided YAML file, `config.yaml`. + +```yaml +# Configuration file for RIVET +### Color Schema Options #### +# Base coloring +a: '#cc0000' +g: '#cc7722' +c: '#57026f' +t: '#338333' +base_matching_reference: '#dadada' +reference_track: '#333333' + +# Recombinant-Informative Coloring for polygons/position column labels +recomb_match_acceptor: '#2879C0' +recomb_match_donor: '#F9521E' +non_informative_site: '#dadada' + +# Breakpoint Intervals +breakpoint_intervals: '#800000' + +# Genomic Coordinate Track (default all genomic regions are same color) +genomic_regions: '#33333' +# Step for tick-marks on genomic coordinate track +tick_step: 1000 + +# Pathogen +ref_seq: NC_045512.gb + +### Taxonium Tree View Options ### +date: 2023-01-31 +bucket_name: public_trees + +# Keep environment as "local" +environment: local +# If running locally, port to use +port: 2000 +``` + +!!! warning + When running `RIVET` locally, don't change the `environment` field. Also, it won't be necessary to change the `date` field or `bucket_name` field. + + +Run the following command to launch the `RIVET` frontend in your local browser. + +!!! 
example + Try the following example using example SARS-CoV-2 recombinants provided in the `example/` directory. + +```python +python3 rivet-frontend.py -r example/final_recombinants_example.txt -v example/trios_example.vcf -c config.yaml +``` +## Required Inputs +`-r, RECOMBINANT_RESULTS`: Input text file containing inferred recombinant nodes. The first three tab-separated columns of this file must contain (1) the recombinant node ID, (2) the donor node ID, and (3) the acceptor node ID. Note that donor and acceptor denote the two parental nodes of the inferred recombinant. + +**Expected format:** + +| Recombinant Node ID | Donor Node ID | Acceptor Node ID | +| ------------------------- | ----------- | ---------------- | +| node_1156861 | node_1155169 | node_1167556 | +| node_1067629 | node_1021823 | node_1156861 | + +Additional columns can be provided optionally and will be included in the rendered results table, but are **not required**. + + +!!! note + The `RIVET` backend will automatically generate the necessary input files above. Follow the steps listed on the [Inferring Recombinants Using the RIVET Backend](upload.md) page. However, the `RIVET` frontend can also be used independently of the backend; just ensure that the input files adhere to the expected formatting. + +
+ +`-v, VCF`: An input VCF containing single-nucleotide variants (SNVs) of all recombinant/donor/acceptor trio nodes present in the input `RECOMBINANT_RESULTS` file. + +!!! note + RIVET only supports viewing single-nucleotide variants (SNVs), and not indels or SVs. Please see the following workflow to [create a VCF](../create_vcf.md) for uploading to RIVET locally. + +
+ +`-c, CONFIG`: The `config.yaml` file, shown at the top of this page, and provided in the repository. \ No newline at end of file diff --git a/docs/installation/config.md b/docs/installation/config.md new file mode 100644 index 0000000..ead961a --- /dev/null +++ b/docs/installation/config.md @@ -0,0 +1,102 @@ +# RIVET Backend Configuration File + +Below you will find explanations for each field in the following `config.yaml` file. + +```yaml +# GCP Credentials [LEAVE EMPTY FOR LOCAL JOB] +bucket_id: +project_id: +key_file: /tmp/keys/ + +# GCP Machine and Storage Bucket Config [LEAVE EMPTY FOR LOCAL JOB] +instances: +boot_disk_size: 50 +machine_type: + +# Ripples Parameters Config [REQUIRED] +version: ripples-fast +mat: +newick: +metadata: +date: +# Local results output directory, or name of folder on GCP storage bucket +results: +reference: reference.fa + +# Additional Parameters +num_descendants: 5 +public_tree: True +verbose: False +# Default to all available threads if left empty +threads: +docker_image: mrkylesmith/ripples_pipeline:latest +generate_taxonium: False +``` + +### RIVET GCP Job Parameters +!!! Warning + If you are running your `RIVET` backend job on GCP, you must fill out all of the fields in this subsection. Otherwise, if you are running your `RIVET` job locally on your machine, just leave these fields blank. + +* `bucket_id`: The name of the GCP Storage Bucket where `RIVET` will find your pipeline inputs, and write the outputs of the pipeline. +
+ +* `project_id`: The name of your GCP project, where your Storage Bucket can be found. +
+ +* `key_file`: Path to a GCP authentication key `JSON` file that gives `RIVET` the necessary permissions to access your GCP account and storage bucket. +
+ +* `instances`: The number of GCP instances (machines) to parallelize your `RIVET` job across. RIVET will automatically partition the long branches in the given `MAT` across the `n` instances given by this field, then search for recombination events and perform filtration checks in parallel on those `n` machines. +
+ +* `boot_disk_size: 50` This field should be left as `50`, and pertains only to GCP machines. +
+ +* `machine_type: n2d-highcpu-32` The type of GCP machine to use for the `RIVET` job. We recommend leaving this field as `n2d-highcpu-32`, since `RIVET` is optimized to take advantage of GCP compute-optimized instances, but this field can be changed if desired. The list of available machines can be found at the following page: [Machine families resource and comparison guide](https://cloud.google.com/compute/docs/machine-resource) +
+ +!!! info + For more information on GCP account setup, including obtaining the necessary `key_file`, please see the [GCP Setup Docs](../gcp_setup.md) +
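To make the effect of the `instances` field concrete, the sketch below splits a number of long branches into `n` contiguous, near-equal index ranges. This is only an illustration under stated assumptions: the exact partitioning scheme `RIVET` uses is not described here, and the function name `partition_branches` is hypothetical.

```python
def partition_branches(num_long_branches, instances):
    """Split branch indices 0..num_long_branches-1 into `instances`
    contiguous half-open (start, end) ranges of near-equal size."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    base, extra = divmod(num_long_branches, instances)
    ranges = []
    start = 0
    for i in range(instances):
        # The first `extra` partitions each take one additional branch.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, 10 long branches over 3 instances would give the ranges `(0, 4)`, `(4, 7)`, and `(7, 10)` under this scheme.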
+ +### RIVET Specific Parameters + +* `version: ripples-fast` **Do not change this field**. We recommend using `ripples-fast`, which is a new implementation of the `RIPPLES` algorithm that produces identical results with considerable speedup. +
+ +* `mat`: The mutation-annotated tree (MAT) input phylogeny generated by [UShER](https://github.com/yatisht/usher) to search for recombination. A daily-updated database of SARS-CoV-2 mutation-annotated trees has been made available through [matUtils](https://academic.oup.com/mbe/article/38/12/5819/6361626) and can be found here: [https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/). +
+ +* `newick`: The name of the Newick tree file that will be used by the `RIVET` backend pipeline. Could be named `_tree.nwk` for example. No actual input file is required for this field, just provide the name of the file, and `RIVET` will convert to the Newick file format internally. +
+ +* `metadata`: Provide the name of the sequence metadata file you obtained here: [metadata](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/). This is a `TSV` file containing information about each sample in the `MAT`, including its name, date sequenced, country sequenced, and clade/lineage information. This information is used throughout the `RIVET` backend pipeline, for inferring the recombinant ancestor emergence date for example. +
+ +* `date`: The date corresponding to the input `MAT` and metadata files used, in year-month-day format, e.g. `2023-06-01`. +
+ +* `results`: The name of the directory to write all `RIVET` output files to, both locally and in the GCP storage bucket if running a remote job. +
+ +* `reference: reference.fa` The name of the SARS-CoV-2 reference file, which will be automatically downloaded by the `RIVET` pipeline. **For SARS-CoV-2 recombination inference, we recommend not changing this field.** +
+ +* `num_descendants: 5` The minimum number of leaves that a node should have to be considered for recombination. +
+ +* `public_tree: True` This field should be set to `True` if the `MAT` was obtained at the following link: [https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/). +
+ +* `verbose: False` If set to `False`, most standard output will be written to log files instead of being printed to the console during pipeline execution. +
+ +* `threads`: As many stages of `RIVET` and `RIPPLES` are multithreaded, this field sets the number of threads to use when running `RIVET` locally. **If this field is left blank, the number of threads will automatically equal the number of available cores on the machine.** +
+ +* `docker_image: mrkylesmith/ripples_pipeline:latest` The public Docker image for `RIVET` that will be used when executing the pipeline on GCP. **Do not change this field.** +
+ +* `generate_taxonium: False` When set to `True`, `RIVET` will generate a [Taxonium](https://taxonium.org/) `jsonl` file that can be loaded into the [Taxonium](https://taxonium.org/) web interface or desktop app to view the global phylogeny for the given input `MAT`. + +
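The field rules above (required `RIPPLES` parameters, GCP fields either all filled or all blank) can be checked before launching a job. The following is a minimal sketch, not part of `RIVET`: the names `parse_flat_config`, `check_config`, `REQUIRED_FIELDS`, and `GCP_FIELDS` are hypothetical, the parser handles only the flat `key: value` layout shown above, and a real script would more likely use a YAML library. `key_file` and `boot_disk_size` are left out of the GCP check on the assumption that the template pre-fills them.

```python
REQUIRED_FIELDS = ("version", "mat", "newick", "metadata", "date", "results", "reference")
GCP_FIELDS = ("bucket_id", "project_id", "instances", "machine_type")


def parse_flat_config(text):
    """Parse flat `key: value` lines, ignoring `#` comments and blank lines."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once so values may contain ':'
        config[key.strip()] = value.strip()
    return config


def check_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not config.get(name)
    ]
    filled = [name for name in GCP_FIELDS if config.get(name)]
    if filled and len(filled) < len(GCP_FIELDS):
        missing = [name for name in GCP_FIELDS if name not in filled]
        problems.append("incomplete GCP settings, missing: " + ", ".join(missing))
    return problems
```

A typical use would be `check_config(parse_flat_config(open("config.yaml").read()))`, printing any returned problems before starting the backend.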
\ No newline at end of file diff --git a/docs/installation/images/rivet-backend.png b/docs/installation/images/rivet-backend.png new file mode 100644 index 0000000..296f4e1 Binary files /dev/null and b/docs/installation/images/rivet-backend.png differ diff --git a/docs/installation/installation.md b/docs/installation/installation.md new file mode 100644 index 0000000..98861e5 --- /dev/null +++ b/docs/installation/installation.md @@ -0,0 +1,75 @@ +# Install RIVET On Your Machine + +## Installing RIVET Backend using Docker +!!! Install + Install `Docker` on your machine first. + +For ease of use, the entire `RIVET` backend pipeline, including recombinant ranking, is contained within a pre-built public Docker image. + +### Running RIVET Backend Locally On Your Machine +A `RIVET` backend job can be run locally on your machine. +To launch a Docker shell, run the following command. +``` +docker run -it mrkylesmith/ripples_pipeline:latest +``` +This will run an interactive `Docker` shell with the necessary `RIVET` environment. + +
+ +Type the following command to ensure your `RIVET` backend environment is configured correctly, and then proceed to the next steps for running a `RIVET` backend job: [Inferring Recombinants Using the RIVET Backend](installation/upload.md) + +``` +python3 rivet-backend.py --help +``` + +
+
+ + +### Running RIVET Backend On Google Cloud +We also provide the built-in option of running a parallelized `RIVET` job across a user-specified number of Google Cloud Platform (GCP) machines. + +!!! setup + If you would like to use GCP, please see the following docs for setting up an account with Google Cloud Platform: [GCP Setup Docs](../gcp_setup.md) + +!!! important + Put your GCP service account key file (obtained following the docs linked above) at the location used in the command below, or update the location in the command to match: + +To launch a Docker shell using GCP, run the following two commands, providing your GCP authentication key file. + +``` +KEY=~/.config/gcloud/ +docker run -it -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/ -v ${KEY}:/tmp/keys/:ro mrkylesmith/ripples_pipeline:latest +``` + +
+
+ +## Install RIVET Frontend Locally On Your Machine + +### Clone RIVET Repo Locally +``` +git clone https://github.com/TurakhiaLab/rivet.git +cd rivet +``` + +### Conda Install +!!! Install + Install `Conda` on your machine first. + +All `RIVET` frontend dependencies are listed in a Conda environment file, which can be found in the `install` directory. + +
+ +Run the following commands to create and activate the `rivet` Conda environment. +``` +conda env create -f install/rivet_env.yml +conda activate rivet +``` +
+ +Type the following command to ensure your `RIVET` frontend environment is configured correctly, and then proceed to the next steps for using the `RIVET` frontend: [Visualizing Your Results Using the RIVET Frontend](analyze.md) + +``` +python3 rivet-frontend.py --help +``` \ No newline at end of file diff --git a/docs/installation/upload.md b/docs/installation/upload.md new file mode 100644 index 0000000..43e0b85 --- /dev/null +++ b/docs/installation/upload.md @@ -0,0 +1,137 @@ +# Inferring Recombinants Using the RIVET Backend + +Infer recombinant ancestry in your own SARS-CoV-2 sequences using `RIVET's` backend. + +!!! Installation + Make sure `RIVET` is installed on your local machine before proceeding, otherwise install `RIVET` first by following these steps: [Install RIVET On Your Machine](installation.md) + +
+ +## RIVET Backend + +The `RIVET` backend uses [RIPPLES](https://www.nature.com/articles/s41586-022-05189-9) for SARS-CoV-2 recombination detection. For more information on the `RIPPLES` algorithm please see: [Pandemic-Scale Phylogenomics Reveals The SARS-CoV-2 Recombination Landscape](https://doi.org/10.1038/s41586-022-05189-9) + + + +### RIVET Architecture +A. **RIPPLES Job Orchestrator** +When running a `RIVET` job on Google Cloud Platform (GCP), `RIVET` calculates the number of long branches in the input mutation-annotated tree and partitions them across `n` GCP instances, where `n` is a parameter specified by the user. This stage of the pipeline is responsible for setting up and launching these parallel jobs, as well as monitoring their progress as they run. This stage of the pipeline also initiates a [Chronumental](https://github.com/theosanderson/chronumental) job, to run concurrently as a subprocess on the local machine, which is explained in the following part B. + + +B. **Infer MAT nodes ancestral dates** +In order to infer the emergence of detected ancestral recombinant nodes of interest for ranking and epidemiological prioritization, `RIVET` builds a time-tree using the [Chronumental](https://www.biorxiv.org/content/10.1101/2021.10.27.465994v1) method. This method uses the sample dates provided in the sequence metadata file to build a probabilistic +model for length of time across branches in the tree and is able to infer the dates of all internal nodes in the tree. `RIVET` uses these dates for internal nodes that we label as recombinants. + +C. **Multi-node GCP Workflow** +When running a `RIVET` job on GCP, the `RIPPLES` recombinant search and subsequent filtration pipeline utilizes multi-node parallelism. The degree of speedup depends on how many GCP instances the user decides to allocate towards the job, since the `MAT` long branches to search will be automatically partitioned across the given `n` machines. 
On each instance, once a putative list of recombinant nodes is obtained, the pipeline on that machine begins quality control and filtration checks to flag false-positive recombinants. + +D. **Post-filtration Aggregator and Ranking** +This is the last stage of the pipeline and it occurs on your local machine, for both on-premise and GCP `RIVET` workflows. Once the recombination search and filtration steps of the pipeline have concluded on **all** instances and the local `Chronumental` job has finished, the filtered recombinant results for each partition of long branches are aggregated locally and the post-filtration stage of the pipeline can begin. During this last step, the final list of recombinants is ranked according to a [growth metric](https://turakhialab.github.io/rivet/start/table.html#recombinant-ranking-score) and additional information on each recombinant is gathered, such as clade/lineage information, descendant samples, parsimony scores, quality control/filtration information, and more. For a full list of all information reported about each putative recombinant, please see our documentation about the [RIVET Results Table](https://turakhialab.github.io/rivet/start/table.html). + + +## RIVET Backend Input + +!!! warning + All input files should be placed in the current directory where you will launch your `RIVET` workflow. + + **If using GCP:** The following input files, named exactly as you specify in the `config.yaml` file below, need to be placed in your GCP Storage Bucket (`bucket_id`) before launching the remote `RIVET` job. + +1. `UShER Mutation-Annotated Tree (MAT)`: Updated daily and can be obtained here: [SARS-CoV-2 global MAT](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) + - `RIVET` performs recombination search using `RIPPLES` over an UShER mutation-annotated tree (`MAT`). 
Any samples you wish to search for recombinant ancestry must first be added to the `MAT` using [UShER](https://github.com/yatisht/usher/tree/master). +
+ +2. `Sequence Metadata`: Also updated daily to match the sequences in the corresponding MAT and can be obtained here: [metadata](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) + - The sequence metadata is a `TSV` file containing information about each sample in the `MAT`, including its name, date sequenced, country sequenced, and clade/lineage information. This information is used throughout the `RIVET` backend pipeline, for inferring the recombinant ancestor emergence date for example. +
+3. `Sequence Files (FASTA):` Downloadable at the following links, for a given `$TREE_DATE` (e.g., 2022-07-04) + * `https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.$TREE_DATE/genbank.fa.xz` + * `https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/cogUk.$TREE_DATE/cog_all.fasta.xz` +
+ +!!! info + During the `RIVET` backend quality control and filtration pipeline these sample sequence files are aligned to the SARS-CoV-2 reference and the `RIPPLES` inferred recombination-informative sites are inspected for bioinformatic and sequencing error quality issues to flag false-positive recombinants. + +!!! example + To download the SARS-CoV-2 `Genbank` sequences for `2022-07-04`: + ``` + wget https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.2022-07-04/genbank.fa.xz + ``` + + + + + +## Launch RIVET Job + + +The `RIVET` backend is set up to run **locally on your own machine** or on **Google Cloud Platform (GCP)**, and for ease of use is configured entirely through the `config.yaml` file. + +!!! Setup + If you would like to run your RIVET backend job on Google Cloud Platform, please see the following documentation for setting up an account: [GCP Setup Docs](../gcp_setup.md) + +Copy the config file from `template/config.yaml` into the current directory and fill out the fields. More information on each field can be found below. +``` +cp template/config.yaml . +``` + +```yaml +# GCP Credentials [LEAVE EMPTY FOR LOCAL JOB] +bucket_id: +project_id: +key_file: /tmp/keys/ + +# GCP Machine and Storage Bucket Config [LEAVE EMPTY FOR LOCAL JOB] +instances: +boot_disk_size: 50 +machine_type: + +# Ripples Parameters Config [REQUIRED] +version: ripples-fast +mat: +newick: +metadata: +date: +# Local results output directory, or name of folder on GCP storage bucket +results: +reference: reference.fa + +# Additional Parameters +num_descendants: 5 +public_tree: True +verbose: False +# Default to all available threads if left empty +threads: +docker_image: mrkylesmith/ripples_pipeline:latest +generate_taxonium: False +``` + +Fill out the configuration file with the settings for your `RIVET` job. If the field is already filled in, you will likely not want to change that parameter value. + +!!! 
info Configuration File + For more information on each field in the `config.yaml` file please see the following page: [RIVET Backend Configuration](config.md) + + +## RIVET Backend Outputs + +The pipeline will create a local results directory, based on the name given for the `results` field in `config.yaml` + +The pipeline will automatically output the following four files within your local `results` directory (and in `GCP` bucket if running remote job): + +1. `final_recombinants_.txt`: a `TSV` file containing the detected recombinants, with the recombinant node id, donor node id and acceptor node id as the first three columns in the file. The rest of the columns contain information about each detected recombinant, including clade/lineage assignments, 3SEQ M,N,K and p-values, a representative descendant (containing the fewest additional mutations with respect to the recombinant node), recombinant ranking scores, and other information to be displayed by the RIVET frontend. For more information on this file, please see the [RIVET Results Table](https://turakhialab.github.io/rivet/start/table.html) page. + +
+ +2. `trios.vcf`: VCF file containing the SNVs of each trio (recombinant and its parents) node. + +
+ +3. `sample_descedants.txt.xz`: a `TSV` file where each row contains a mapping from each trio node id (one node id per row), to a set of descendant samples corresponding to that internal node id. + +
+ +4. `.taxonium.jsonl.gz`: a jsonl file used by RIVET frontend to display the recombinant node trios within the context of the global phylogeny, powered by Taxonium and Treenome. + +
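As a rough illustration of how the first output file might be consumed downstream, the sketch below reads the recombinant, donor, and acceptor node ids from the first three columns of a `final_recombinants` TSV, as described in the list above. The function name `read_trios`, the `has_header` parameter, and the presence of a header row are assumptions made for this example; they are not part of `RIVET` itself.

```python
import csv


def read_trios(handle, has_header=True):
    """Return (recombinant, donor, acceptor) node-id tuples from a TSV handle.

    Assumption: the first three columns hold the recombinant, donor and
    acceptor node ids, in that order. Whether the file starts with a header
    row is also an assumption, controlled by `has_header`.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if one is present
    trios = []
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated rows
        trios.append((row[0], row[1], row[2]))
    return trios
```

Passing an open file handle (rather than a path) keeps the sketch independent of the exact output filename, which depends on the date of the tree being analyzed.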
+ +!!! note + Currently the `Taxonium` view is only provided using public trees provided at: [https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/](https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) \ No newline at end of file diff --git a/docs/javascripts/mathjax.js b/docs/javascripts/mathjax.js new file mode 100644 index 0000000..080801e --- /dev/null +++ b/docs/javascripts/mathjax.js @@ -0,0 +1,16 @@ +window.MathJax = { + tex: { + inlineMath: [["\\(", "\\)"]], + displayMath: [["\\[", "\\]"]], + processEscapes: true, + processEnvironments: true + }, + options: { + ignoreHtmlClass: ".*|", + processHtmlClass: "arithmatex" + } +}; + +document$.subscribe(() => { + MathJax.typesetPromise() +}) diff --git a/docs/start/features.md b/docs/start/features.md new file mode 100644 index 0000000..3bca548 --- /dev/null +++ b/docs/start/features.md @@ -0,0 +1,132 @@ +# Web Interface Walkthrough + +## Selecting Recombinant of Interest +Each row in the results table represents an inferred recombinant. You can **horizontally scroll** to the right to view more columns in the table, and **click** a row to select the recombinant you are interested in visualizing. + + + +For detailed information on each column of the results table, please see the [RIVET Results Table](https://turakhialab.github.io/rivet/start/table.html) page. + +## Results Table Next and Previous Buttons +Use the `next` and `previous` buttons shown below to skip to the next recombinant result (next row) and SNV visualization in the table. + + +!!! tip + You can also use the arrow keys instead of the `next` and `previous` buttons. Use the right arrow key :arrow_forward: and left arrow key :arrow_backward: to skip to `next` and `previous` results respectively. + + +## Sort by Column +The results can be sorted by any column, by **clicking on the column title**, shown below: + +
+ +!!! note + By default, the results are ranked by the `Recombinant Ranking Score`. + + +## Search Table +The table can be searched and the results shown will be filtered down based on the given query. For example, if you want to search for all recombinant results with `XBB` lineage classification just type `XBB` into the search bar. + +## Search by Sample ID +A user can search for recombinant ancestry in specific samples by using the search by sample identifier feature. Click the toggle button to its active state, and then enter the sample identifier into the search bar. When the `Search by sample` toggle is active, normal table search will be disabled and all search queries should be sample identifiers. + +!!! note + Once you have entered the sample identifier into the search bar, it may take a few seconds for the table results to refresh with the results of your query. + + + + +## SNV plot +When a user clicks on a row to select a recombinant of interest, the visualization shown below will be rendered. + + + +The above visualization shows all of the single-nucleotide variant (SNV) sites in the recombinant sequence and its two parents (donor/acceptor), with respect to the reference sequence. The recombinant-informative sites are highlighted in orange where the recombinant matches the donor, and blue where the recombinant matches the acceptor. The gene region annotations are shown below the trio sequences in the bottom track. + + +## Query Descendants +For a selected recombinant ancestor node of interest, you might want to query which samples are descendants of this inferred recombinant. Simply **click** the `Recombinant` label to the left of the track to view up to 10,000 sample descendants of that particular recombinant, as seen in the screenshots below. + +You can also click the `Donor` and `Acceptor` labels to query the samples that are descendants of those particular parental nodes. + + + +
+ +The side panel will display up to 10,000 sample descendants by default, and you can **click** the `Download Descendants` button to download a `.txt` file containing all sample descendants for the selected trio node (one per line). + + + +
+ +## Taxonium View +View trio sequences (recombinant/donor/acceptor) in Taxonium/Treenome Global Phylogeny. +!!! note + + The Taxonium view feature is currently only available for public tree results. +
+ + + +
+ +The `Recombinant/Donor/Acceptor` nodes are circled in the global tree. Click the magnify button shown in the image below to zoom into the particular node of interest. + + +
+ +## View UShER Subtree +This feature will take you to the [UCSC UShER](https://genome.ucsc.edu/cgi-bin/hgPhyloPlace) tool, where you can view the tree using [UShER](https://github.com/yatisht/usher). This feature will automatically sample 10 descendants from the recombinant node in order to view the subtree. + + + +!!! warning + + This feature will open a new tab to `UShER` and may take a few minutes to load in the new tab. + +Once finished loading, you will see the following page, where you can view the subtree by clicking `view downsampled global tree in Nextstrain`.
+ + + +
+ + + +
+ +## Recombinant Detailed Overview +To view even more detailed information about a particular recombinant of interest, click the `More Info` button in the `Overview` section. + + +
+ +**This will display the following information:** + +* Current Recombinant Lineage +* Recombinant Origin Date (as inferred by [Chronumental](https://doi.org/10.1101/2021.10.27.465994)) +* Recombinant parental lineages +* Number of sequences descendant from this recombinant +* Earliest descendant sequence +* Most recent descendant sequence +* Countries where descendant sequences have been detected +* Quality Control Checks not passing (otherwise PASS if all QC checks pass) + + + +!!! question + + If there is additional information you would like to know for a particular recombinant of interest, please make this suggestion through a [GitHub Issue](https://github.com/TurakhiaLab/rivet/issues) in our repository. + +
+ +## View Amino Acid Sites +This option shows the amino acid mutations matched with their corresponding nucleotide mutation positions. This feature uses `matUtils summary --translate`, which is built automatically into the `RIVET` backend pipeline. In short, `matUtils` provides a method to compute the correct amino acid translations at each node in the tree, which `RIVET` uses to obtain the amino acid mutations for a given recombinant ancestor node. + +For more information on this method, please see the following [matUtils documentation](https://usher-wiki.readthedocs.io/en/latest/tutorials.html#example-amino-acid-translation-workflow). + + + +
+All coding amino acid translations are annotated above each corresponding SNV position (if any). + + diff --git a/docs/start/filtration.md b/docs/start/filtration.md new file mode 100644 index 0000000..67c19bb --- /dev/null +++ b/docs/start/filtration.md @@ -0,0 +1,25 @@ +# Quality Control and Filtration Checks + +## 3SeqP02 +P-value from 3-seq > 0.2. + +## russPval005 +False-discovery rate (FDR) of the parsimony improvement > 0.05. (See [Supplementary Text S3 of RIPPLES](https://www.nature.com/articles/s41586-022-05189-9#MOESM1) for details of the null model.) + +## Alt +"Alternate": Other recombination trios with the same recombination node have more parsimony improvement, fewer possible breakpoint intervals, or better P-values. + +## cluster +All recombination-informative mutations occur within a span of 20 nucleotides. + +## redundant +More than two of the recombination node, donor node, and acceptor node appear in another trio. + +## Informative_sites_clump +More than 5 recombination-informative mutations in a 20-nucleotide span. + +## Suspicious_mutation_clump +More than 6 mutations (or 3 near indels) in a 20-nucleotide span on any of the donor node, the acceptor node or the recombination node. + +## Too_many_mutations_near_INDELs +Too many mutations on 100-nt spans near indels or a string of Ns. \ No newline at end of file diff --git a/docs/start/future.md b/docs/start/future.md new file mode 100644 index 0000000..d13dcbd --- /dev/null +++ b/docs/start/future.md @@ -0,0 +1,83 @@ +# Using RIVET for Other Pathogens + +Below are two examples of using `RIVET's` backend pipeline to infer and visualize recombinants of other pathogens beyond SARS-CoV-2. + +!!! warning + Currently, `RIVET's` backend QC/filtration pipeline is specific to SARS-CoV-2 and will not run when using the `RIVET` backend for other pathogens. 
+ + +## Human Respiratory Syncytial Virus (HRSV) Subgroup A + +Below are the steps followed to infer putative recombinants in an `RSV` mutation-annotated tree (MAT). + + + +### Using RIVET frontend for Visualization + + + + + +Since the SNV plot for RSV includes many sites, only the region up to around position 1000 is shown in the image above. +**Please click the download button below to view the entire `RSV` SNV plot as an SVG image.** + + + + + +
+ + +## Monkeypox Virus + +Edit the following fields in the `config.yaml` file: + +Change the GenBank file from the default SARS-CoV-2 file to the corresponding GenBank file for your pathogen of interest, Monkeypox virus in this case. + +```yaml +# Pathogen Ref Seq GenBank file +ref_seq: monkeypox.gb +``` + +!!! Warning + Make sure the `environment` field is set to `local`. + +```yaml +environment: local +port: 2000 +``` +If desired, you can change the local `port` at which `RIVET` will host the local HTTP server in your browser. + +
+ +Now run the following command and RIVET will automatically open your browser to launch the frontend results table and SNV visualization. +``` +python3 rivet-frontend.py -r recombination_mpxv.2023-07-01.tsv -v mpxv.2023-07-01.vcf -c config.yaml +``` + +Below is the SNV plot we get for one of the monkeypox virus inferred recombinants. + + + +
+ +!!! check + For pathogens with larger genomes than SARS-CoV-2, you may want to change the step interval of genomic coordinate tick marks. This can be done by changing the `tick_step` field in `RIVET` frontend `config.yaml` file. diff --git a/docs/start/images/RIVET-table-columns.png b/docs/start/images/RIVET-table-columns.png new file mode 100644 index 0000000..e63d10c Binary files /dev/null and b/docs/start/images/RIVET-table-columns.png differ diff --git a/docs/start/images/XBG-snv-plot.png b/docs/start/images/XBG-snv-plot.png new file mode 100644 index 0000000..753fe25 Binary files /dev/null and b/docs/start/images/XBG-snv-plot.png differ diff --git a/docs/start/images/monkeypox-snv.png b/docs/start/images/monkeypox-snv.png new file mode 100644 index 0000000..908c7a4 Binary files /dev/null and b/docs/start/images/monkeypox-snv.png differ diff --git a/docs/start/images/next-prev-buttons.png b/docs/start/images/next-prev-buttons.png new file mode 100644 index 0000000..7fd2501 Binary files /dev/null and b/docs/start/images/next-prev-buttons.png differ diff --git a/docs/start/images/query-desc-select.png b/docs/start/images/query-desc-select.png new file mode 100644 index 0000000..3d42eb8 Binary files /dev/null and b/docs/start/images/query-desc-select.png differ diff --git a/docs/start/images/query-desc.png b/docs/start/images/query-desc.png new file mode 100644 index 0000000..e5e77a3 Binary files /dev/null and b/docs/start/images/query-desc.png differ diff --git a/docs/start/images/row-select.png b/docs/start/images/row-select.png new file mode 100644 index 0000000..1dd6c78 Binary files /dev/null and b/docs/start/images/row-select.png differ diff --git a/docs/start/images/rsv-snv.png b/docs/start/images/rsv-snv.png new file mode 100644 index 0000000..907c1bf Binary files /dev/null and b/docs/start/images/rsv-snv.png differ diff --git a/docs/start/images/rsv-snv.svg b/docs/start/images/rsv-snv.svg new file mode 100644 index 0000000..1ef724c --- /dev/null +++ 
b/docs/start/images/rsv-snv.svg @@ -0,0 +1 @@ +[Text extracted from the RSV SNV plot SVG. Title: "Single-nucleotide variation in the recombinant and its parents". X-axis: "Genomic Coordinate" (0 to 15,222). Gene annotation track: F, G, L, M, M2, N, NS1, NS2, P, SH. Legend: "Recombinant matches acceptor", "Recombinant matches donor", "Non-Recombinant-Informative". Sequence tracks: Reference, Donor, Recombinant, Acceptor. Label: "Click below to view descendants". Per-site nucleotide strings and SNV position labels omitted.] \ No newline at end of file diff --git a/docs/start/images/search-by-sample.png b/docs/start/images/search-by-sample.png new file mode 100644 index 0000000..077f509 Binary files /dev/null and b/docs/start/images/search-by-sample.png differ diff --git a/docs/start/images/select-aa.png b/docs/start/images/select-aa.png new file mode 100644 index 0000000..34d5af6 Binary files /dev/null and b/docs/start/images/select-aa.png differ diff --git a/docs/start/images/select-info.png b/docs/start/images/select-info.png new file mode 100644 index 0000000..c4e8fea Binary files /dev/null and b/docs/start/images/select-info.png differ diff --git a/docs/start/images/select-taxonium.png b/docs/start/images/select-taxonium.png new file mode 100644 index 0000000..3d3e346 Binary files /dev/null and b/docs/start/images/select-taxonium.png differ diff --git a/docs/start/images/select-usher.png b/docs/start/images/select-usher.png new file mode 100644 index 0000000..f8a5a00 Binary files /dev/null and b/docs/start/images/select-usher.png differ diff --git a/docs/start/images/view-aa.png b/docs/start/images/view-aa.png new file mode 100644 index 0000000..9b53678 Binary files /dev/null and b/docs/start/images/view-aa.png differ diff --git a/docs/start/images/view-info.png b/docs/start/images/view-info.png new file mode 100644 index 0000000..c21a464 Binary files /dev/null and b/docs/start/images/view-info.png differ diff --git a/docs/start/images/view-nextstrain.png 
b/docs/start/images/view-nextstrain.png new file mode 100644 index 0000000..208c4a8 Binary files /dev/null and b/docs/start/images/view-nextstrain.png differ diff --git a/docs/start/images/view-taxonium.png b/docs/start/images/view-taxonium.png new file mode 100644 index 0000000..71e3428 Binary files /dev/null and b/docs/start/images/view-taxonium.png differ diff --git a/docs/start/images/view-usher-subtree.png b/docs/start/images/view-usher-subtree.png new file mode 100644 index 0000000..154951c Binary files /dev/null and b/docs/start/images/view-usher-subtree.png differ diff --git a/docs/start/table.md b/docs/start/table.md new file mode 100644 index 0000000..b6f29fb --- /dev/null +++ b/docs/start/table.md @@ -0,0 +1,111 @@ +# RIVET Results Table +Each of the sections below describes the columns of RIVET's results table of inferred recombinant ancestors. + +## Recombinant Node ID +* UShER assigned node id for inferred recombinant node + +## Donor Node ID +* UShER assigned node id for donor (recombinant parental node) + +## Acceptor Node ID +* UShER assigned node id for acceptor (recombinant parental node) + +## Breakpoint 1 Interval +* RIPPLES inferred breakpoint interval 1 + +## Breakpoint 2 Interval +* RIPPLES inferred breakpoint interval 2 + +!!! 
info + + For more information on the `RIPPLES` algorithm, please see: [Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape](https://www.nature.com/articles/s41586-022-05189-9) + + +## Recombinant Clade +* Recombinant clade classification as assigned by `Nextstrain` + +## Recombinant Lineage +* Recombinant lineage designation as assigned by `Pangolin` + +## Donor Clade +* Donor clade classification as assigned by `Nextstrain` + +## Donor Lineage +* Donor lineage designation as assigned by `Pangolin` + +## Acceptor Clade +* Acceptor clade classification as assigned by `Nextstrain` + +## Acceptor Lineage +* Acceptor lineage designation as assigned by `Pangolin` + +## Chronumental-inferred origin date +* Inferred first emergence of recombinant ancestor sequence using the [Chronumental](https://github.com/theosanderson/chronumental) method, which runs automatically as part of the `RIVET` pipeline. In short, `Chronumental` is an accurate and scalable time-tree estimation method that uses stochastic gradient descent to estimate lengths of time for tree branches under a probabilistic model. For more information on this method, please see the [Chronumental](https://doi.org/10.1101/2021.10.27.465994) paper. + +## Recombinant Ranking Score +* The ranking score represents a **growth score** that we compute for each inferred recombinant, which is designed to help prioritize recently emerging recombinants and recombinants with many descendant circulating sequences. +* By default, we order the main `RIVET` results table by maximum ranking score, which attempts to prioritize highest concern recombinants of interest at the top of the list. 
+ +The recombinant **growth metric**, *G(R)*, for a recombinant node *R* with a set of descendants *S* is defined below: + +$$ G(R) = 2^{-m(R)} \cdot \sum_{s \in S} 2^{-m(s)} $$ + +In the equation above, *m(R)* and *m(s)* correspond to the number of months (30-day intervals) +elapsed since the recombinant node *R* was inferred to have originated and since its descendant +sequence *s* was sampled, respectively. The growth score above, *G(R)*, is computed for each +detected recombinant *R*, and the final recombinant list is ranked based on descending growth +scores. + +## Representative Descendant +* This selected sample is a descendant with the fewest additional mutations as compared to its recombinant ancestor. + + +## Informative Site Sequence +* The informative site sequence is a binary string of `A` and `B` for each trio sequence, where an `A` is assigned if the recombinant node allele at the site matches only the donor node allele at that site, or a `B` if the recombinant matches only the acceptor. + + +## 3SEQ (M, N, K) +* 3SEQ M, N, K values used to check individual p-values in a pre-generated 3SEQ p-value table. + + +## 3SEQ P-Value + +!!! info + For more information on the `3SEQ` method and its use in `RIPPLES`, please see [Improved Algorithmic Complexity for the 3SEQ Recombination Detection Algorithm](https://academic.oup.com/mbe/article/35/1/247/4318635) and the Supplementary Section of [Pandemic-scale phylogenomics reveals the SARS-CoV-2 recombination landscape](https://www.nature.com/articles/s41586-022-05189-9#MOESM1) + + + + +## Original Parsimony Score +* The original parsimony placement score on the global phylogeny. + +## Parsimony Score Improvement +* Highest parsimony score improvement relative to original parsimony score. 
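The growth metric *G(R)* from the Recombinant Ranking Score section can be sketched in a few lines of Python. This is a minimal illustration, not the `RIVET` implementation: the function names, the choice to count whole 30-day intervals (rather than fractional months), and the reference date `today` are assumptions made for the example.

```python
from datetime import date


def months_elapsed(earlier: date, later: date) -> int:
    """Whole 30-day intervals between two dates (assumption: rounded down)."""
    return max((later - earlier).days, 0) // 30


def growth_score(origin_date: date, sample_dates, today: date) -> float:
    """G(R) = 2^(-m(R)) * sum over descendants s of 2^(-m(s)).

    origin_date  : inferred origin date of the recombinant node R
    sample_dates : sampling dates of the descendant sequences in S
    today        : reference date from which elapsed months are counted
    """
    m_r = months_elapsed(origin_date, today)
    descendant_term = sum(2.0 ** -months_elapsed(d, today) for d in sample_dates)
    return 2.0 ** -m_r * descendant_term
```

Under these assumptions, a recombinant that originated two months ago with one descendant sampled today and one sampled a month ago scores 0.25 × (1 + 0.5) = 0.375, so both a recent origin and many recently sampled descendants push a recombinant up the ranking.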
+ + +## Quality Control (QC) Flags +* This column lists the quality control (QC) or filtration checks that were flagged, meaning that this inferred recombinant is not high-confidence and could represent a false-positive recombinant resulting from bioinformatic, contamination or other sequencing errors. + + +!!! info + + For a detailed description of each quality control and filtration check performed in `RIVET's` backend pipeline, see the [Quality Control and Filtration Checks](filtration.md) page. + +**Common sources of false positive errors in `RIVET’s` pipeline include, but are not limited to:** + +* Contamination, sequencing, or assembly errors in the recombinant or parent sequences +* Missing sequences resulting in artificially long branches in the UCSC public tree +* Misalignments or phylogenetic inconsistencies + + +**Common sources of false negative errors in `RIVET’s` pipeline include, but are not limited to:** + +* Too few recombination-informative sites in the recombinant +* More than two breakpoints are required to explain the recombinant +* Too few descendants of the recombinant or its parent in the UCSC public tree + + +## "Click to View" Taxonium +* When clicked, this button will open a separate tab launching the [Taxonium](https://taxonium.org/) browser in order to view the particular recombinant trio (recombinant/donor/acceptor) in the context of the global phylogeny. +In short, [Taxonium](https://elifesciences.org/articles/82392) is a visualization tool for exploring large trees. 
+ diff --git a/install/rivet_env.yml b/install/rivet_env.yml index 86f069b..c793b00 100644 --- a/install/rivet_env.yml +++ b/install/rivet_env.yml @@ -14,3 +14,4 @@ dependencies: - chronumental - cyvcf2 - termcolor + - mkdocs-material diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..1ec4ae6 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,94 @@ +site_name: RIVET Wiki +repo_name: TurakhiaLab/rivet +repo_url: https://github.com/TurakhiaLab/rivet + +theme: + name: material + features: + - announce.dismiss + - content.action.edit + - content.action.view + - content.code.annotate + - content.code.copy + - content.tooltips + - navigation.footer + - navigation.expand + - navigation.tabs.sticky + - navigation.instant.prefetch + - navigation.tracking + - search.highlight + - search.share + - search.suggest + - toc.follow + - toc.integrate + language: en + palette: + - scheme: default + primary: indigo + accent: indigo + toggle: + icon: material/brightness-7 + name: Switch to dark mode + - scheme: slate + primary: indigo + accent: indigo + toggle: + icon: material/brightness-4 + name: Switch to light mode + + favicon: images/rivet-icon.png + logo: images/rivet-icon.png + + icon: + admonition: + note: octicons/tag-16 + info: octicons/info-16 + tip: octicons/squirrel-16 + success: octicons/check-16 + question: octicons/question-16 + warning: octicons/alert-16 + bug: octicons/bug-16 + example: octicons/beaker-16 + quote: octicons/quote-16 + +extra: + social: + - icon: fontawesome/brands/github + link: https://github.com/TurakhiaLab/rivet + +markdown_extensions: + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.snippets + - admonition + - pymdownx.arithmatex: + generic: true + - footnotes + - pymdownx.details + - pymdownx.superfences + - pymdownx.mark + - attr_list + - pymdownx.emoji: + emoji_index: !!python/name:materialx.emoji.twemoji + emoji_generator: !!python/name:materialx.emoji.to_svg + +use_directory_urls: false +nav: 
+ - Home: index.md + #- RIVET Web Interface: + # - start/features.md + # - start/table.md + # - start/filtration.md + # - start/future.md + #- Use RIVET Locally: + # - installation/installation.md + # - installation/upload.md + # - installation/analyze.md + #- Contributing: + # - contributing/contributing.md + +extra_javascript: + - javascripts/mathjax.js + - https://polyfill.io/v3/polyfill.min.js?features=es6 + - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js \ No newline at end of file