GitHub - ajodeh-juma/rvfvtyping: Classification and phylogenetic lineage assignment of Rift Valley fever virus consensus genomes using the glycoprotein Gn/G2 gene found within the M-segment of the virus genome

Introduction

rvfvtyping is a bioinformatics analysis pipeline for classification and phylogenetic lineage assignment of Rift Valley fever virus consensus genomes using the glycoprotein Gn/G2 gene found within the M-segment of the virus genome.

Classifying query sequences involves two steps. The first step is the identification of the virus species and the second is the assignment of Rift Valley fever virus lineages through phylogenetic analysis. Classification of query sequences is performed using diamond, while phylogenetic assignment uses iqtree and is largely adapted from the initial pangolin method developed by Áine O'Toole.

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

A web application of the pipeline is hosted on a dedicated server at the University of KwaZulu Natal and can be found here

Installation

rvfvtyping runs on UNIX/Linux systems. You will need to install Miniconda3 from here. Once Miniconda3 has been installed, proceed with the pipeline installation:

git clone https://github.com/ajodeh-juma/rvfvtyping.git
cd rvfvtyping
conda env create -n rvfvtyping-env -f environment.yml
conda activate rvfvtyping-env

Testing

Optional: Test the installation on a single FASTA file:

nextflow run main.nf -profile test

Optional: Test the installation on several FASTA sequence files:

nextflow run main.nf -profile test_full

Usage

For minimal pipeline options, use the --help flag e.g.

nextflow run main.nf --help

To see all the options, use the --show_hidden_params flag e.g.

nextflow run main.nf --help --show_hidden_params

A typical command to classify and assign lineages using the glycoprotein (Gn) classifier:

nextflow run main.nf \
   --input 'data/test/*.fa' \
   --segment Gn \
   --outdir output-dir \
   -work-dir work-dir
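
When runs are launched from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch and not part of rvfvtyping: the helper name `build_command` is hypothetical, and it relies only on the flags shown in this README (`--input`, `--segment`, `--outdir`) plus Nextflow's own `-work-dir` option.

```python
import shlex

# Segment values listed under "Input/output options" in this README.
VALID_SEGMENTS = ("Gn", "S", "M", "L")


def build_command(input_glob, segment, outdir="./results", work_dir=None):
    """Return the argv list for an rvfvtyping run (hypothetical helper)."""
    if segment not in VALID_SEGMENTS:
        raise ValueError(f"segment must be one of {VALID_SEGMENTS}, got {segment!r}")
    argv = [
        "nextflow", "run", "main.nf",
        "--input", input_glob,
        "--segment", segment,
        "--outdir", outdir,
    ]
    if work_dir is not None:
        # -work-dir (single dash) is a Nextflow option, not a pipeline parameter.
        argv += ["-work-dir", work_dir]
    return argv


cmd = build_command("data/test/*.fa", "Gn", outdir="output-dir", work_dir="work-dir")
# shlex.join quotes the glob so a shell would hand it to Nextflow unexpanded.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the run without going through a shell, so the glob pattern reaches Nextflow as written.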

Method details

The pipeline offers several parameters, including those highlighted below:

Input/output options
  --input                      [string]  Input Fasta file for typing
  --segment                    [string]  genomic segment of the virus. options are 'Gn', 'S', 'M' and 'L'
  --outdir                     [string]  The output directory where the results will be saved. [default: ./results]
  --email                      [string]  Email address for completion summary.

Diamond options
  --skip_diamond               [boolean] Skip all DIAMOND BLAST against the pre-configured database.

Mandatory parameters

| parameter | description | type |
|-----------|-------------|------|
| `--input` | Input FASTA file(s) in `.fa` or `.fasta` format for typing | `string` |
| `--segment` | Genomic segment of the virus: `Gn`, `S`, `M` or `L` | `string` |
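
Because `--input` expects `.fa` or `.fasta` files, a short pre-flight check can catch misnamed or empty inputs before a run is launched. This is an illustrative sketch only: `find_fasta_inputs` and `count_records` are hypothetical helpers, matching is done on the file extension alone, and the pipeline may validate its inputs differently.

```python
from pathlib import Path

# Extensions named in the --input description.
FASTA_SUFFIXES = {".fa", ".fasta"}


def find_fasta_inputs(directory):
    """Return the sorted FASTA paths in `directory`, judged by extension only."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in FASTA_SUFFIXES
    )


def count_records(path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `[(p.name, count_records(p)) for p in find_fasta_inputs("data/test")]` would list each candidate input with its number of sequences.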

Output

Several output files are generated, including a comma-separated values file (lineages.csv) that lists, one per line, the taxon name and the lineage assigned to each input query sequence

e.g.

Query	Lineage	aLRT	UFbootstrap	Length	Note	Year_first	Year_last	Countries
DQ380218	G	84	70	3885	assigned (bootstrap value >= 70)	1969	1993	Senegal;CAR;Zimbabwe;Guinea
HM587118	L	99	100	490	assigned (bootstrap value >= 70)	1963	1995	Zimbabwe;Egypt;South Africa;Kenya
DQ380221	D	92	98	3885	assigned (bootstrap value >= 70)	1973	1973	CAR
DQ380222	J	77	27	3885	unassigned (bootstrap value < 70)
HM587045	B	89	97	490	assigned (bootstrap value >= 70)	1972	1972	Kenya
DQ380189	L	99	100	3885	assigned (bootstrap value >= 70)	1963	1995	Zimbabwe;Egypt;South Africa;Kenya
HM587125	O	92	98	490	assigned (bootstrap value >= 70)	1951	1951	South Africa
HM587108	I	87	90	490	assigned (bootstrap value >= 70)	1955	1956	South Africa
MG972973	C	88	96	3852	assigned (bootstrap value >= 70)	1976	2016	South Africa;Somalia;Uganda;Angola;Madagascar;Sudan;Zimbabwe;Mauritania;Saudi Arabia;Kenya
AF134496	N	88	84	738	assigned (bootstrap value >= 70)	1975	1993	Senegal;Mauritania;Burkina Faso
EU574086.1	J	74	33	1690	unassigned (bootstrap value < 70)
RVFV_Namibia_2011_MT561463_NAM_2011	C	89	95	3830	assigned (bootstrap value >= 70)	1976	2016	South Africa;Somalia;Uganda;Angola;Madagascar;Sudan;Zimbabwe;Mauritania;Saudi Arabia;Kenya
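
The table above can be post-processed with a few lines of Python. The sketch below is a hedged example, not code from the pipeline: it assumes lineages.csv is comma-delimited with the header shown above, and it assumes the threshold of 70 quoted in the Note column applies to the UFbootstrap column (which is consistent with the example rows). The function names are hypothetical.

```python
import csv
import io

# Threshold quoted in the Note column; assumed to apply to UFbootstrap.
BOOTSTRAP_THRESHOLD = 70


def read_lineages(handle, delimiter=","):
    """Yield one dict per query sequence from a lineages table."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["UFbootstrap"] = int(row["UFbootstrap"])
        # Unassigned rows carry no metadata; normalise Countries to a list.
        countries = row.get("Countries") or ""
        row["Countries"] = countries.split(";") if countries else []
        yield row


def is_assigned(row, threshold=BOOTSTRAP_THRESHOLD):
    """True when the bootstrap support reaches the threshold."""
    return row["UFbootstrap"] >= threshold


sample = io.StringIO(
    "Query,Lineage,aLRT,UFbootstrap,Length,Note,Year_first,Year_last,Countries\n"
    "DQ380218,G,84,70,3885,assigned (bootstrap value >= 70),1969,1993,Senegal;CAR;Zimbabwe;Guinea\n"
    "DQ380222,J,77,27,3885,unassigned (bootstrap value < 70),,,\n"
)
for row in read_lineages(sample):
    status = "assigned" if is_assigned(row) else "unassigned"
    print(row["Query"], row["Lineage"], status)
```

If the file turns out to be tab-delimited, as the rendering above suggests it might be, passing `delimiter="\t"` is the only change needed.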

If --skip_diamond is used, the classification file diamond_results.csv is not generated. Otherwise, it contains one row per DIAMOND hit, e.g.

QueryID	Length	SubjectID	Segment	Product	PercentIdentity	Mismatches	Gaps
HM587118	489	YP_003848705.1	M	glycoprotein	100	0	0
MG972973	3591	YP_003848705.1	M	glycoprotein	99.3	8	0
DQ380221	3591	YP_003848705.1	M	glycoprotein	99	4	7
AF134496	738	YP_003848705.1	M	glycoprotein	98.8	3	0
DQ380222	3591	YP_003848705.1	M	glycoprotein	99.2	9	0
EU574086.1	795	YP_003848706.1	S	non-structural protein	97.4	7	0
EU574086.1	735	YP_003848707.1	S	nucleocapsid	99.6	1	0
RVFV_Namibia_2011_MT561463_NAM_2011	3558	YP_003848705.1	M	glycoprotein	99.2	9	0
DQ380218	3591	YP_003848705.1	M	glycoprotein	99.5	6	0
HM587108	489	YP_003848705.1	M	glycoprotein	100	0	0
DQ380189	3591	YP_003848705.1	M	glycoprotein	98.9	13	0
HM587125	489	YP_003848705.1	M	glycoprotein	99.4	1	0
HM587045	489	YP_003848705.1	M	glycoprotein	100	0	0
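
Since the Gn gene lies within the M segment, one practical use of this table is to flag queries whose hits fall only on other segments, such as EU574086.1 in the example above. The sketch below is a hedged illustration, not part of the pipeline: the function names are hypothetical, and it assumes diamond_results.csv is comma-delimited with the header shown.

```python
import csv
import io
from collections import defaultdict


def segments_by_query(handle, delimiter=","):
    """Map each QueryID to the set of segments among its DIAMOND hits."""
    hits = defaultdict(set)
    for row in csv.DictReader(handle, delimiter=delimiter):
        hits[row["QueryID"]].add(row["Segment"])
    return dict(hits)


def off_target(hits, expected="M"):
    """Return, sorted, the queries with no hit on the expected segment."""
    return sorted(q for q, segs in hits.items() if expected not in segs)


sample = io.StringIO(
    "QueryID,Length,SubjectID,Segment,Product,PercentIdentity,Mismatches,Gaps\n"
    "HM587118,489,YP_003848705.1,M,glycoprotein,100,0,0\n"
    "EU574086.1,795,YP_003848706.1,S,non-structural protein,97.4,7,0\n"
    "EU574086.1,735,YP_003848707.1,S,nucleocapsid,99.6,1,0\n"
)
hits = segments_by_query(sample)
print(off_target(hits))
```

A query reported here is a candidate for closer inspection before its lineage call is interpreted.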

Web application.

The tool is also implemented as a web application at https://www.genomedetective.com/app/typingtool/rvfv/

Pipeline Summary

By default, the pipeline currently performs the following:

Classification of query sequence(s) (diamond)
Phylogenetic typing (iqtree)

Credits

rvfvtyping was originally written by John Juma.

We thank the following people for their extensive assistance in the development of this pipeline:

License

rvfvtyping is free software, licensed under GPLv3.

Issues

Please report any issues to the issues page.

Contribute

If you wish to fix a bug or add new features to the software, we welcome Pull Requests. We use GitHub Flow style development. Please fork the repo, make the change, then submit a Pull Request against our master branch, with details about what the change is and what it fixes or adds. We will then review your changes and merge them, or provide feedback on enhancements.

Citations

rvfvtyping pipeline uses the following software:

Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176

Guindon, S., & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52(5), 696–704. https://doi.org/10.1080/10635150390235520

Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q., & Vinh, L. S. (2018). UFBoot2: Improving the Ultrafast Bootstrap Approximation. Molecular Biology and Evolution, 35(2), 518–522. https://doi.org/10.1093/molbev/msx281

Huelsenbeck, J. P., & Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8), 754–755. https://doi.org/10.1093/bioinformatics/17.8.754

Katoh, K. (2002). MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–3066. https://doi.org/10.1093/nar/gkf436

Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J., & Higgins, D. G. (2007). Clustal W and Clustal X version 2.0. Bioinformatics, 23(21), 2947–2948. https://doi.org/10.1093/bioinformatics/btm404

Vilsker, M., Moosa, Y., Nooij, S., Fonseca, V., Ghysens, Y., Dumon, K., Pauwels, R., Alcantara, L. C., Vanden Eynden, E., Vandamme, A.-M., Deforche, K., & de Oliveira, T. (2019). Genome Detective: An automated system for virus identification from high-throughput sequencing data. Bioinformatics, 35(5), 871–873. https://doi.org/10.1093/bioinformatics/bty695

Yu, G., Smith, D. K., Zhu, H., Guan, Y., & Lam, T. T.-Y. (2017). ggtree: An r package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution, 8(1), 28–36. https://doi.org/10.1111/2041-210X.12628

seqmagick: an imagemagick-like frontend to Biopython SeqIO
