The "Web of Life" (WoL) project is a series of efforts to reconstruct an accurate reference phylogeny for microbial genomes, and to build resources that can benefit, and are already benefiting, microbiome researchers.
Phase I of the project has been completed (Zhu et al., 2019). We have released a reference tree of 10,575 bacterial and archaeal genomes, built from 381 marker genes.
The project is detailed at our website: https://biocore.github.io/wol/, including data and metadata, code, protocols, a gallery and a visualizer. Large data files are hosted at our Globus endpoint: WebOfLife (see instruction).
This public resource provides everything you need to start microbiome data analysis using WoL, including raw sequence data, metadata, tree and taxonomy, and pre-built databases that are ready to be plugged into your bioinformatics protocols. Currently, we provide databases for QIIME 2, SHOGUN, Bowtie2, Centrifuge, Kraken2 / Bracken, BLASTn and BLASTp, Minimap2, and DIAMOND. Even if your favorite tool is not on this list, we provide detailed tutorials on how to build your own database, along with many other related protocols. WoL is also hosted at our web-based microbiome study platform: Qiita (https://qiita.ucsd.edu/) (see details).
The following tutorial assumes that you have downloaded the entire WoL directory from our Globus server. The paths mentioned below are relative to this directory.
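Before running the commands below, it can help to confirm that the download is complete. The following is a minimal sketch, not part of WoL or Woltka: `check_wol_paths` is a hypothetical helper name, and the list covers only the paths that this tutorial itself mentions.

```shell
# Report which of the WoL paths used in this tutorial are missing.
# Usage: check_wol_paths [WOL_DIR]   (defaults to the current directory)
# Note: check_wol_paths is a hypothetical helper, not shipped with WoL.
check_wol_paths() {
    wol_dir="${1:-.}"
    missing=0
    for p in \
        databases/shogun \
        taxonomy/taxid.map \
        taxonomy/nodes.dmp \
        taxonomy/names.dmp \
        taxonomy/lineage.txt \
        annotation/coords.txt.xz \
        annotation/uniref.map.xz \
        annotation/metacyc
    do
        if [ ! -e "$wol_dir/$p" ]; then
            echo "missing: $p"
            missing=$((missing + 1))
        fi
    done
    echo "$missing path(s) missing"
    # Return success only if every path was found.
    [ "$missing" -eq 0 ]
}
```

Call it as `check_wol_paths /path/to/wol`. It prints one line per missing path and returns a non-zero status if anything is absent.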
First, you need to align your sequences (namely your FastQ / Fast5 / BAM files) against the WoL database using an aligner of your choice. Take Bowtie2 as an example. Our bioinformatics tool, SHOGUN, provides a Bowtie2 wrapper optimized for shotgun metagenomic datasets:
shogun align -d databases/shogun -a bowtie2 -t 16 -p 0.95 -i input.fa -o .
This will generate a SAM format alignment file.
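A quick sanity check on the SAM file can catch an empty or mostly unaligned result before classification. The sketch below is an illustration rather than part of SHOGUN or Woltka. It relies only on the SAM convention that header lines start with `@` and that the third column (RNAME) is `*` for unaligned records. The toy file, its read and genome IDs, and the helper name `sam_summary` are made up for the example.

```shell
# Toy SAM file for illustration only (read and genome IDs are made up).
printf '@HD\tVN:1.0\tSO:unsorted\n' > toy.sam
printf 'read1\t0\tG000000001\t100\t42\t50M\t*\t0\t0\t*\t*\n' >> toy.sam
printf 'read2\t16\tG000000002\t200\t42\t50M\t*\t0\t0\t*\t*\n' >> toy.sam
printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n' >> toy.sam
printf 'read4\t0\tG000000001\t300\t42\t50M\t*\t0\t0\t*\t*\n' >> toy.sam

# Skip header lines and unaligned records, then count aligned records
# and distinct reference sequences.
sam_summary() {
    awk -F'\t' '
        /^@/ { next }
        $3 != "*" { n++; refs[$3] = 1 }
        END {
            k = 0
            for (r in refs) k++
            printf "aligned_records=%d references=%d\n", n, k
        }' "$1"
}

sam_summary toy.sam
```

On the toy file this prints `aligned_records=3 references=2`. On real data, replace `toy.sam` with the SAM file produced by the alignment step.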
The alignment step has been automated in Qiita. If you use Qiita, the SAM file is ready for download.
[Note] You can also run Bowtie2 manually with your choice of parameters, or use other aligners and other databases. Woltka is designed for flexibility.
woltka classify -i input.sam -o output.biom
Note that you can compress the SAM file to save disk space, and Woltka can parse compressed files.
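As a sketch of the compression step, the following uses gzip and checks that the compressed copy decompresses to the same content before the original is removed. gzip is an assumption here; check the Woltka documentation for the compression formats it reads. The one-line stand-in file exists only so the snippet runs on its own.

```shell
# Stand-in file so the sketch is self-contained; use your real SAM file instead.
[ -e input.sam ] || printf 'read1\t0\tG000000001\t100\t42\t50M\t*\t0\t0\t*\t*\n' > input.sam

# Compress to a new file, leaving the original untouched.
gzip -c input.sam > input.sam.gz

# Confirm the compressed copy decompresses to identical content
# before deleting the original.
if gzip -dc input.sam.gz | cmp -s - input.sam; then
    echo "input.sam.gz verified"
fi
```

The compressed file can then be given to Woltka in place of the original, for example `-i input.sam.gz`.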
Use the original NCBI taxonomy:
woltka classify \
--input input.sam \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
--output output.biom
Use lineage strings extracted from NCBI (this loses some resolution, but the results are more structured, which is convenient for users familiar with QIIME 2):
woltka classify \
--input input.sam \
--lineage taxonomy/lineage.txt \
--output output.biom
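To illustrate what "more structured" means, the sketch below pulls the genus out of rank-prefixed lineage strings. It assumes a layout of ID, tab, then semicolon-delimited units such as `k__Bacteria; p__Firmicutes`. Check `taxonomy/lineage.txt` before relying on this, since the layout is an assumption. The toy file, its genome IDs and the helper name `lineage_genus` are made up for the example.

```shell
# Toy lineage file for illustration only; the IDs and the exact layout
# (ID, tab, semicolon-delimited rank-prefixed lineage) are assumptions.
printf 'G000000001\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Bacillus; s__Bacillus subtilis\n' > toy_lineage.txt
printf 'G000000002\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia coli\n' >> toy_lineage.txt
printf 'G000000003\tk__Archaea; p__Euryarchaeota\n' >> toy_lineage.txt

# Print "ID <tab> genus" for each entry that has a genus-level (g__) unit.
lineage_genus() {
    awk -F'\t' '{
        n = split($2, units, /; */)
        for (i = 1; i <= n; i++)
            if (units[i] ~ /^g__./) {
                print $1 "\t" substr(units[i], 4)
                break
            }
    }' "$1"
}

lineage_genus toy_lineage.txt
```

Each unit carries its rank in the prefix, so a fixed rank can be read off directly. Entries without that rank, such as the third toy entry, simply produce no output.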
We also provide original and curated versions of the NCBI and GTDB taxonomies to choose from.
Slightly modify the command, adding desired ranks:
woltka classify \
--input input.sam \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
--rank phylum,genus,species \
--output output.biom
mcdir=annotation/metacyc
woltka classify \
--input input.sam \
--coords annotation/coords.txt.xz \
--map annotation/uniref.map.xz \
--map $mcdir/protein.map --names $mcdir/protein.names \
--map $mcdir/protein2enzrxn.map --names $mcdir/enzrxn.names \
--map $mcdir/enzrxn2reaction.map --names $mcdir/reaction.names \
--map $mcdir/reaction2pathway.map --names $mcdir/pathway.names \
--map $mcdir/pathway2class.map --names $mcdir/class.names \
--map-as-rank \
--rank protein,enzrxn,reaction,pathway,class \
--output output_dir
Suppose you want to stratify functional annotations by genus (taxonomy). First, run taxonomic classification at the genus level, and export read-to-genus maps:
woltka classify \
--input input.sam \
...
--rank genus \
--name-as-id \
--output genus.biom \
--outmap map_dir
Second, run functional annotation, adding the read-to-genus maps for stratification:
woltka classify \
--input input.sam \
--coords annotation/coords.txt.xz \
...
--stratify map_dir \
--output output_dir