This repository mostly describes how to build a database of reference genomes from RefSeq and metagenome-assembled genomes (MAGs) from multiple large-scale metagenomic projects spanning various environments. The database does not include human or host-associated MAGs, and is intended mainly for exploring genomes and marker genes of environmental metagenomes. The included scripts cover downloading and reformatting sets of genomes, then calling genes or performing functional annotation on a specific subset of the downloaded genomes for further analysis.
The RefSeq database was built in July 2019. As of that date, downloading all complete RefSeq genomes and the MAGs from the datasets below amounts to approximately 30,000 genomes.
To include all genomes from NCBI regardless of completion status, download the genomes listed in the accession list `accessions/2019-08-01-incomplete-genbank-genomes-accessions.txt`. This includes all genomes with an assembly level of chromosome, contig, or scaffold deposited in GenBank as of 2019-08-01. The entire metadata file is too large to store on GitHub, so it is kept in an OSF repository, with dated folders for updated database files.
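As a rough sketch of how an accession list like the one above could be handed to `ncbi-genome-download`, the Python below reads the list and assembles a command line without running it. The helper names, the one-accession-per-line file layout, the `all` taxonomic group, and the `fasta` format are assumptions made for illustration; they are not taken from this repository's scripts.

```python
from pathlib import Path


def read_accessions(path):
    """Return the non-empty, non-comment lines of an accession list file.

    Assumes one assembly accession per line (an assumption, not confirmed
    by the repository).
    """
    accessions = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def build_download_command(accession_file, group="all", fmt="fasta"):
    """Build (but do not run) an ncbi-genome-download command line.

    The group and format defaults are illustrative choices.
    """
    return [
        "ncbi-genome-download",
        "--section", "genbank",
        "--formats", fmt,
        "--assembly-accessions", str(accession_file),
        group,
    ]
```

The returned list can be passed to `subprocess.run` once the arguments have been checked against the installed version of `ncbi-genome-download`.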
- Anantharaman et al. 2016 "Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system". Bioproject: PRJNA288027
- Parks et al. 2017 "Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life". Bioproject: PRJNA348753
- Woodcroft et al. 2018 "Genome-centric view of carbon processing in thawing permafrost". Bioproject: PRJNA386568
- Crits-Christoph et al. 2018 "Novel soil bacteria possess diverse genes for secondary metabolite biosynthesis". Bioproject: PRJNA449266
- Tully et al. 2018 "The reconstruction of 2,361 draft metagenome-assembled genomes from the global oceans". Bioproject: PRJNA391943
- Dombrowski et al. 2018 "Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments". Bioproject: PRJNA362212
- `ncbi-bioproject-files/` contains individual bioproject accession information for all datasets, from which genomes were downloaded through `ncbi-genome-download` and which was used to merge metadata
- `bioproject-accession-lists/` contains accession lists for each bioproject, and the combined list for bulk download
- `metadata/` contains more detailed metadata on specific metagenomic projects and downloaded genomes from NCBI
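To illustrate how per-bioproject accession lists could be merged into a single list for bulk download, here is a minimal Python sketch. It assumes each list is a `.txt` file with one accession per line; the function name and the file pattern are hypothetical, and the repository's actual combined list may have been produced differently.

```python
from pathlib import Path


def combine_accession_lists(list_dir, pattern="*.txt"):
    """Merge all accession lists in list_dir into one de-duplicated list.

    Files are read in sorted filename order and the first occurrence of
    each accession is kept, so the result is reproducible.
    """
    seen = set()
    combined = []
    for list_file in sorted(Path(list_dir).glob(pattern)):
        for line in list_file.read_text().splitlines():
            accession = line.strip()
            if accession and accession not in seen:
                seen.add(accession)
                combined.append(accession)
    return combined
```

De-duplicating while preserving order keeps the combined list stable between runs, which helps when the same list is later split into batches.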
Previously, I would download the entire GenBank database (~200,000 genomes) and then loop over the genomes one by one to reformat, annotate, and search for specific markers of interest. This was extremely tedious, took up a lot of space on a server, and was slow because every step ran serially. Using the resources available through HTCondor and the UW-Madison Center for High-Throughput Computing, I've repurposed these steps so that each job handles a single genome assembly and performs the reformatting, annotation, and marker search within that job. This way the jobs are highly parallel and can flock out to other resources such as the Open Science Grid. The only things that need to change periodically are the list of GenBank assemblies/FTP paths in the `metadata/` folder, if there are major updates to the database, and the marker you want to search for, which is specified in the `submit` file.
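The per-assembly job described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's executable: the function names and output file names are hypothetical, the reformatting step is left out because it is not described here, and the genome FASTA is assumed to be already downloaded and decompressed. The URL helper relies on the NCBI convention that an assembly directory contains a file named after the directory with a `_genomic.fna.gz` suffix.

```python
def genomic_fasta_url(ftp_path):
    """Derive the genomic FASTA URL from an NCBI assembly FTP directory path."""
    ftp_path = ftp_path.rstrip("/")
    assembly_name = ftp_path.rsplit("/", 1)[-1]
    return f"{ftp_path}/{assembly_name}_genomic.fna.gz"


def job_commands(ftp_path, marker_hmm):
    """Build the gene-calling and marker-search commands for one assembly.

    Returns argument lists only; nothing is executed here.
    """
    assembly_name = ftp_path.rstrip("/").rsplit("/", 1)[-1]
    fasta = f"{assembly_name}_genomic.fna"   # assumed already decompressed
    proteins = f"{assembly_name}.faa"        # hypothetical output name
    hits = f"{assembly_name}.tbl"            # hypothetical output name
    return [
        # Prodigal: call genes and write protein translations
        ["prodigal", "-i", fasta, "-a", proteins,
         "-o", f"{assembly_name}.gbk", "-q"],
        # HMMER: search the predicted proteins with the marker profile
        ["hmmsearch", "--tblout", hits, marker_hmm, proteins],
    ]
```

Because each call depends only on one FTP path and one marker profile, every assembly can run as an independent job, which is what allows the work to be spread across many machines.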
This pipeline overlaps with the steps described above but serves a somewhat different purpose. With the steps above, you can search specific large-scale metagenomic projects for a marker, or simply build a nice environmental MAG database. This pipeline can search through all GenBank genomes in one go, including those from metagenomic projects.
The steps to run the pipeline are highly specific to UW-Madison's HTCondor system, specifically to running on the Center for High Throughput Computing cluster. Setup is a bit convoluted, but once it is done you can search for any marker you choose without downloading all of GenBank locally, which you may find makes the setup effort worthwhile.
- Clone this directory to get all the executables and scripts
- Follow the directions to install an Anaconda Python distribution with Prodigal, HMMER, and Biopython installed with `conda`.
- Follow the `prepare-chtc-wrapper.md` instructions, which are based on using the ChtcRun package. The metadata files have already been split up in the `metadata/splits` folder; you just have to follow the directions to configure the ChtcRun package correctly with the `shared` folder and corresponding queue directories.
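The repository already ships the split metadata files, so nothing below is required. Purely as an illustration of the idea, splitting a long list of assemblies into fixed-size batches could look like this in Python; the function name and chunk size are hypothetical and do not reflect how the files in `metadata/splits` were actually generated.

```python
def split_into_chunks(rows, chunk_size):
    """Split a list of metadata rows into consecutive batches of chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
```

The last batch is simply shorter when the number of rows is not a multiple of the chunk size.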