Skip to content

Latest commit

 

History

History
55 lines (49 loc) · 3.77 KB

databases.md

File metadata and controls

55 lines (49 loc) · 3.77 KB

New databases for 2020!

There are new databases available for 2020. Chose the right one for your needs:

  • Refseq standard: Refseq genomes assembled to "Complete Genome" or "Scaffold" quality.
    • SCG location: /oak/stanford/scg/lab_asbhatt/data/program_indices/kraken2/kraken_custom_jan2020/refseq
    • Genomes from the following categories
      • Bacteria
      • Archaea
      • Fungi
      • Human
      • Mouse
      • Plasmid
  • Refseq viral: same as above, but only viral sequences
    • SCG location: /oak/stanford/scg/lab_asbhatt/data/program_indices/kraken2/kraken_custom_jan2020/refseq_viral Genbank expanded: Genbank genomes assembled to "Complete Genome" "Scaffold" or "Chromosome" quality from the following categories. The largest and most permissive database. Note that 256G memory necessary for this database.
    • SCG location: /oak/stanford/scg/lab_asbhatt/data/program_indices/kraken2/kraken_custom_jan2020/genbank_genome_chromosome_scaffold
    • Genomes from the following categories
      • Bacteria
      • Archaea
      • Fungi
      • Human
      • Mouse
      • Plasmid This databse contains a total of 14686 organisms at the species level.
  • Genbank viral: same as above, but only viral sequences
    • /oak/stanford/scg/lab_asbhatt/data/program_indices/kraken2/kraken_custom_jan2020/genbank_viral

Be aware that Genbank contains sequences that were not high quality enough to be added to Refseq. This includes things like MAGs, environmental samples, etc. Some of those genomes are of interest to us, like surveillance projects. See this page for information on Genbank vs Refseq.

Additionally, each database has had the "crAss-like phages" clade replaced with the genomes from Guerin et al. 2018, following the clustering hierarchy from that paper. The rank and taxid of each cluster:

  • Guerin_crAss_Alpha Genus 4000100
  • Guerin_crAss_Alpha_01 Species 4000110
  • Guerin_crAss_Alpha_03 Species 4000120
  • Guerin_crAss_Alpha_04 Species 4000130
  • Guerin_crAss_Alpha_09 Species 4000140
  • Guerin_crAss_Beta Genus 4000200
  • Guerin_crAss_Beta_06 Species 4000210
  • Guerin_crAss_Gamma Genus 4000300
  • Guerin_crAss_Gamma_02 Species 4000310
  • Guerin_crAss_Gamma_05 Species 4000320
  • Guerin_crAss_Delta Genus 4000400
  • Guerin_crAss_Delta_07 Species 4000410
  • Guerin_crAss_Delta_08 Species 4000420
  • Guerin_crAss_Delta_10 Species 4000430

To see if your organism of interest is present in a database (and therefore is able to be classified in your reads), search the inspect.out file in the database folder. If this has the name or taxonomic identifier you're interested in, you're good to go! In this example we see that a crAss-like phage cluster is indeed present in the Refseq viral database:

$ grep -i Guerin_crAss_Alpha_01 /oak/stanford/scg/lab_asbhatt/data/program_indices/kraken2/kraken_custom_jan2020/refseq_viral/inspect.out 
>    0.31  227983  227983  S       4000110               Guerin_crAss_Alpha_01

Custom sequences

Don't see your favorite bug in the inspect.out file? Have a newly assembled organism you want added to the database? It's an easy process to put custom sequences into the database and can be done in under a day of processing time. Contact Ben for requests (I'll batch these to once a month or something if there are lots of requests.)

Guerin E, Shkoporov A, Stockdale SR, Clooney AG, Ryan FJ, Sutton TDS, et al. Biology and Taxonomy of crAss-like Bacteriophages, the Most Abundant Virus in the Human Gut. Cell Host & Microbe [Internet]. 2018 Oct 25 [cited 2018 Oct 25];0(0). Available from: http://www.cell.com/cell-host-microbe/abstract/S1931-3128(18)30524-9