forked from pirovc/taxsbp

TaxSBP - Taxonomic structured bin packing implementation

mikael-s/taxsbp

TaxSBP

Vitor C. Piro ([email protected])

Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] based on the NCBI Taxonomy database [2].
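To give a feel for the problem, here is a deliberately simplified sketch of taxonomy-aware packing. It is not the algorithm from [1] and not the code in TaxSBP.py: it only applies first-fit decreasing bottom-up, so that sequences are merged with close taxonomic relatives before being mixed with distant ones. The function name `pack_by_taxonomy`, its parameters, and the toy taxonomy are all hypothetical.

```python
from collections import defaultdict


def pack_by_taxonomy(parents, seq_taxid, seq_len, bin_len):
    """Greedy bottom-up packing (illustrative heuristic, not the method of [1])."""
    children = defaultdict(list)
    root = None
    for node, parent in parents.items():
        if node == parent:
            root = node  # the root is its own parent, as in nodes.dmp
        else:
            children[parent].append(node)

    # Sequences assigned directly to each taxonomic node.
    seqs_at = defaultdict(list)
    for seq, taxid in seq_taxid.items():
        seqs_at[taxid].append(seq)

    def first_fit_decreasing(groups):
        # Merge (size, members) groups into bins of at most bin_len.
        # A group larger than bin_len simply ends up alone in its own bin.
        bins = []
        for size, members in sorted(groups, key=lambda g: -g[0]):
            for b in bins:
                if b[0] + size <= bin_len:
                    b[0] += size
                    b[1].extend(members)
                    break
            else:
                bins.append([size, list(members)])
        return [(size, members) for size, members in bins]

    def visit(node):
        # Pack everything below a node before handing the result to its parent.
        groups = [(seq_len[s], [s]) for s in seqs_at[node]]
        for child in children[node]:
            groups.extend(visit(child))
        return first_fit_decreasing(groups)

    return visit(root)


# Toy taxonomy: 1 is the root; 2 and 3 are its children; 4 and 5 are children of 2.
parents = {"1": "1", "2": "1", "3": "1", "4": "2", "5": "2"}
seq_taxid = {"a": "4", "b": "5", "c": "3"}
seq_len = {"a": 3, "b": 3, "c": 3}
bins = pack_by_taxonomy(parents, seq_taxid, seq_len, bin_len=6)
print(bins)  # [(6, ['a', 'b']), (3, ['c'])]
```

In the toy run, `a` and `b` share the parent node 2 and are packed together, while `c` sits in a different subtree and stays in its own bin because adding it would exceed the bin length.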

Dependencies:

Input:

  • nodes.dmp and a file with sequence information (identifier, length and taxonomic assignment)

nodes.dmp:

wget -qO- ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz | tar xfz - nodes.dmp
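For orientation, nodes.dmp stores one taxonomic node per line, with fields separated by a tab, a pipe and a tab; the first three fields are the node's tax id, its parent's tax id and its rank. The sketch below reads those fields into dictionaries. The helper names `parse_nodes` and `lineage` are hypothetical and are not taken from TaxSBP.py, and the sample lines are shortened to the first few fields.

```python
def parse_nodes(lines):
    """Return (parents, ranks) dictionaries keyed by tax id (as strings)."""
    parents, ranks = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t|\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        taxid, parent, rank = fields[0], fields[1], fields[2]
        parents[taxid] = parent
        ranks[taxid] = rank
    return parents, ranks


def lineage(taxid, parents):
    """Walk from a node up to the root, which is listed as its own parent."""
    path = [taxid]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path


# Shortened sample lines in the nodes.dmp layout (real lines carry more fields).
sample = [
    "1\t|\t1\t|\tno rank\t|\t\t|\n",
    "131567\t|\t1\t|\tno rank\t|\t\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\t\t|\n",
]
parents, ranks = parse_nodes(sample)
print(lineage("2", parents))  # ['2', '131567', '1']
```

To run it on the downloaded file, pass an open file handle instead of the sample list, for example `parse_nodes(open("nodes.dmp"))`.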

Sequence information (from NCBI RefSeq/GenBank):

# Bacteria - RefSeq - Complete Genomes (download the assembly reports, appending the taxonomic assignment and the report file name to each line)
wget -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt | tail -n+3 |
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{url_count=split($20,url,"/"); print $6"\t"$20"/"url[url_count] "_assembly_report.txt"}' |
parallel -j 24 --colsep "\t" 'wget -qO- {2} | grep "^[^#]" | tr -d "\r" | sed -e s/$/\\t{1}\\t`basename {2}`/' > refseq_bac_cg_ar.txt 2> refseq_bac_cg_ar.err
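The resulting refseq_bac_cg_ar.txt has one line per sequence: the columns of the NCBI assembly report followed by the two columns appended by the `sed` step (tax id and report file name). The sketch below pulls out identifier, length and tax id under the assumption that the assembly report has its usual ten columns, with the RefSeq accession in column 7 and the sequence length in column 9. The column positions, the function name `parse_sequence_info`, and the placeholder accession in the sample line are assumptions, and which columns TaxSBP.py itself reads is not stated here.

```python
def parse_sequence_info(lines, acc_col=6, len_col=8, taxid_col=10):
    """Return a list of (accession, length, taxid) tuples.

    Column indexes are 0-based and are assumptions about the file layout.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= taxid_col:
            continue  # skip short or empty lines
        records.append((fields[acc_col], int(fields[len_col]), fields[taxid_col]))
    return records


# One made-up line in the assumed layout (placeholder accessions, not real data).
sample = [
    "ANONYMOUS\tassembled-molecule\tna\tChromosome\tCP000000.1\t=\t"
    "NC_000000.1\tPrimary Assembly\t1000\tna\t562\t"
    "GCF_000000000.1_ASM0v1_assembly_report.txt\n"
]
records = parse_sequence_info(sample)
print(records)  # [('NC_000000.1', 1000, '562')]
```

Checking a few parsed records against the raw file is a cheap way to confirm the column assumptions before running anything on the full data set.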

Output:

  • A tab-separated file with sequence identifier and bin id
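A quick way to inspect the result is to group the sequence identifiers by bin. The sketch below assumes two tab-separated columns in the order given above (sequence identifier, then bin id) and no header line; the function name `group_by_bin` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_bin(lines):
    """Map each bin id to the list of sequence identifiers assigned to it."""
    bins = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        seq_id, bin_id = line.split("\t")[:2]
        bins[bin_id].append(seq_id)
    return dict(bins)


# Made-up output lines in the assumed two-column layout.
sample = ["seqA\t0\n", "seqB\t0\n", "seqC\t1\n"]
grouped = group_by_bin(sample)
print(grouped)  # {'0': ['seqA', 'seqB'], '1': ['seqC']}
```

Passing an open file handle on the TaxSBP output instead of `sample` gives the same mapping for a real run.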

Running:

python3 TaxSBP.py -a refseq_bac_cg_ar.txt -n nodes.dmp -s 2 -b 50

References:

[1] Codenotti, B., De Marco, G., Leoncini, M., Montangero, M., & Santini, M. (2004). Approximation algorithms for a hierarchically structured bin packing problem. Information Processing Letters, 89(5), 215–221. http://doi.org/10.1016/j.ipl.2003.12.001

[2] Federhen, S. (2012). The NCBI Taxonomy database. Nucleic Acids Research, 40(D1), D136–D143. http://doi.org/10.1093/nar/gkr1178
