Releases: iquasere/reCOGnizer
Increase maximum SMPs per database
Set option -max_smp_vol 1000000 for the makeprofiledb command.
Context: the blast package had an update, and the makeprofiledb tool now outputs one database per 1000 HMM profiles by default.
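As a minimal sketch of how the option above might be passed from Python, the following builds (but does not run) a makeprofiledb argument list. The helper name and the file names `cdd.pn` and `cdd_db` are hypothetical, and this is not necessarily how reCOGnizer assembles the call.

```python
def makeprofiledb_command(smp_list, out_prefix, max_smp_vol=1000000):
    """Build the argument list for a makeprofiledb call (illustrative helper)."""
    return [
        "makeprofiledb",
        "-in", smp_list,            # file listing the SMP profiles
        "-out", out_prefix,         # prefix of the database to create
        # Raise the per-volume limit so the profiles stay in one database.
        "-max_smp_vol", str(max_smp_vol),
    ]


# Hypothetical file names, for illustration only.
cmd = makeprofiledb_command("cdd.pn", "cdd_db")
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed.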
Fix on COG2KO
COG2KO conversion is blocked for now, so that reCOGnizer finishes its workflow.
Major improvements on reporting results
Columns have been standardized to have the same names, regardless of database
For example, COG functional category and cog columns renamed to functional category and DB ID, respectively
Helps to provide a simpler report, with far fewer NA values
Databases now inputted as comma-separated values
No problem when using one or all default databases (without specifying values), but breaks backwards compatibility, and so version was upped to 1.8.
Also some miscellaneous fixes
Prohibited creating kronas when there is no annotation for the respective database (COG or KOG)
Removed Biopython as dependency
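The two reporting changes in this release (standardized column names and comma-separated databases) can be sketched as below. This is an illustration under assumptions, not reCOGnizer's own code: the function names are hypothetical, the rename map holds only the two renames named above, and the default database list passed in is a placeholder.

```python
# Only the two renames mentioned in the notes; the real mapping may be longer.
COLUMN_RENAMES = {
    "COG functional category": "functional category",
    "cog": "DB ID",
}


def standardize_columns(row):
    """Rename database-specific column names to the shared ones."""
    return {COLUMN_RENAMES.get(name, name): value for name, value in row.items()}


def parse_databases(value, defaults):
    """Split a comma-separated database string; fall back to the defaults."""
    if not value:
        return list(defaults)
    return [db.strip() for db in value.split(",") if db.strip()]


row = standardize_columns({"cog": "COG0001", "COG functional category": "H"})
dbs = parse_databases("CDD, Smart", defaults=["CDD", "Smart", "COG", "KOG"])
```

Leaving the value empty keeps the old behaviour of using all defaults, which is why only explicit database selections are affected by the format change.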
Intermediates now removed
Files in the asn, blast, and rpsbproc directories are again removed.
Fixes in versions
So reCOGnizer can be integrated easily with other tools, versions for krona and Biopython were relaxed.
Because of a previous bug in blast 2.11, the version of blast was set to >=2.12.
BLAST version relaxed
Any blast version can now be used, as newer releases fix the bug that prevented using them with reCOGnizer
EC numbers obtained from CDD and Smart
EC numbers are now obtained from parsing database descriptions of CDD and Smart.
For Smart, all EC numbers are obtained, as they always refer to the domain described.
In the case of CDD, only EC numbers in the form "(EC:X.X.X.X)" are obtained, as many more EC numbers are referenced in other formats; those refer to other proteins in the same domain family, not to the domain in question.
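The parsing rule described above can be sketched with two regular expressions. The patterns, the function name, and the example description are assumptions made for illustration; the actual expressions used by reCOGnizer may differ.

```python
import re

# Assumed shape of an EC number: four dot-separated fields, digits or "-".
EC = r"\d+(?:\.(?:\d+|-)){3}"
# CDD: accept only the strict "(EC:X.X.X.X)" form.
CDD_EC = re.compile(r"\(EC:(" + EC + r")\)")
# Smart: accept any "EC:" or "EC " mention.
SMART_EC = re.compile(r"EC[:\s]\s*(" + EC + r")")


def ec_numbers(description, database):
    """Return the EC numbers found in a database description."""
    pattern = CDD_EC if database == "CDD" else SMART_EC
    return pattern.findall(description)


# Invented description, not taken from either database.
text = "Alcohol dehydrogenase (EC:1.1.1.1); related to EC 2.7.1.1 kinases"
print(ec_numbers(text, "CDD"))    # ['1.1.1.1']
print(ec_numbers(text, "Smart"))  # ['1.1.1.1', '2.7.1.1']
```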
A working Continuous Integration
Added mini cdd.tar.gz with only some HMMs for all databases
New parameter of reCOGnizer, --skip-downloaded, mainly for CI: if set, files already downloaded will be skipped, no longer asking for the files one at a time
Also simplified some intermediate tasks
- "Organize COGs to each tax ID" is now limited to when taxonomy is relevant
- cog2ko downloads are simplified: silenced with the -q parameter of wget
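One way to picture the --skip-downloaded behaviour described above is a filter over the expected resource files. This is a sketch under assumptions: the function name is hypothetical, and treating "already downloaded" as "a file with that name exists in the resources directory" is a guess at the rule.

```python
import os


def files_to_download(file_names, directory, skip_downloaded=False):
    """Return the resource files that still need to be fetched."""
    if not skip_downloaded:
        # Without the flag every file is a candidate (the user may be asked).
        return list(file_names)
    # With the flag, files already present are silently skipped.
    return [
        name for name in file_names
        if not os.path.isfile(os.path.join(directory, name))
    ]
```

In a CI job this avoids interactive prompts, since nothing needs to be confirmed for files that are already in place.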
Removal of artifacts and bug fixes
Removal of artifacts
Now removes CDD tarball
Now removes all helper directories: fasta, asn, blast, rpsbproc and tmp
Integrated cog2ec.py code
Bug fixes
Fix on pointing to the directory where SMPs are now stored
Fix on only reporting time in hours, minutes and seconds: now also reports days
Removed redundant asking for resources download
Also changed default of --max-target-seqs from 1 to 20
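The elapsed-time fix listed among the bug fixes above amounts to carrying the overflow of hours into days. A minimal sketch follows; the function name and the output format are assumptions, not reCOGnizer's exact report string.

```python
def human_time(seconds):
    """Format a duration, adding a day count once it exceeds 24 hours."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    clock = f"{hours:02d}h{minutes:02d}m{secs:02d}s"
    return f"{days}d{clock}" if days else clock


print(human_time(3661))   # 01h01m01s
print(human_time(90061))  # 1d01h01m01s
```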
Now downloads RPSBPROC files
reCOGnizer now downloads the following files to --resources_directory:
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/bitscore_specific.txt
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddannot.dat.gz
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddannot_generic.dat.gz
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/cdtrack.txt
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/family_superfamily_links
and gunzips the archives
This fixes #4
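The download step above could look roughly like the following, using only the Python standard library. The function names are hypothetical and reCOGnizer may well fetch the files differently (for example with wget); the URLs are the ones listed in the notes.

```python
import gzip
import os
import shutil
import urllib.request

CDD_FTP = "https://ftp.ncbi.nih.gov/pub/mmdb/cdd"
RPSBPROC_FILES = [
    "bitscore_specific.txt", "cddannot.dat.gz", "cddannot_generic.dat.gz",
    "cddid.tbl.gz", "cdtrack.txt", "family_superfamily_links",
]


def gunzip(path):
    """Decompress a .gz file next to itself and delete the archive."""
    target = path[:-3]  # strip the ".gz" suffix
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)  # mirrors the default behaviour of the gunzip tool
    return target


def download_rpsbproc_files(resources_directory):
    """Fetch the RPSBPROC files and decompress the gzipped ones."""
    os.makedirs(resources_directory, exist_ok=True)
    for name in RPSBPROC_FILES:
        local = os.path.join(resources_directory, name)
        urllib.request.urlretrieve(f"{CDD_FTP}/{name}", local)
        if name.endswith(".gz"):
            gunzip(local)
```

Calling `download_rpsbproc_files("resources")` would need network access; `gunzip` can be tried on any local `.gz` file.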
Implemented COG taxonomic workflow
COG annotation can now follow an alternative workflow based on taxonomy.
- if --tax-file is inputted and --species-taxids is set
- --species-taxids is a new parameter, just for this
- SMPs will each be its own database
- Tax ID to list of COGs is estimated from NOG.members.tsv
- If a Tax ID from tax file is present in tax ID to COG, those COGs will be used as reference database
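The taxonomy-based selection above can be sketched as a lookup from tax ID to COGs. Everything here is illustrative: the function names are hypothetical, and the input is assumed to be already parsed into (COG, protein IDs) pairs with protein IDs of the form `taxid.protein`, which is an assumption about how NOG.members.tsv is laid out.

```python
from collections import defaultdict


def taxid_to_cogs(members):
    """Map each tax ID to the set of COGs that contain one of its proteins.

    members: iterable of (cog_id, protein_ids) pairs, where each protein ID
    is assumed to look like "taxid.protein".
    """
    mapping = defaultdict(set)
    for cog, proteins in members:
        for protein in proteins:
            taxid = protein.split(".", 1)[0]
            mapping[taxid].add(cog)
    return mapping


def reference_cogs(taxid, mapping):
    """COGs to use as reference database for a tax ID (empty if unknown)."""
    return sorted(mapping.get(taxid, ()))


# Invented members, for illustration only.
members = [("COG0001", ["9606.P1", "562.P2"]), ("COG0002", ["562.P3"])]
mapping = taxid_to_cogs(members)
print(reference_cogs("562", mapping))  # ['COG0001', 'COG0002']
```

An empty result signals that the tax ID is absent from the mapping, in which case the caller would have to fall back to another workflow.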