-
Notifications
You must be signed in to change notification settings - Fork 437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MMseqs2 clustering and taxonomy #6574
base: main
Are you sure you want to change the base?
Conversation
Merge mmseqs2_DM branch into mmseqs2 branch
…nput/output paramaters that are not detected
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* && | ||
mmseqs createtaxdb | ||
'$createtaxdb.database_type.mmseqs2_db_select.fields.path' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try:
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* && | |
mmseqs createtaxdb | |
'$createtaxdb.database_type.mmseqs2_db_select.fields.path' | |
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* && | |
ln -s '$createtaxdb.database_type.mmseqs2_db_select.fields.path' taxdb && | |
mmseqs createtaxdb | |
'taxdb' |
If needed add an extension to taxdb
.
It seems that in the end this script
is executed and I guess that the trick should work.
Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.
Of course the option to allow users to provide their own mapping needs to be preserved.
Wondering if createdb
and createtaxdb
should be separate tools / data managers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.
I don't know enough about this to be able to answer your question.
Wondering if createdb and createtaxdb should be separate tools / data managers.
We've already thought about it, but we thought it would be more interesting to be able to perform an end-to-end taxonomy analysis (fasta file to contig taxonomy) with a single galaxy module. But it can be discussed if you think it might be useful for other uses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the main question is if the output of createdb
and createtaxdb
can be used by multiple tools / reused later.
There are also disadvantages wrt reproducibility when the current DB is used. But of course it also has disadvantages.
My intuition tells me to split it. Also because in jobs that mainly download data need to be handled differently on compute systems. This is not possible if compute and download tasks are mixed in a tool.
But this is only my intuition and I leave this to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first quick review of the tools. Thanks already for the massive amount of work.
]]></command> | ||
<inputs> | ||
<section name="createdb" title="Convert FASTA/Q file(s) to MMseqs sequence DB format" expanded="true"> | ||
<param name="input_fasta" type="data" format="fasta,fasta.gz" label="Input fasta file" help="" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Section title suggests that FASTQ is allowed, but it's not listed in the format
.
</param> | ||
</section> | ||
<section name="filtertaxseqdb" title="Filter taxonomy sequence database"> | ||
<conditional name="filtertaxseqdb_bool"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that you need this conditional. You can just check if the taxon_list
valiable is None
. Also, what would happen if you say "yes" and do not provide a taxon list?
<section name="taxonomy" title="Taxonomy assignment by computing the lowest common ancestor of homologs"> | ||
<conditional name="alph_type"> | ||
<param name="type" type="select" label="Alphabet type" help="" > | ||
<option value="amino_acid" selected="true">Amino acid</option> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be several places where you select AA / NT. Are these choices related?
<option value="nucleotide">Nucleotide</option> | ||
</param> | ||
<when value="amino_acid"> | ||
<param name="alph_size_amino_acid" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always use argument
instead of name
(or in addition) where possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are alphabet sizes really of interest to users?
<param name="kmer_per_seq_scale" type="float" min="0" value="0.000" label="Scale k-mer per sequence based on sequence length" help=""/> | ||
</when> | ||
<when value="nucleotide"> | ||
<param name="alph_size_nucleotide" type="integer" min="2" max="21" value="21" label="Alphabet size" help=""/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the defaults for the alphabet size mixed up. 21 should be for AA and 5 (or 4?) for NT?
<assert_contents> | ||
<has_line line="MYSTERY.13	MYSTERY.13"/> | ||
<has_n_columns n="2"/> | ||
<has_size value="113000" delta="50000"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a big delta? Maybe also/instead use has_n_lines
?
--taxon-list '$filtertaxseqdb.filtertaxseqdb_bool.use_filter.taxon_list' && | ||
#end if | ||
#end if | ||
chmod -Rv 766 '$createtaxdb.database_type.mmseqs2_db_select.fields.path'_taxonomy && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must not (and should not be able to) write to anything outside of the job working dir (+anything that is referred to by variables).
</conditional> | ||
</section> | ||
<section name="taxonomy" title="Taxonomy assignment by computing the lowest common ancestor of homologs"> | ||
<conditional name="alph_type"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multiple tools share parameters then often macros are a good idea.
</when> | ||
<when value="nucleotide"> | ||
<param name="alph_size_nucleotide" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/> | ||
<param name="zdrop" type="integer" min="0" value="40" label="Maximal allowed difference between score values before alignment is truncated" help=""/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the label I'm wondring if this is applicable to nt sequences only?
<param argument="--reverse-frames" type="text" value="1,2,3" label="Comma-separated list of frames on the reverse strand to be extracted" help=""/> | ||
<param argument="--translation-table" type="select" label="Translation table" help=""> | ||
<option value="1" selected="true">Canonical</option> | ||
<option value="2">Vert_mitochondrial</option> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe spell out: vertebrate mitochondrial .. also some other options. Maybe use these names: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
**References** | ||
- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017) | ||
- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, btab184 (2021) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its better to have references only in the citations
list.
It offers an efficient clustering workflow, scaling linearly with input size. Similar to easy-cluster, but more suitable for handling very large datasets efficiently. | ||
By Martin Steinegger <[email protected]> & Milot Mirdita <[email protected]> & Florian Breitwieser <[email protected]> & Eli Levy Karin <[email protected]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the citations are sufficient and this can be removed. Maybe link the source code repo instead. Opening issues is usually better than writing mails.
Ensure that the _taxonomy file is not written to the test data section Co-authored-by: M Bernt <[email protected]>
FOR CONTRIBUTOR: