Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MMseqs2 clustering and taxonomy #6574

Open
wants to merge 53 commits into
base: main
Choose a base branch
from

Conversation

hugolefeuvre
Copy link
Contributor

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

Comment on lines 19 to 21
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
mmseqs createtaxdb
'$createtaxdb.database_type.mmseqs2_db_select.fields.path'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try:

Suggested change
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
mmseqs createtaxdb
'$createtaxdb.database_type.mmseqs2_db_select.fields.path'
ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
ln -s '$createtaxdb.database_type.mmseqs2_db_select.fields.path' taxdb &&
mmseqs createtaxdb
'taxdb'

If needed add an extension to taxdb.

It seems that in the end this script
is executed and I guess that the trick should work.

Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.

Of course the option to allow users to provide their own mapping needs to be preserved.

Wondering if createdb and createtaxdb should be separate tools / data managers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wondering if we should download all the databases with every job or if we should create (or reuse existing reference data)? At least the ncbi taxonomy dump files should already be handled by a another data manager. Not sure about the mappings to uniprot.

I don't know enough about this to be able to answer your question.

Wondering if createdb and createtaxdb should be separate tools / data managers.

We've already thought about it, but we thought it would be more interesting to be able to perform an end-to-end taxonomy analysis (fasta file to contig taxonomy) with a single galaxy module. But it can be discussed if you think it might be useful for other uses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the main question is if the output of createdb and createtaxdb can be used by multiple tools / reused later.

There are also disadvantages wrt reproducibility when the current DB is used. But of course it also has disadvantages.

My intuition tells me to split it. Also because in jobs that mainly download data need to be handled differently on compute systems. This is not possible if compute and download tasks are mixed in a tool.

But this is only my intuition and I leave this to you.

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first quick review of the tools. Thanks already for the massive amount of work.

]]></command>
<inputs>
<section name="createdb" title="Convert FASTA/Q file(s) to MMseqs sequence DB format" expanded="true">
<param name="input_fasta" type="data" format="fasta,fasta.gz" label="Input fasta file" help="" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Section title suggests that FASTQ is allowed, but it's not listed in the format.

</param>
</section>
<section name="filtertaxseqdb" title="Filter taxonomy sequence database">
<conditional name="filtertaxseqdb_bool">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that you need this conditional. You can just check if the taxon_list valiable is None. Also, what would happen if you say "yes" and do not provide a taxon list?

<section name="taxonomy" title="Taxonomy assignment by computing the lowest common ancestor of homologs">
<conditional name="alph_type">
<param name="type" type="select" label="Alphabet type" help="" >
<option value="amino_acid" selected="true">Amino acid</option>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be several places where you select AA / NT. Are these choices related?

<option value="nucleotide">Nucleotide</option>
</param>
<when value="amino_acid">
<param name="alph_size_amino_acid" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always use argument instead of name (or in addition) where possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are alphabet sizes really of interest to users?

<param name="kmer_per_seq_scale" type="float" min="0" value="0.000" label="Scale k-mer per sequence based on sequence length" help=""/>
</when>
<when value="nucleotide">
<param name="alph_size_nucleotide" type="integer" min="2" max="21" value="21" label="Alphabet size" help=""/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the defaults for the alphabet size mixed up. 21 should be for AA and 5 (or 4?) for NT?

<assert_contents>
<has_line line="MYSTERY.13&#009;MYSTERY.13"/>
<has_n_columns n="2"/>
<has_size value="113000" delta="50000"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a big delta? Maybe also/instead use has_n_lines?

--taxon-list '$filtertaxseqdb.filtertaxseqdb_bool.use_filter.taxon_list' &&
#end if
#end if
chmod -Rv 766 '$createtaxdb.database_type.mmseqs2_db_select.fields.path'_taxonomy &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must not (and should not be able to) write to anything outside of the job working dir (+anything that is referred to by variables).

</conditional>
</section>
<section name="taxonomy" title="Taxonomy assignment by computing the lowest common ancestor of homologs">
<conditional name="alph_type">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If multiple tools share parameters then often macros are a good idea.

</when>
<when value="nucleotide">
<param name="alph_size_nucleotide" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/>
<param name="zdrop" type="integer" min="0" value="40" label="Maximal allowed difference between score values before alignment is truncated" help=""/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the label I'm wondring if this is applicable to nt sequences only?

<param argument="--reverse-frames" type="text" value="1,2,3" label="Comma-separated list of frames on the reverse strand to be extracted" help=""/>
<param argument="--translation-table" type="select" label="Translation table" help="">
<option value="1" selected="true">Canonical</option>
<option value="2">Vert_mitochondrial</option>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe spell out: vertebrate mitochondrial .. also some other options. Maybe use these names: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

Comment on lines +294 to +297
**References**
- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, btab184 (2021)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its better to have references only in the citations list.

It offers an efficient clustering workflow, scaling linearly with input size. Similar to easy-cluster, but more suitable for handling very large datasets efficiently.
By Martin Steinegger <[email protected]> & Milot Mirdita <[email protected]> & Florian Breitwieser <[email protected]> & Eli Levy Karin <[email protected]>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the citations are sufficient and this can be removed. Maybe link the source code repo instead. Opening issues is usually better than writing mails.

Ensure that the _taxonomy file is not written to the test data section

Co-authored-by: M Bernt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants