Add MMseqs2 clustering and taxonomy #6574

hugolefeuvre · 2024-11-19T14:28:13Z

FOR CONTRIBUTOR:

I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
License permits unrestricted use (educational + commercial)
This PR adds a new tool or tool collection
This PR updates an existing tool or tool collection
This PR does something else (explain below)


bernt-matthias · 2024-11-20T11:14:52Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
+mmseqs createtaxdb
+    '$createtaxdb.database_type.mmseqs2_db_select.fields.path'


Can you try:

Suggested change

-ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
-mmseqs createtaxdb
-    '$createtaxdb.database_type.mmseqs2_db_select.fields.path'
+ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
+ln -s '$createtaxdb.database_type.mmseqs2_db_select.fields.path' taxdb &&
+mmseqs createtaxdb
+    'taxdb'

If needed, add an extension to taxdb.

It seems that in the end this script is executed, so I guess the trick should work.

Also wondering if we should download all the databases with every job or if we should create (or reuse existing) reference data? At least the NCBI taxonomy dump files should already be handled by another data manager. Not sure about the mappings to UniProt.

Of course the option to allow users to provide their own mapping needs to be preserved.

Wondering if createdb and createtaxdb should be separate tools / data managers.

hugolefeuvre · 2024-11-20

> Also wondering if we should download all the databases with every job or if we should create (or reuse existing) reference data? At least the NCBI taxonomy dump files should already be handled by another data manager. Not sure about the mappings to UniProt.

I don't know enough about this to be able to answer your question.

> Wondering if createdb and createtaxdb should be separate tools / data managers.

We had already considered it, but we thought it would be more interesting to be able to run an end-to-end taxonomy analysis (FASTA file to contig taxonomy) with a single Galaxy module. We can discuss it if you think separate tools would be useful for other purposes.

bernt-matthias · 2024-11-20

I guess the main question is if the output of createdb and createtaxdb can be used by multiple tools / reused later.

There are also disadvantages wrt reproducibility when the current DB is used. But of course it also has advantages.

My intuition tells me to split it, also because jobs that mainly download data need to be handled differently on compute systems. This is not possible if compute and download tasks are mixed in a tool.

But this is only my intuition and I leave this to you.

bernt-matthias

Did a first quick review of the tools. Thanks already for the massive amount of work.

bernt-matthias · 2024-11-20T11:17:22Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+    ]]></command>
+    <inputs>
+        <section name="createdb" title="Convert FASTA/Q file(s) to MMseqs sequence DB format"  expanded="true">
+            <param name="input_fasta" type="data" format="fasta,fasta.gz" label="Input fasta file" help="" />


Section title suggests that FASTQ is allowed, but it's not listed in the format.

bernt-matthias · 2024-11-20T11:20:35Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+            </param>
+        </section>
+        <section name="filtertaxseqdb" title="Filter taxonomy sequence database">
+            <conditional name="filtertaxseqdb_bool">


I do not think that you need this conditional. You can just check if the taxon_list variable is None. Also, what would happen if you say "yes" and do not provide a taxon list?
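
A minimal sketch of what dropping the conditional could look like. It assumes taxon_list is a text parameter (its type is not visible in this diff) and that the filtering step is `mmseqs filtertaxseqdb`; the database names `taxdb` and `filtered_taxdb` are hypothetical placeholders, not paths from the PR.

```xml
<!-- Sketch only: parameter type, label, help text and database names are assumptions -->
<command><![CDATA[
#if $filtertaxseqdb.taxon_list
    ## runs only when the user actually entered a taxon list
    mmseqs filtertaxseqdb
        'taxdb'
        'filtered_taxdb'
        --taxon-list '$filtertaxseqdb.taxon_list' &&
#end if
]]></command>
<inputs>
    <section name="filtertaxseqdb" title="Filter taxonomy sequence database">
        <!-- optional="true": an empty field means "do not filter", so no yes/no select is needed -->
        <param argument="--taxon-list" type="text" optional="true" label="Taxon IDs to keep" help="Leave empty to skip filtering"/>
    </section>
</inputs>
```

With this layout the "yes without a taxon list" case cannot occur, because there is no separate switch to set.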

bernt-matthias · 2024-11-20T11:29:22Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+        <section name="taxonomy" title="Taxonomy assignment by computing the lowest common ancestor of homologs">
+            <conditional name="alph_type">
+                <param name="type" type="select" label="Alphabet type" help="" >
+                    <option value="amino_acid" selected="true">Amino acid</option>


There seem to be several places where you select AA / NT. Are these choices related?

bernt-matthias · 2024-11-20T11:31:03Z

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

+                <option value="nucleotide">Nucleotide</option>
+            </param>
+            <when value="amino_acid">
+                <param name="alph_size_amino_acid" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/>


Always use argument instead of name (or in addition to it) where possible.

Are alphabet sizes really of interest to users?
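
For illustration, the parameter from the diff above could be written with argument as sketched below. This assumes the wrapped MMseqs2 flag is `--alph-size`; the bounds and default are copied unchanged from the diff.

```xml
<when value="amino_acid">
    <!-- Galaxy derives the parameter name (alph_size) from the argument
         and displays the flag next to the help text -->
    <param argument="--alph-size" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/>
</when>
```

Note that the derived name changes from alph_size_amino_acid to alph_size, so references in the command block would need to be updated; name can still be given in addition if the old name should be kept.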

bernt-matthias · 2024-11-20T11:32:30Z

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

+                <param name="kmer_per_seq_scale" type="float" min="0" value="0.000" label="Scale k-mer per sequence based on sequence length" help=""/>
+            </when>
+            <when value="nucleotide">
+                <param name="alph_size_nucleotide" type="integer" min="2" max="21" value="21" label="Alphabet size" help=""/>


Are the defaults for the alphabet size mixed up? 21 should be for AA and 5 (or 4?) for NT.

bernt-matthias · 2024-11-20T11:42:48Z

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

+                <assert_contents>
+                    <has_line line="MYSTERY.13&#009;MYSTERY.13"/>
+                    <has_n_columns n="2"/>
+                    <has_size value="113000" delta="50000"/>


Quite a big delta? Maybe also/instead use has_n_lines?
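
Something along these lines, as a sketch. The values of n and delta are placeholders, since the expected line count is not stated in this thread.

```xml
<assert_contents>
    <has_line line="MYSTERY.13&#009;MYSTERY.13"/>
    <has_n_columns n="2"/>
    <!-- placeholder values: replace n with the line count of the expected output -->
    <has_n_lines n="1000" delta="10"/>
</assert_contents>
```

A line count with a small delta is usually a tighter check than a file size that may vary by almost half.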

bernt-matthias · 2024-11-20T11:45:02Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+                --taxon-list '$filtertaxseqdb.filtertaxseqdb_bool.use_filter.taxon_list' &&
+            #end if
+    #end if
+chmod -Rv 766 '$createtaxdb.database_type.mmseqs2_db_select.fields.path'_taxonomy &&


You must not (and should not be able to) write to anything outside of the job working dir (+anything that is referred to by variables).

bernt-matthias · 2024-11-20T11:46:52Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+            </conditional>
+        </section>
+        <section name="taxonomy" title="Taxonomy assignment by computing the lowest common ancestor of homologs">
+            <conditional name="alph_type">


If multiple tools share parameters then often macros are a good idea.
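
A rough sketch of how the shared conditional could be moved into a macro. The file name macros.xml and the macro name alph_type_conditional are hypothetical, and the content of the when blocks is left out.

```xml
<!-- macros.xml (hypothetical file and macro names) -->
<macros>
    <xml name="alph_type_conditional">
        <conditional name="alph_type">
            <param name="type" type="select" label="Alphabet type" help="">
                <option value="amino_acid" selected="true">Amino acid</option>
                <option value="nucleotide">Nucleotide</option>
            </param>
            <when value="amino_acid">
                <!-- amino acid specific parameters go here -->
            </when>
            <when value="nucleotide">
                <!-- nucleotide specific parameters go here -->
            </when>
        </conditional>
    </xml>
</macros>
```

Each tool would then import the file with `<macros><import>macros.xml</import></macros>` and insert the block with `<expand macro="alph_type_conditional"/>`.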

bernt-matthias · 2024-11-20T11:47:39Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+                </when>
+                <when value="nucleotide">
+                    <param name="alph_size_nucleotide" type="integer" min="2" max="5" value="5" label="Alphabet size" help=""/>
+                    <param name="zdrop" type="integer" min="0" value="40" label="Maximal allowed difference between score values before alignment is truncated" help=""/>


From the label I'm wondering if this is applicable to nt sequences only?

bernt-matthias · 2024-11-20T11:49:40Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+                <param argument="--reverse-frames" type="text" value="1,2,3" label="Comma-separated list of frames on the reverse strand to be extracted" help=""/>
+                <param argument="--translation-table" type="select" label="Translation table" help="">
+                    <option value="1" selected="true">Canonical</option>
+                    <option value="2">Vert_mitochondrial</option>


Maybe spell out: vertebrate mitochondrial .. also some other options. Maybe use these names: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
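
For example, the first two options could look like the sketch below. The labels follow the names on the NCBI genetic code page linked above; prefixing the table number is just a suggestion.

```xml
<param argument="--translation-table" type="select" label="Translation table" help="">
    <!-- labels follow the NCBI genetic code names; the numeric prefix is optional -->
    <option value="1" selected="true">1: Standard</option>
    <option value="2">2: Vertebrate Mitochondrial</option>
</param>
```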

bernt-matthias · 2024-11-20T11:51:52Z

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

+**References**
+
+- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
+- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, btab184 (2021)


I think it's better to have references only in the citations list.

bernt-matthias · 2024-11-20T11:53:55Z

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

+It offers an efficient clustering workflow, scaling linearly with input size. Similar to easy-cluster, but more suitable for handling very large datasets efficiently.
+
+
+By Martin Steinegger <[email protected]> & Milot Mirdita <[email protected]> & Florian Breitwieser <[email protected]> & Eli Levy Karin <[email protected]>


I think the citations are sufficient and this can be removed. Maybe link the source code repo instead. Opening issues is usually better than writing mails.


clsiguret and others added 30 commits October 15, 2024 11:04

Init mmseqs2

6467853

init DM

77d1bef

continue DM

3ad2ea4

Split to TOOL_VERSION and COMMIT

b9da697

Modify macros and json output

0ac3019

update macro

75f154d

init mmseqs2_taxonomy

fc600b6

init mmseqs2_createtaxdb

403a2d9

Change name and description

6c308bb

init mmseqs2_createdb

afd1577

init mmseqs2_createtsv

ec54759

init mmseqs2_createtsv

d48ab99

Merge branch 'mmseqs2' of github.com:clsiguret/tools-iuc into mmseqs2

71779aa

continue DM

5f584df

init taxonomyreport

0c250c5

add test files for createtsv

f96947c

Add second test with other data table

09771c7

add double quote

7f2c49e

start create_db

88da237

continue mmseqs2 DM (macros modification)

157360f

put all xml into one

410d4f4

put all xml into one

3c3d188

Merge branch 'galaxyproject:main' into mmseqs2

8fbf055

add createdb section

b82acf0

add createtaxdb and filtertaxseqdb sections

0f859b5

Update taxonomy assignement : taxonomy module prefilter options

0bd7ef7

taxonomy part : align parameters

e1299a9

taxonomy module : misc and common options

76da478

all parameters into xml

d0fac9f

finish wrapping command and start tests

279f796

hugolefeuvre and others added 22 commits October 24, 2024 17:05

Change tool name

cb26bf4

Merge branch 'galaxyproject:main' into mmseqs2

aafcdfe

Merge branch 'mmseqs2_DM' into mmseqs2

2333a5a

Merge mmseqs2_DM branch into mmseqs2 branch

add new loc.sample file and modification to pass tests

79f3410

issue with database : test dont select test database

dd5ab01

group multiple conditionnal part

cbf00be

start mmseqs2 easy-linclust wrapper

29e9015

finish mmseqs2 linclust wrapping

1d28b66

start easy-taxo wrapper, I want to compare taxonomy and easy-taxonomy

53c1550

modify easy-taxonomy : conditionnal and resolve DB issue

6649879

Merge branch 'galaxyproject:main' into mmseqs2

8bac29c

update taxo : issue with mmseqs, update with easy-taxo : issue with i…

784bb52

…nput/output paramaters that are not detected

filter kraken or krona output

5f3a7e3

start modify DM

98e72f3

modify DM path and json informations

eb4187d

wrong value

1ba4a1f

start multiple datatable management

f46798e

add nucleotide data table into param

65c07c3

Reduced database, possibility of having the 2 types of report

3efd6ab

Merge branch 'galaxyproject:main' into mmseqs2

78ba0c6

delete useless files and parameters + last tests

220cd5a

try to chmod Swiss-Prot_taxonomy because error could not open for wri…

c729ba6

…ting

bernt-matthias reviewed Nov 20, 2024


Create a symlink of the database to the job working directory

072fc28

Ensure that the _taxonomy file is not written to the test data section Co-authored-by: M Bernt <[email protected]>
