
interpretation of --maxaccepts vsearch #569

Open
Robvh-git opened this issue Aug 5, 2024 · 3 comments

Comments

@Robvh-git

Robvh-git commented Aug 5, 2024

Hello,

I've got a question regarding the argument --maxaccepts of the vsearch command --cluster_fast:

The manpage states the following about maxaccepts:

"The search process sorts target sequences
by decreasing number of k-mers they have in common with the query sequence, using
that information as a proxy for sequence similarity. After pairwise alignments, if the
first target sequence passes the acceptation criteria, it is accepted as best hit and the
search process stops for that query. If --maxaccepts is set to a higher value, more hits
are accepted
"

What exactly is meant by "If --maxaccepts is set to a higher value, more hits are accepted"?

What will happen when another hit is accepted?

I guess the target sequences are the centroids or seed sequences of the clusters in this case?

So these clusters (i.e. target sequences) are sorted by the number of k-mers they have in common with the query, which will likely reflect pairwise sequence similarity.

I can understand that if --maxaccepts 1 (the default) is specified, vsearch goes through these pairwise alignments and selects the first one that meets the criteria (e.g. 97% similarity). Then the query sequence is placed in that cluster(?)

But if e.g. --maxaccepts 2 is specified, the query sequence can be accepted in two clusters? Or how does this work?

I can imagine that the first alignment that meets the criterion is not necessarily the best one, so you would preferably check multiple accepted target sequences and select the best of them (i.e. place your query sequence in the cluster that matches best). Is that what --maxaccepts is about? In that case, I would expect a description like: "If --maxaccepts is set to a higher value, more hits are accepted and the best matching target sequence is finally selected as hit" or something like that.

@torognes
Owner

torognes commented Aug 6, 2024

Hi

Thank you for your questions. I'll try to clarify.

During clustering and many other tasks, vsearch performs heuristic searches to find similar sequences. This is done, as you describe, by first counting the number of shared k-mers (8-mers by default) between the query and each target sequence. The target sequences are then sorted by decreasing number of shared k-mers. The sequence with the highest number of shared k-mers is considered first. If this sequence has the required similarity with the query sequence in terms of percentage identity (e.g. 97%) and meets any other requirements (depending on the options used), it is "accepted". If it does not satisfy the requirements, it is "rejected". If the --maxaccepts option is set higher than 1 (the default), the next target sequence, with the next highest number of shared k-mers, will also be considered. If this sequence also meets the requirements (e.g. 97% identity), it will also be accepted. In this way more than one sequence may be accepted. When the maximum number of accepted sequences (option --maxaccepts, default 1) or rejected sequences (option --maxrejects, default 32) is reached, vsearch stops considering further target sequences for this query.
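
The accept/reject loop described above can be sketched roughly as follows. This is only an illustration in Python, not vsearch's actual code: the `Target` class, the `search` function and the precomputed `shared_kmers` and `identity` values are all hypothetical, and any special handling of option values is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    shared_kmers: int  # k-mers shared with the query (assumed precomputed)
    identity: float    # percent identity with the query (assumed precomputed)


def search(targets, min_identity=97.0, maxaccepts=1, maxrejects=32):
    """Return the accepted targets in the order they were considered."""
    accepted = []
    rejects = 0
    # Consider targets by decreasing number of shared k-mers.
    for t in sorted(targets, key=lambda t: t.shared_kmers, reverse=True):
        if t.identity >= min_identity:
            accepted.append(t)
            if len(accepted) >= maxaccepts:
                break  # enough accepted targets, stop searching
        else:
            rejects += 1
            if rejects >= maxrejects:
                break  # too many rejected targets, stop searching
    return accepted


targets = [
    Target("A", shared_kmers=50, identity=97.5),
    Target("B", shared_kmers=48, identity=99.0),
    Target("C", shared_kmers=40, identity=90.0),
]
print([t.name for t in search(targets, maxaccepts=1)])  # ['A']
print([t.name for t in search(targets, maxaccepts=2)])  # ['A', 'B']
```

In this made-up example, target B has the higher identity but fewer shared k-mers than A, so it is only reached when more than one accept is allowed.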

What happens if more than one target sequence is accepted? When clustering, the default is to sort the accepted sequences by sequence similarity and choose the target sequence, i.e. centroid, that has the highest similarity. The query sequence is then placed in that cluster. Alternatively, if the --sizeorder option is specified, the accepted centroids will be sorted by abundance, and the centroid with the highest abundance will be chosen.
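
As a rough sketch of this selection step (again illustrative Python, not the vsearch implementation), the choice among accepted centroids could look like the following. The `choose_centroid` function and the dictionary keys are hypothetical, and how ties are broken is an assumption of this sketch, not something stated above.

```python
def choose_centroid(accepted, sizeorder=False):
    """Pick the centroid the query is assigned to, or None if nothing was accepted."""
    if not accepted:
        return None  # in this sketch, the caller would start a new cluster
    if sizeorder:
        # With --sizeorder: prefer the most abundant accepted centroid.
        return max(accepted, key=lambda c: c["size"])
    # Default: prefer the accepted centroid most similar to the query.
    # Ties go to the first candidate considered (an assumption of this sketch).
    return max(accepted, key=lambda c: c["identity"])


accepted = [
    {"name": "A", "identity": 97.5, "size": 10},
    {"name": "B", "identity": 99.0, "size": 3},
]
print(choose_centroid(accepted)["name"])                  # B
print(choose_centroid(accepted, sizeorder=True)["name"])  # A
```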

When searching, not clustering, one or more of the target sequences may be reported as hits for the query, depending on the --maxhits and --top_hits_only options.

I agree that the documentation could be clearer regarding this issue. We will try to improve it for the next release.

@Robvh-git
Author

Hi @torognes,
thank you for the detailed answer; it is completely clear now.
I think it would indeed be helpful to add this info to the documentation.

@torognes
Owner

torognes commented Aug 9, 2024

Reopening the issue to remember to update the documentation.

torognes reopened this Aug 9, 2024