usearch_global search aligning to Ns with 100% identity #393

lmolokin · 2020-01-02T22:07:15Z

Seeing false full length alignments that show 100% identity to stretches of Ns.

vsearch v2.14.1_linux_x86_64

vsearch --usearch_global nano_reclust.fa \
--db blastoNCBI_120919.udb \
--userout nano_reclust.vsearch \
--userfields query+id+alnlen+qcov+target \
--output_no_hits \
--id 0.9 \
--query_cov 0.5 \
--maxhits 10 \
--maxaccepts 0 \
--maxrejects 0 \
--alnout nano_reclust.aln

alignment.txt


torognes · 2020-01-03T09:59:09Z

Thanks for reporting this. I have seen similar behaviour as well. This is related to issue #354.

Matches between or to ambiguous residues are currently counted as matches, so the output is as expected.

Matches to long stretches of N's like this are usually unwanted.

ragavishanmugam · 2021-11-09T04:33:30Z

Any updates on this? We are facing the same issue, and it is skewing our results. Is there a way to see the match score relative to the alignment length?

torognes · 2021-11-10T15:26:58Z

No, there is currently no way to see the match score. The score for matching a nucleotide vs an N is zero.

I am not sure how to handle this.

Alignments can have a negative score and still be shown, both in vsearch and usearch. The alignment score is just used to align a pair of sequences in the best possible way. Note that terminal gaps (and gap penalties) are usually not counted.

These kinds of matches with a lot of Ns can also be produced by usearch, though perhaps not exactly this one with only Ns, due to some heuristics.

To eliminate this kind of match, I think we need to add an option where ambiguous matches (those involving symbols other than ACGTU) are not counted as matches. Currently, matches between compatible symbols (e.g. A vs R, but not A vs Y) are counted as matches when computing the identity percentage.
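To make the compatibility rule concrete, here is a minimal Python sketch of how identity could be computed with and without counting ambiguous matches. It is not vsearch's code: the function names, the `count_ambiguous` flag, and the simplified identity definition (matching columns divided by all alignment columns) are assumptions for illustration only.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
UNAMBIGUOUS = set("ACGTU")


def compatible(a, b):
    """Return True if the two symbols share at least one possible base."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


def identity(qrow, trow, count_ambiguous=True):
    """Percent identity over two aligned rows of equal length ('-' marks a gap).

    count_ambiguous=True imitates the behaviour described in this thread;
    count_ambiguous=False is the hypothetical stricter option.
    """
    if len(qrow) != len(trow):
        raise ValueError("aligned rows must have the same length")
    if not qrow:
        return 0.0
    matches = 0
    for q, t in zip(qrow, trow):
        if q == "-" or t == "-":
            continue  # gap columns never count as matches
        if not compatible(q, t):
            continue
        both_plain = q.upper() in UNAMBIGUOUS and t.upper() in UNAMBIGUOUS
        if count_ambiguous or both_plain:
            matches += 1
    return 100.0 * matches / len(qrow)
```

With this sketch, `identity("ACGT", "NNNN")` gives 100.0, while `identity("ACGT", "NNNN", count_ambiguous=False)` gives 0.0, which is the distinction the proposed option would expose.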

We could also add an option to set a (negative) score for ambiguous matches.
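As a rough illustration of what a configurable score for ambiguous matches might look like, the Python sketch below sums per-column scores over an ungapped pair of sequences. The function names, the `ambiguous` parameter, and the match/mismatch values are assumptions chosen for the example, and treating every pair that involves a non-ACGTU symbol the same way is a simplification; only the zero score for a nucleotide vs N comes from this thread.

```python
CANONICAL = set("ACGT")


def pair_score(a, b, match=2, mismatch=-4, ambiguous=0):
    """Score one aligned column; U is treated as T.

    Any pair involving a symbol outside ACGTU gets the `ambiguous` score
    (0 reproduces the current nucleotide-vs-N behaviour described above).
    """
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    if a not in CANONICAL or b not in CANONICAL:
        return ambiguous
    return match if a == b else mismatch


def ungapped_score(query, target, **scoring):
    """Sum the column scores of two equal-length, gap-free sequences."""
    if len(query) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(q, t, **scoring) for q, t in zip(query, target))
```

Under these assumptions, `ungapped_score("ACGTACGT", "NNNNNNNN")` is 0, whereas `ungapped_score("ACGTACGT", "NNNNNNNN", ambiguous=-1)` is -8, so a stretch of Ns would be penalised instead of being neutral.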

ragavishanmugam · 2021-11-10T16:11:01Z

Thank you for replying. My suggestion would be to differentiate mixed bases (like A vs R) from fully generic bases (like A vs N). Being able to differentiate just the Ns would already be useful. Mixed bases can also indicate mixed populations in some cases, so their interpretation is subjective. I think the practical way to implement this would be to leave the choice to users: if users could specify which combinations count as a match, and what weight each combination carries in the matching score, it would cover all cases. Regards, Ragavi.
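One possible reading of this suggestion, sketched in Python: classify each aligned column as an exact match, a mixed-base match, an N match, or a mismatch, and let the user supply a weight per class. The class names, the `weights` dictionary, and the `weighted_identity` function are hypothetical; nothing like this exists in vsearch as far as this thread indicates.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pair_class(a, b):
    """Classify an aligned pair as 'exact', 'mixed', 'n', or 'mismatch'."""
    a, b = a.upper(), b.upper()
    bases_a, bases_b = set(IUPAC[a]), set(IUPAC[b])
    if not bases_a & bases_b:
        return "mismatch"
    if a == "N" or b == "N":
        return "n"
    if len(bases_a) > 1 or len(bases_b) > 1:
        return "mixed"
    return "exact"


def weighted_identity(qrow, trow, weights):
    """Percent identity where each pair class contributes a user-chosen weight.

    Gap columns ('-') and classes missing from `weights` contribute 0.
    """
    if len(qrow) != len(trow):
        raise ValueError("aligned rows must have the same length")
    if not qrow:
        return 0.0
    total = 0.0
    for q, t in zip(qrow, trow):
        if q == "-" or t == "-":
            continue
        total += weights.get(pair_class(q, t), 0.0)
    return 100.0 * total / len(qrow)
```

For example, with `weights = {"exact": 1.0, "mixed": 0.5, "n": 0.0}`, the call `weighted_identity("ACGT", "ACNN", weights)` returns 50.0 and `weighted_identity("AAAA", "ARAA", weights)` returns 87.5, so Ns stop inflating identity while mixed bases still earn partial credit.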

torognes self-assigned this Jan 3, 2020

torognes added the enhancement label Jan 3, 2020

ebolyen mentioned this issue Jun 22, 2022

refactored consensus classifiers qiime2/q2-feature-classifier#176

Merged
