Read orientation auto detection failure #177
Comments
thanks @wasade I think this is basically a duplicate of #103 ... orientation autodetection is done on the first 100 seqs I believe, so we see problems like this esp. if reads are in mixed orientations or there are a few noisy or non-target reads (even just a few bad seeds).
this is what is done on the first 100 reads to autodetect. It has been discussed for some time whether this should instead be done on all reads, i.e., test both orientations and pick the one that looks most reasonable. As your example shows, usually only one orientation looks reasonable and the wrong orientation is usually classified at domain level or unclassified. It would increase runtime but I would personally be in favor of this (or adding a |
Thanks, @nbokulich. Unfortunately, I will be going on paternity leave soon and my bandwidth to understand unfamiliar codebases is quite limited. What I recommend considering is using |
congrats! that makes two of us — maybe @BenKaehler would like to take up the task? I agree that |
Thanks!! And congrats to you as well! I think that sounds like a great plan
@nbokulich, @wasade - anyone interested in picking this up? I think this is also a good first issue, so I'll tag it with that for Hacktoberfest folks or other new developers.
yeah I agree, hacktoberfest
Bug Description
When classifying a 23M feature set, and separately a 20M subset, it was observed that the number of reported Archaea differed by two orders of magnitude (4k in 23M and 400k in 20M).
The behavior was observed for both Greengenes and SILVA with QIIME 2 2022.2.
On Slack, @BenKaehler kindly suggested testing the `--p-read-orientation` parameter with an individual sequence. In the below example, a single 90nt sequence from the 2017 EMP paper, originally classified to the order level within Archaea, is tested with both `same` and `reverse-complement` settings. In the reverse complement case, we observe the sequence being classified as `k__Bacteria` with high confidence.
Steps to reproduce the behavior
Expected behavior
The result of classifying an Archaea with high confidence as a Bacteria was surprising. However, the user is not presented with any indication that this may be the case. Given how fast classification occurs, and the risk that incorrect results present to a user, would it make sense to test both orientations and retain the one with higher confidence?
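To make the suggestion concrete, here is a minimal sketch of "test both orientations and retain the one with higher confidence". It is not how q2-feature-classifier is implemented: `classify` is a hypothetical callable standing in for whatever classifier is used, and it is assumed to return a `(taxonomy, confidence)` pair for a single sequence. The function names are illustrative only.

```python
from typing import Callable, Tuple

# Hypothetical classifier interface (an assumption, not a QIIME 2 API):
# takes a DNA string and returns (taxonomy, confidence).
Classifier = Callable[[str], Tuple[str, float]]

# Complement table for unambiguous bases; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_best_orientation(
    seq: str, classify: Classifier
) -> Tuple[str, float, str]:
    """Classify a read in both orientations and keep the more confident call.

    Returns (taxonomy, confidence, orientation), where orientation is
    "same" or "reverse-complement". Ties are resolved in favour of "same".
    """
    same_taxon, same_conf = classify(seq)
    rc_taxon, rc_conf = classify(reverse_complement(seq))
    if rc_conf > same_conf:
        return rc_taxon, rc_conf, "reverse-complement"
    return same_taxon, same_conf, "same"
```

This doubles the classification work per read. Confidence alone may also not be enough to pick the right orientation, since the reverse-complement call described above was itself reported with high confidence, so a real implementation might also want to consider how deep each classification goes.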
Computation Environment
References