Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read orientation auto detection failure #177

Open
wasade opened this issue Apr 15, 2022 · 6 comments
Open

Read orientation auto detection failure #177

wasade opened this issue Apr 15, 2022 · 6 comments
Labels
good first issue Good for newcomers Hacktoberfest Good fit for the Hacktoberfest challenge

Comments

@wasade
Copy link
Member

wasade commented Apr 15, 2022

Bug Description
When classifying a 23M feature set, and separately a 20M subset, it was observed that the number of reported Archaea differed by two orders of magnitude (4k in 23M and 400k in 20M).

The behavior was observed for both Greengenes and SILVA with QIIME 2 2022.2.

On Slack, @BenKaehler kindly suggested testing the --p-read-orientation parameter with an individual sequence.

In the below example, a single 90nt sequence from the 2017 EMP paper, originally classified to the order level within Archaea, is tested with both same and reverse-complement settings. In the reverse complement case, we observe the sequence being classified as k__Bacteria with high confidence.

Steps to reproduce the behavior

$ md5sum ../../gg-13-8-99-515-806-nb-classifier.qza
9e82e8969303b3a86ac941ceafeeac86  ../../gg-13-8-99-515-806-nb-classifier.qza
$ cat example.fna
>example
CACCGGCGGCTCGAGTGGTGGCCGCTATTATTGGGCTTAAAGGGTCCGTAGCCGGGCCAGTTAGTCCCTTGGGAAATCTTACGGCTTAAC
$ qiime tools import --input-path example.fna --output-path example.fna.qza --type FeatureData[Sequence]
Imported example.fna as DNASequencesDirectoryFormat to example.fna.qza
$ qiime feature-classifier classify-sklearn \
    --i-reads example.fna.qza \
    --i-classifier ../../gg-13-8-99-515-806-nb-classifier.qza \
    --o-classification example.classified.same.qza \
    --p-read-orientation same
Saved FeatureData[Taxonomy] to: example.classified.same.qza
$ qiime feature-classifier classify-sklearn \
    --i-reads example.fna.qza \
    --i-classifier ../../gg-13-8-99-515-806-nb-classifier.qza \
    --o-classification example.classified.rc.qza \
    --p-read-orientation reverse-complement
Saved FeatureData[Taxonomy] to: example.classified.rc.qza
$ qiime tools export --input-path example.classified.same.qza --output-path example.classified.same
Exported example.classified.same.qza as TSVTaxonomyDirectoryFormat to directory example.classified.same
$ qiime tools export --input-path example.classified.rc.qza --output-path example.classified.rc
Exported example.classified.rc.qza as TSVTaxonomyDirectoryFormat to directory example.classified.rc
$ cat example.classified.same/taxonomy.tsv 
Feature ID	Taxon	Confidence
example	k__Archaea; p__Euryarchaeota; c__Methanomicrobia; o__Methanocellales; f__; g__; s__	0.999986467281595
$ cat example.classified.rc
example.classified.rc/     example.classified.rc.qza  
$ cat example.classified.rc/taxonomy.tsv 
Feature ID	Taxon	Confidence
example	k__Bacteria	0.9709375640478621

Expected behavior
The result of classifying an Archaea with high confidence as a Bacteria was surprising. However, the user is not presented with an indication this may be the case. Given the how fast classification occurs, and the risk incorrect results presents to a user, would it make sense to test both orientations and retain the one with higher confidence?

Computation Environment

$ cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)
$ uname -a
Linux barnacle2 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:32:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ qiime info
System versions
Python version: 3.8.12
QIIME 2 release: 2022.2
QIIME 2 version: 2022.2.0
q2cli version: 2022.2.0

Installed plugins
alignment: 2022.2.0
composition: 2022.2.0
cutadapt: 2022.2.0
dada2: 2022.2.0
deblur: 2022.2.0
demux: 2022.2.0
diversity: 2022.2.0
diversity-lib: 2022.2.0
emperor: 2022.2.0
feature-classifier: 2022.2.0
feature-table: 2022.2.0
fragment-insertion: 2022.2.0
gneiss: 2022.2.0
longitudinal: 2022.2.0
metadata: 2022.2.0
phylogeny: 2022.2.0
quality-control: 2022.2.0
quality-filter: 2022.2.0
sample-classifier: 2022.2.0
taxa: 2022.2.0
types: 2022.2.0
vsearch: 2022.2.0

Application config directory
/home/mcdonadt/miniconda3/envs/qiime2-2022.2/var/q2cli

Getting help
To get help with QIIME 2, visit https://qiime2.org

References

  1. Slack #developers channel, April 14th
@nbokulich
Copy link
Member

thanks @wasade I think this is basically a duplicate of #103 ... orientation autodetection is done on the first 100 seqs I believe, so we see problems like this esp. if reads are in mixed orientations or there are a few noisy or non-target reads (even just a few bad seeds).

would it make sense to test both orientations and retain the one with higher confidence?

this is what is done on the first 100 reads to autodetect. It has been discussed for some time whether this should instead be done on all reads, i.e., test both orientations and pick the one that looks most reasonable. As your example shows, usually only one orientation looks reasonable and the wrong orientation is usually classified at domain level or unclassified.

It would increase runtime but I would personally be in favor of this (or adding a both orientation option, so the current autodetect on a subsample remains as an option), as it would fix a few old thorns in our side regarding autodetection and mixed orientations. Would you be interested in adding this?

@wasade
Copy link
Member Author

wasade commented Apr 15, 2022

Thanks, @nbokulich. Unfortunately, I will be going on paternity leave soon and my bandwidth to understand unfamiliar codebases is quite limited. What I recommend considering is using both by default, erring on the side of caution, as the consequence of auto detection being wrong could be bad.

@nbokulich
Copy link
Member

congrats! that makes two of us — maybe @BenKaehler would like to take up the task?

I agree that both should be the new default. But we could keep the original autodetect as an option for those who liked that behavior.

@wasade
Copy link
Member Author

wasade commented Apr 15, 2022

Thanks!! And congrats to you as well!

I think that sounds like a great plan

@gregcaporaso
Copy link
Member

@nbokulich, @wasade - anyone interested in picking this up? I think this is also a good first issue, so I'll tag it with that for Hacktoberfest folks or other new developers.

@gregcaporaso gregcaporaso added the good first issue Good for newcomers label Sep 25, 2023
@nbokulich
Copy link
Member

yeah I agree, hacktoberfest

@nbokulich nbokulich added the Hacktoberfest Good fit for the Hacktoberfest challenge label Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers Hacktoberfest Good fit for the Hacktoberfest challenge
Projects
None yet
Development

No branches or pull requests

3 participants