Most general taxon for H5N1 (102793) is missing sequences with H5N1 in sequence name #407

Closed
corneliusroemer opened this issue Oct 1, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@corneliusroemer

corneliusroemer commented Oct 1, 2024

Describe the bug

There seem to be H5N1 sequences that are not returned when querying via the most general H5N1 taxonomy.

When downloading all sequences for H5N1 taxonomy ID 102793, I get 30833 sequences with H5N1 in the name.

When downloading all influenza sequences with taxonomy ID 197911, I get 64338 sequences that have H5N1 in the name.

This is unexpected. Half of the H5N1 sequences seem to be wrongly classified.

I would expect that when querying for a taxon id, one gets all the sequences in that taxon and in its children.

To Reproduce
Steps to reproduce the behavior:

$ datasets download virus genome taxon 102793 > 102793.zip

$ unzip -p 102793.zip ncbi_dataset/data/genomic.fna | grep "H5N1" | wc -l
30833

$ datasets download virus genome taxon 197911 > 197911.zip

$ unzip -p 197911.zip ncbi_dataset/data/genomic.fna | grep "H5N1" | wc -l
64338
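
To see which records account for the difference, the two downloads can be compared by accession. This is a minimal sketch, not part of the original report: the function names and output file names are made up for illustration, and it assumes the accession is the first whitespace-delimited token of each FASTA header.

```shell
# Print sorted, unique accessions of FASTA records whose header mentions H5N1.
# Reads FASTA on stdin; assumes the accession is the first token after ">".
h5n1_accessions() {
  awk '/^>/ && /H5N1/ { sub(/^>/, "", $1); print $1 }' | sort -u
}

# Print accessions listed in file $2 but not in file $1.
# Both files must be sorted, as produced by h5n1_accessions.
missing_from() {
  comm -13 "$1" "$2"
}

# Example with the archives downloaded above (output names are placeholders):
# unzip -p 102793.zip ncbi_dataset/data/genomic.fna | h5n1_accessions > in_102793.txt
# unzip -p 197911.zip ncbi_dataset/data/genomic.fna | h5n1_accessions > in_197911.txt
# missing_from in_102793.txt in_197911.txt > missing_from_102793.txt
```

Matching only header lines also guards against counting anything other than sequence names.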
@anna-parker

Hi! I also came across this. In NCBI Virus, I can filter influenza A for genotype H5N1, and this returns over 60k sequences: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Alphainfluenzavirus,%20taxid:197911&Serotype_s=H5N1

Whereas when searching for the taxon ID of the H5N1 subtype (102793), I only get approx. 30k sequences.

@ericcox1
Collaborator

ericcox1 commented Oct 1, 2024

Hi @corneliusroemer and @anna-parker,

Thanks for opening this issue.

The reason for the discrepancy in counts you observed is that influenza sequences are no longer submitted at the serotype level.

Currently, the best method for retrieving H5N1 sequences is to use a query similar to the one shown by @anna-parker above, but we can be more specific and query for the species Alphainfluenzavirus influenzae instead of the genus:

  1. Search NCBI Virus using taxonomy = Alphainfluenzavirus influenzae, then
  2. Use the genotype filter to select H5N1

Here are the results in NCBI Virus.

From there, you could pass the accession list to the datasets command-line tool, or download directly from NCBI Virus.
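
As a rough sketch of the first option, an accession list exported from NCBI Virus could be tidied and then handed to the `datasets` CLI. The file names and function names here are placeholders, and the sketch assumes the export contains one accession per line; `--inputfile` and `--filename` are flags of `datasets download virus genome accession`.

```shell
# Normalize an accession list read on stdin: strip Windows line endings,
# drop blank lines, keep the first field of each line, and de-duplicate.
clean_accessions() {
  tr -d '\r' | awk 'NF { print $1 }' | sort -u
}

# Download virus genomes for the accessions listed in a file.
# $1: accession list file, $2: output zip path (both names are placeholders).
download_by_accession() {
  datasets download virus genome accession --inputfile "$1" --filename "$2"
}

# Example:
# clean_accessions < ncbi_virus_export.txt > h5n1_accessions.txt
# download_by_accession h5n1_accessions.txt h5n1.zip
```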

In the future, we plan to add the genotype field to the virus data report. See #389 (comment)

Best,
Eric

@ericcox1 ericcox1 closed this as completed Oct 1, 2024
@corneliusroemer
Author

Thanks, that makes sense. I didn't know that assigning to subtypes was abandoned. Would be useful indeed if the genotype download was possible at some point!
