Genes without locus tag in GCF_030052815.1 #397
Comments
Hi @manulera - sorry for the delay in responding; this issue seems to have slipped by. I've looked at the underlying data that we use for these genes, and there doesn't appear to be any locus tag data for the examples you gave (by contrast, if you look at the same organism without the filter, https://ncbi.nlm.nih.gov/datasets/gene/GCF_030052815.1/, you will see locus tags). Bottom line: this looks like some type of data curation issue. We're raising it internally to see if we can get it fixed, and someone will update here once we have more information.
As for your second question, the GenBank record is the original submitted reference, while the RefSeq record is a copy of it with different types of added value (which can lead to different names, etc.). The idea is not to interfere with the original submission but to be able to add various types of curation.
Hope that helps! Thanks!
Hi @manulera,
We don’t assign locus_tags when annotating genomes using the NCBI Eukaryotic Genome Annotation Pipeline (EGAP), which covers nearly all animal and plant genomes in the RefSeq dataset.
Nearly all animal and plant genomes in RefSeq are annotated with EGAP independently of what’s provided in GenBank. Differences in annotation due to software or evidence sets are not uncommon. See our recent paper for more info: NCBI RefSeq: reference sequence standards through 25 years of curation and annotation
Best,
Hi @ericcox1 and @syntheticgio, thanks for the follow-up and the references! Will keep that in mind.
Hello,
I have two questions. Not sure if this is a generic NCBI issue, or related to the datasets API. Happy to forward the query elsewhere.
I came across this problem recently for the genome of Hevea brasiliensis - taxid 3981 - reference genome assembly GCF_030052815.1.
I thought that having locus_tags was a requirement for genomes to be deposited / queried in the NCBI. However, it seems like the genes in the nuclear genome of this assembly do not have locus_tags: https://ncbi.nlm.nih.gov/datasets/gene/GCF_030052815.1/?search=rubber
Question 1: is it to be expected that locus_tags are missing, or is it an issue with this assembly in particular?
I went to the RefSeq (https://www.ncbi.nlm.nih.gov/nuccore/NC_079493.1/) and GenBank (https://www.ncbi.nlm.nih.gov/nuccore/CM057502.1?report=genbank&log$=seqview) records. Below is an example of the same CDS in both records:
NC_079493
CM057502
Here the sequence of both is identical but only one has a locus_tag. There are also cases where there are features that exist in one but not the other.
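For anyone wanting to check this programmatically, here is a minimal Python sketch that counts gene features lacking a /locus_tag qualifier in GenBank flat-file text. It assumes the standard feature-table layout (feature keys indented 5 spaces, qualifiers indented further). The function name `genes_missing_locus_tag` and the `SAMPLE` text are made up for illustration and are not taken from NC_079493 or CM057502.

```python
import re

# Hypothetical feature-table excerpt, NOT real data from either record.
SAMPLE = """\
     gene            100..400
                     /gene="LOC000001"
                     /db_xref="GeneID:000001"
     CDS             100..400
                     /gene="LOC000001"
     gene            500..900
                     /gene="EXAMPLE2"
                     /locus_tag="ABC_000002"
     CDS             500..900
                     /locus_tag="ABC_000002"
"""

# A feature line has exactly 5 leading spaces, then the feature key.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")


def genes_missing_locus_tag(flatfile_text):
    """Return (total_genes, genes_without_locus_tag)."""
    total = 0
    missing = 0
    in_gene = False   # currently inside a "gene" feature
    has_tag = False   # a /locus_tag was seen for the current gene

    for line in flatfile_text.splitlines():
        match = FEATURE_KEY.match(line)
        if match:
            # A new feature starts: close the previous gene, if any.
            if in_gene:
                total += 1
                missing += 0 if has_tag else 1
            in_gene = match.group(1) == "gene"
            has_tag = False
        elif in_gene and line.strip().startswith("/locus_tag="):
            has_tag = True

    # Close a gene feature that runs to the end of the text.
    if in_gene:
        total += 1
        missing += 0 if has_tag else 1
    return total, missing


print(genes_missing_locus_tag(SAMPLE))  # (2, 1)
```

Running the same function on the FEATURES section of each downloaded record would give a quick per-record count; a full parser such as Biopython would be more robust for real files.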
Question 2: is it common that the annotations in GenBank and RefSeq records differ?
Thank you so much for your help!
Best,
Manu