-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect stop codon #175
Comments
Hi dleopold, Thank you very much for reporting the issue. I will check it when I have more time. I agree with your point "It would seem more appropriate to stop at the same position as the reference if possible, particularly if the reference is highly similar" if the annotation of reference is correct (but in most cases, we don't know the truth). MitoZ checks the initial CDS ranges from the file So, we can look at the file Secondly, if it hits the reference genome (NC_030597), the gene length information might start to have an effect on the annotation process (the COX2 on NC_030597 might be too short). You can compare the COX2 on NC_030597 with the COX2 of other closely related species, do a multiple sequence alignment (codon-based version), and see if all species end at the same position. |
In this case, the "...position.cds" file shows that the sequence being annotated is hitting itself in the reference database for COX2. Here is the relevant part of the file:
However, although the alignment occurs at positions 7206:7895, the final annotation of the gene is 7206:7916, extending 19bp into the tRNA-Lys despite a T/TA in the 'correct' stop position. As far as I can tell this is not due to an issue with length - I downloaded mitogenomes for 4 similar taxa and they all align very well (codon or nucleotide based alignment) and end at the same T or TA. They are also all the same length, only differing by 1 bp if the stop codon was identified as T or TA. I do realize that we often do not know the "correct" position when working with novel taxa, but I chose this Ref Seq mitogenome as a reproducible example of something that I am seeing quite frequently in my data (~800 fish mitogenomes). To me, this seems like a clear case where the stop position in the original annotation should be properly re-identified by MitoZ. Perhaps I am missing something about how length enters into the delineation of the start/stop positions? I did not see anything about that in the original paper or documentation, so I am just trying to understand why the annotation is incorrect. |
When I check this COX2 of NC_030597.1, I do find a standard stop codon after the extension (the first line below). We can check (at the protein level) if this is also true for the other 4 similar taxa. If all taxa show a similar pattern (conserved proteins at the extended region), then we can say probably that the NCBI Ref annotation for this gene is wrong. |
Perhaps in the future, we can add an option to ask MitoZ not to do such a greedy extension if the users don't want, to make the annotation result more similar to the reference gene (although it's not necessary to be correct). |
It looks like you compared the results with a similar set of reference sequences as the ones I used. Thank you for looking into it. I don't have any additional data to determine which is the 'correct' annotation. However, it appears that for this gene, in the Holocentridae, the NCBI annotations are fairly consistent, all ending before the neighboring tRNA. This is also consistent with annotations produced by other annotation software (e.g. Mitos2, MitoAnnotator). I have also had many of my NCBI submissions rejected when the annotation differs from all of the annotations of currently accepted submissions in this way (for this and other taxonomic groups / genes). So, whether or not there is uncertainty about which is correct, there does seem to be a prevailing consensus. |
Thanks for your information. Maybe we can try to first determine the boundaries of tRNA genes and then set some constraints on the boundaries of PCGs, to avoid the overlapping of PCGs and tRNAs. (However, I remember that at least of some PCGs of some clades like ATP6 and ATP8 genes, the overlapping of genes happens, right?) But anyway, if the goal of studies is for phylogenetic analysis using PCGs, or gene order rearrangement analysis, such kind of issues don't matter at all as far as I can see. |
MitoZ version?
3.4
How did you install MitoZ? (e.g. Docker, Udocker, Singularity, Conda-Pack, Conda, or source code)
Conda
Did you run a test after your installation, and was the test run okay?
Yes.
The command you used?
e.g., mitoz annotate --outprefix NC_030597 --clade Chordata --fastafiles NC_030597.fasta
Problem description
I have been using MitoZ to annotate a large number of fish mitogenomes. I have noticed that the stop codons are sometimes not properly identified and the gene annotation improperly extends into the next gene/feature. This primarily happens when the stop codons are incomplete (T/TA), and an alternative potnential stop codon could be identified further upstream. It also happens even when there is a highly similar, or even identical reference. For example, I have downloaded the complete fasta for NC_030597 (Sargocentron spiniferum) from genbank, which is also included in the Chordata PCG reference database for MitoZ. I then run MitoZ to re-annotate it. The resulting annotation for COX2 extends 20 bp beyond the (identical) reference and into the neighboring tRNA-Lys. It would seem more appropriate to stop at the same position as the reference if possible, particularly if the reference is highly similar and the annotation would otherwise extend into another element on the same strand.
I have also noticed similar behavior for start codons, though less often.
Log messages from MitoZ (stdout and stderr)
NA
The text was updated successfully, but these errors were encountered: