-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Excessive Number of Pseudo Genes #69
Comments
Our current logic doesn't trust RNA-seq to decide whether a pseudogene is transcribed or not given how pseudogenes often are fairly similar to their parent genes. So they might have transcript support. It's an item on our list for future refinement, but probably won't get to it for a while. I'm guessing this genome is polyploid? How recent is the polyploidization? I'm wondering if the high pseudogene count is a reflection of degradation of some of the genome copies. |
Yes, this is a polyploid, and the polyploidization occurred about 500,000 years ago. Can I use PASA to update the annotation of these pseudogenes? |
I don't know what PASA would do, especially for the pseudogenes. If they're getting annotated as pseudogenes it's likely because there's something wrong with the annotations (frameshift or internal stops). Is this a public assembly that we could run some tests on? If not, we can pick some other polyploids and see if they reproduce the effect. |
This genome has not been made public yet. It is an allopolyploid with a genome size of approximately 4.7 Gb. Due to the relatively recent polyploidization event, the two subgenomes have a low level of fusion. Using other pipelines, over 90,000 genes were annotated. This number basically matches the sum of the gene counts of the existing species of the two ancestral species. When using the default protein sequence database of EGAPX along with RNA-seq data, the gene annotation results are as follows.
At the beginning of this issue, the annotation was performed using a custom protein sequence database as input, which included protein sequences from species within the same genus as this organism. Which result should be more accurate? |
Providing same-genus proteins will introduce substantial bias in the results. In general, those proteins are highly likely to align and generate annotation, whether they are good annotations or not. EGAPx does some filtering for suspect proteins via comparison to the RNA-seq data, but some genes have restricted expression so we don't require everything to be expressed. So we want to have confidence that the input proteins are real. The high pseudogene count you originally got is likely a reflection of those same-genus proteins aligning to this species, but with defects. In general, real, well-annotated proteins will align well to other species without internal stop codons or frameshifts. If they have such defects, it can indicate that the annotations weren't correct, or they aren't even protein-coding genes in the first place. One possibility is that there are many annotations on transposons, and in this new genome those transposon proteins are found but many of them are defective. Or that gene set includes all found open reading frames above some size without further consideration of protein-coding potential. Aligning to another closely related genome with defects suggests the sequence is not under selective constraint, which makes it less likely to be a functional protein. Are there annotations for either of the ancestral species in the RefSeq dataset (EGAP annotations)? Note the RefSeq datasets in plants tend to be inflated for protein-coding genes because our mechanisms for removal of transposon-based proteins don't work as well as we'd like in plants. This noise has increased in recent years as genomes have improved and include more repeat sequence. In EGAPx we're omitting some logic that is the major source of that noise, and we expect smaller coding gene counts. What taxonomic order is your genome in? Which protein dataset is EGAPx picking for annotation? |
Thank you for your reply! Two questions:
In the annotation results, I also considered the potential impact of transposon proteins and identified transposon proteins in the annotation results based on their domains using TEsorter. The results showed a relatively low content of transposon proteins (around 1500 in total). Does this suggest that the pseudogenes annotated by EGAPx are the result of a "competition" between the two subgenomes (i.e., for alleles between the two subgenomes, only one is protein-coding while the other becomes a pseudogene)? |
That's a very timely example. We've annotated four tobacco genomes in the RefSeq dataset with EGAP: N. sylvestris and N. tomentosiformis are very recent runs, with high-contiguity ONT assemblies. We annotated 43-45k protein-coding genes in each. However, I'll say that we debated quite a bit about likely transposon content in the results and whether to change some settings, ultimately deciding to release them in this state but revisit later when we can work out logic to better filter out possible noise. Here are the EGAP reports for those two: The number of mRNAs reported "with > 5% ab initio" is unusually high for these two. What that report doesn't include is that nearly all of those models are completely based on ab initio prediction, and lack RNA-seq support. This is very unusual for an EGAP annotation. EGAPx doesn't annotate those models at all. If we exclude those the coding gene counts are more like 25k each, more in line with half of the 62k you're seeing. But that also seemed a little low, and had a small effect on BUSCO counts (but only ~1% if I recall correctly). So that's why we went ahead and released with all the ab initio genes, even though we think a good number of them are probably not real. Looking at the GO annotation for N. sylvestris:
So bottom line is I don't know what to think the true number of protein coding genes is in tobacco, but I suspect it's closer to 30k than 45k per genome and the 62k you're seeing might be about right. But it would take a much deeper dive including sampling of different annotation sets to be sure one way or the other. |
Thank you for your patient explanation! Perhaps more discussion is needed to make EGAPx even more perfect. I have previously used Helixer (It utilizes Deep Neural Networks and a Hidden Markov Model to directly provide primary gene models in a gff3 file) to annotate Nicotiana sylvestris and Nicotiana tomentosiformis, resulting in gene numbers of 33,369 and 29,941, respectively, with BUSCO scores exceeding 95% for both. I am not sure whether these results can contribute to the optimization of EGAPx. I also have a question: since pseudogenes are classified as non-protein-coding genes, what potential functions might they still perform? In other words, what is the significance of annotating pseudogenes? Another question: How can I distinguish between ab initio genes and evidence-based genes in the current version of EGAPx? |
Hello! I have noticed that there is a significant number of "non-transcribed pseudo" genes in the final results. I would like to inquire whether these genes, defined as "non-transcribed pseudo," are those that lack RNA-seq support.
Thank you for your assistance!
The text was updated successfully, but these errors were encountered: