Excessive Number of Pseudo Genes #69

MrbrilliantLL · 2024-12-15T09:54:06Z

Hello! I have noticed that there is a significant number of "non-transcribed pseudo" genes in the final results. I would like to inquire whether these genes, defined as "non-transcribed pseudo," are those that lack RNA-seq support.

Thank you for your assistance!

Overall Counts:
genes: 95702
genes (other): 0
genes (non-transcribed pseudo): 35468
genes (transcribed pseudo): 0
genes (has variants): 8762
genes (partial): 360
genes (Ig TCR segment): 0
genes (non coding): 2844
genes (protein coding): 57390
genes (major correction): 426
genes (minor correction): 0
genes (premature stop): 269
genes (has frameshifts): 299

murphyte · 2024-12-16T16:24:41Z

Our current logic doesn't trust RNA-seq to decide whether a pseudogene is transcribed or not given how pseudogenes often are fairly similar to their parent genes. So they might have transcript support. It's an item on our list for future refinement, but probably won't get to it for a while.

I'm guessing this genome is polyploid? How recent is the polyploidization? I'm wondering if the high pseudogene count is a reflection of degradation of some of the genome copies.

MrbrilliantLL · 2024-12-17T01:06:43Z

Yes, this is a polyploid, and the polyploidization occurred about 500,000 years ago. Can I use PASA to update the annotation of these pseudogenes?

murphyte · 2024-12-17T12:46:00Z

I don't know what PASA would do, especially for the pseudogenes. If they're getting annotated as pseudogenes it's likely because there's something wrong with the annotations (frameshift or internal stops). Is this a public assembly that we could run some tests on? If not, we can pick some other polyploids and see if they reproduce the effect.

MrbrilliantLL · 2024-12-19T03:29:42Z

This genome has not been made public yet. It is an allopolyploid with a genome size of approximately 4.7 Gb. Due to the relatively recent polyploidization event, the two subgenomes have a low level of fusion.

Using other pipelines, over 90,000 genes were annotated. This number basically matches the sum of the gene counts of the existing species of the two ancestral species. When using the default protein sequence database of EGAPX along with RNA-seq data, the gene annotation results are as follows.

genes: 62391
genes (other): 0
genes (non-transcribed pseudo): 8133
genes (transcribed pseudo): 0
genes (has variants): 8765
genes (partial): 164
genes (Ig TCR segment): 0
genes (non coding): 3058
genes (protein coding): 51200
genes (major correction): 95
genes (minor correction): 0
genes (premature stop): 41
genes (has frameshifts): 74

At the beginning of this issue, the annotation was performed using a custom protein sequence database as input, which included protein sequences from species within the same genus as this organism.

Which result should be more accurate?

murphyte · 2024-12-19T12:52:37Z

Providing same-genus proteins will introduce substantial bias in the results. In general, those proteins are highly likely to align and generate annotation, whether they are good annotations or not. EGAPx does some filtering for suspect proteins via comparison to the RNA-seq data, but some genes have restricted expression so we don't require everything to be expressed. So we want to have confidence that the input proteins are real.

The high pseudogene count you originally got is likely a reflection of those same-genus proteins aligning to this species, but with defects. In general, real, well-annotated proteins will align well to other species without internal stop codons or frameshifts. If they have such defects, it can indicate that the annotations weren't correct, or they aren't even protein-coding genes in the first place. One possibility is that there are many annotations on transposons, and in this new genome those transposon proteins are found but many of them are defective. Or that gene set includes all found open reading frames above some size without further consideration of protein-coding potential. Aligning to another closely related genome with defects suggests the sequence is not under selective constraint, which makes it less likely to be a functional protein.

Are there annotations for either of the ancestral species in the RefSeq dataset (EGAP annotations)? Note the RefSeq datasets in plants tend to be inflated for protein-coding genes because our mechanisms for removal of transposon-based proteins don't work as well as we'd like in plants. This noise has increased in recent years as genomes have improved and include more repeat sequence. In EGAPx we're omitting some logic that is the major source of that noise, and we expect smaller coding gene counts.

What taxonomic order is your genome in? Which protein dataset is EGAPx picking for annotation?

MrbrilliantLL · 2024-12-20T01:49:20Z

Thank you for your reply! Two questions:

The RefSeq dataset I provided does not include protein sequence information for the ancestral species.
The Taxonomy ID for this species is 4093. Below is the YAML file used for the annotation process, which includes the protein dataset automatically selected by EGAPx.

ortho:
genomic.fna: /data/tools/egapx/local_cache/ortholog_references/2/3702/current/genomic.fna.gz
genomic.gff: /data/tools/egapx/local_cache/ortholog_references/2/3702/current/genomic.gff.gz
name_from.rpt: /data/tools/egapx/local_cache/ortholog_references/2/3702/name_from_ortholog.rpt
protein.faa: /data/tools/egapx/local_cache/ortholog_references/2/3702/current/protein.faa.gz
taxid: 3702
proteins: /data/tools/egapx/local_cache/target_proteins/2/71274.faa.gz
proteins_trusted: /data/tools/egapx/local_cache/target_proteins/2/71274.trusted_proteins.gi
reference_sets: /data/tools/egapx/local_cache/reference_sets/2/swissprot.asnb.gz

In the annotation results, I also considered the potential impact of transposon proteins and identified transposon proteins in the annotation results based on their domains using TEsorter. The results showed a relatively low content of transposon proteins (around 1500 in total).

Does this suggest that the pseudogenes annotated by EGAPx are the result of a "competition" between the two subgenomes (i.e., for alleles between the two subgenomes, only one is protein-coding while the other becomes a pseudogene)?

murphyte · 2024-12-20T22:50:42Z

That's a very timely example. We've annotated four tobacco genomes in the RefSeq dataset with EGAP:
https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=4085&annotated_only=true&refseq_annotation=true

N. sylvestris and N. tomentosiformis are very recent runs, with high-contiguity ONT assemblies. We annotated 43-45k protein-coding genes in each. However, I'll say that we debated quite a bit about likely transposon content in the results and whether to change some settings, ultimately deciding to release them in this state but revisit later when we can work out logic to better filter out possible noise. Here are the EGAP reports for those two:
https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Nicotiana_sylvestris/GCF_000393655.2-RS_2024_11/
https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Nicotiana_tomentosiformis/GCF_000390325.3-RS_2024_10/

The number of mRNAs reported "with > 5% ab initio" is unusually high for these two. What that report doesn't include is that nearly all of those models are completely based on ab initio prediction, and lack RNA-seq support. This is very unusual for an EGAP annotation. EGAPx doesn't annotate those models at all. If we exclude those the coding gene counts are more like 25k each, more in line with half of the 62k you're seeing. But that also seemed a little low, and had a small effect on BUSCO counts (but only ~1% if I recall correctly). So that's why we went ahead and released with all the ab initio genes, even though we think a good number of them are probably not real.

Looking at the GO annotation for N. sylvestris:

there are 21k ab initio genes, and 24k that are evidence based
7k of the ab initio genes (33%) have GO annotation. 19k of the evidence based ones (80%) have GO

So bottom line is I don't know what to think the true number of protein coding genes is in tobacco, but I suspect it's closer to 30k than 45k per genome and the 62k you're seeing might be about right. But it would take a much deeper dive including sampling of different annotation sets to be sure one way or the other.

MrbrilliantLL · 2024-12-21T12:54:43Z

Thank you for your patient explanation! Perhaps more discussion is needed to make EGAPx even more perfect.

I have previously used Helixer (It utilizes Deep Neural Networks and a Hidden Markov Model to directly provide primary gene models in a gff3 file) to annotate Nicotiana sylvestris and Nicotiana tomentosiformis, resulting in gene numbers of 33,369 and 29,941, respectively, with BUSCO scores exceeding 95% for both. I am not sure whether these results can contribute to the optimization of EGAPx.

I also have a question: since pseudogenes are classified as non-protein-coding genes, what potential functions might they still perform? In other words, what is the significance of annotating pseudogenes?

Another question: How can I distinguish between ab initio genes and evidence-based genes in the current version of EGAPx?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Excessive Number of Pseudo Genes #69

Excessive Number of Pseudo Genes #69

MrbrilliantLL commented Dec 15, 2024

murphyte commented Dec 16, 2024

MrbrilliantLL commented Dec 17, 2024

murphyte commented Dec 17, 2024

MrbrilliantLL commented Dec 19, 2024

murphyte commented Dec 19, 2024

MrbrilliantLL commented Dec 20, 2024

murphyte commented Dec 20, 2024

MrbrilliantLL commented Dec 21, 2024

Excessive Number of Pseudo Genes #69

Excessive Number of Pseudo Genes #69

Comments

MrbrilliantLL commented Dec 15, 2024

murphyte commented Dec 16, 2024

MrbrilliantLL commented Dec 17, 2024

murphyte commented Dec 17, 2024

MrbrilliantLL commented Dec 19, 2024

murphyte commented Dec 19, 2024

MrbrilliantLL commented Dec 20, 2024

murphyte commented Dec 20, 2024

MrbrilliantLL commented Dec 21, 2024