Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive Number of Pseudo Genes #69

Open
MrbrilliantLL opened this issue Dec 15, 2024 · 8 comments
Open

Excessive Number of Pseudo Genes #69

MrbrilliantLL opened this issue Dec 15, 2024 · 8 comments

Comments

@MrbrilliantLL
Copy link

Hello! I have noticed that there is a significant number of "non-transcribed pseudo" genes in the final results. I would like to inquire whether these genes, defined as "non-transcribed pseudo," are those that lack RNA-seq support.

Thank you for your assistance!

Overall Counts:
genes: 95702
genes (other): 0
genes (non-transcribed pseudo): 35468
genes (transcribed pseudo): 0
genes (has variants): 8762
genes (partial): 360
genes (Ig TCR segment): 0
genes (non coding): 2844
genes (protein coding): 57390
genes (major correction): 426
genes (minor correction): 0
genes (premature stop): 269
genes (has frameshifts): 299

@murphyte
Copy link

Our current logic doesn't trust RNA-seq to decide whether a pseudogene is transcribed or not given how pseudogenes often are fairly similar to their parent genes. So they might have transcript support. It's an item on our list for future refinement, but probably won't get to it for a while.

I'm guessing this genome is polyploid? How recent is the polyploidization? I'm wondering if the high pseudogene count is a reflection of degradation of some of the genome copies.

@MrbrilliantLL
Copy link
Author

Yes, this is a polyploid, and the polyploidization occurred about 500,000 years ago. Can I use PASA to update the annotation of these pseudogenes?

@murphyte
Copy link

I don't know what PASA would do, especially for the pseudogenes. If they're getting annotated as pseudogenes it's likely because there's something wrong with the annotations (frameshift or internal stops). Is this a public assembly that we could run some tests on? If not, we can pick some other polyploids and see if they reproduce the effect.

@MrbrilliantLL
Copy link
Author

This genome has not been made public yet. It is an allopolyploid with a genome size of approximately 4.7 Gb. Due to the relatively recent polyploidization event, the two subgenomes have a low level of fusion.

Using other pipelines, over 90,000 genes were annotated. This number basically matches the sum of the gene counts of the existing species of the two ancestral species. When using the default protein sequence database of EGAPX along with RNA-seq data, the gene annotation results are as follows.

genes: 62391
genes (other): 0
genes (non-transcribed pseudo): 8133
genes (transcribed pseudo): 0
genes (has variants): 8765
genes (partial): 164
genes (Ig TCR segment): 0
genes (non coding): 3058
genes (protein coding): 51200
genes (major correction): 95
genes (minor correction): 0
genes (premature stop): 41
genes (has frameshifts): 74

At the beginning of this issue, the annotation was performed using a custom protein sequence database as input, which included protein sequences from species within the same genus as this organism.

Which result should be more accurate?

@murphyte
Copy link

Providing same-genus proteins will introduce substantial bias in the results. In general, those proteins are highly likely to align and generate annotation, whether they are good annotations or not. EGAPx does some filtering for suspect proteins via comparison to the RNA-seq data, but some genes have restricted expression so we don't require everything to be expressed. So we want to have confidence that the input proteins are real.

The high pseudogene count you originally got is likely a reflection of those same-genus proteins aligning to this species, but with defects. In general, real, well-annotated proteins will align well to other species without internal stop codons or frameshifts. If they have such defects, it can indicate that the annotations weren't correct, or they aren't even protein-coding genes in the first place. One possibility is that there are many annotations on transposons, and in this new genome those transposon proteins are found but many of them are defective. Or that gene set includes all found open reading frames above some size without further consideration of protein-coding potential. Aligning to another closely related genome with defects suggests the sequence is not under selective constraint, which makes it less likely to be a functional protein.

Are there annotations for either of the ancestral species in the RefSeq dataset (EGAP annotations)? Note the RefSeq datasets in plants tend to be inflated for protein-coding genes because our mechanisms for removal of transposon-based proteins don't work as well as we'd like in plants. This noise has increased in recent years as genomes have improved and include more repeat sequence. In EGAPx we're omitting some logic that is the major source of that noise, and we expect smaller coding gene counts.

What taxonomic order is your genome in? Which protein dataset is EGAPx picking for annotation?

@MrbrilliantLL
Copy link
Author

Thank you for your reply! Two questions:

  1. The RefSeq dataset I provided does not include protein sequence information for the ancestral species.
  2. The Taxonomy ID for this species is 4093. Below is the YAML file used for the annotation process, which includes the protein dataset automatically selected by EGAPx.

ortho:
genomic.fna: /data/tools/egapx/local_cache/ortholog_references/2/3702/current/genomic.fna.gz
genomic.gff: /data/tools/egapx/local_cache/ortholog_references/2/3702/current/genomic.gff.gz
name_from.rpt: /data/tools/egapx/local_cache/ortholog_references/2/3702/name_from_ortholog.rpt
protein.faa: /data/tools/egapx/local_cache/ortholog_references/2/3702/current/protein.faa.gz
taxid: 3702
proteins: /data/tools/egapx/local_cache/target_proteins/2/71274.faa.gz
proteins_trusted: /data/tools/egapx/local_cache/target_proteins/2/71274.trusted_proteins.gi
reference_sets: /data/tools/egapx/local_cache/reference_sets/2/swissprot.asnb.gz

In the annotation results, I also considered the potential impact of transposon proteins and identified transposon proteins in the annotation results based on their domains using TEsorter. The results showed a relatively low content of transposon proteins (around 1500 in total).

Does this suggest that the pseudogenes annotated by EGAPx are the result of a "competition" between the two subgenomes (i.e., for alleles between the two subgenomes, only one is protein-coding while the other becomes a pseudogene)?

@murphyte
Copy link

That's a very timely example. We've annotated four tobacco genomes in the RefSeq dataset with EGAP:
https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=4085&annotated_only=true&refseq_annotation=true

N. sylvestris and N. tomentosiformis are very recent runs, with high-contiguity ONT assemblies. We annotated 43-45k protein-coding genes in each. However, I'll say that we debated quite a bit about likely transposon content in the results and whether to change some settings, ultimately deciding to release them in this state but revisit later when we can work out logic to better filter out possible noise. Here are the EGAP reports for those two:
https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Nicotiana_sylvestris/GCF_000393655.2-RS_2024_11/
https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Nicotiana_tomentosiformis/GCF_000390325.3-RS_2024_10/

The number of mRNAs reported "with > 5% ab initio" is unusually high for these two. What that report doesn't include is that nearly all of those models are completely based on ab initio prediction, and lack RNA-seq support. This is very unusual for an EGAP annotation. EGAPx doesn't annotate those models at all. If we exclude those the coding gene counts are more like 25k each, more in line with half of the 62k you're seeing. But that also seemed a little low, and had a small effect on BUSCO counts (but only ~1% if I recall correctly). So that's why we went ahead and released with all the ab initio genes, even though we think a good number of them are probably not real.

Looking at the GO annotation for N. sylvestris:

  • there are 21k ab initio genes, and 24k that are evidence based
  • 7k of the ab initio genes (33%) have GO annotation. 19k of the evidence based ones (80%) have GO

So bottom line is I don't know what to think the true number of protein coding genes is in tobacco, but I suspect it's closer to 30k than 45k per genome and the 62k you're seeing might be about right. But it would take a much deeper dive including sampling of different annotation sets to be sure one way or the other.

@MrbrilliantLL
Copy link
Author

Thank you for your patient explanation! Perhaps more discussion is needed to make EGAPx even more perfect.

I have previously used Helixer (It utilizes Deep Neural Networks and a Hidden Markov Model to directly provide primary gene models in a gff3 file) to annotate Nicotiana sylvestris and Nicotiana tomentosiformis, resulting in gene numbers of 33,369 and 29,941, respectively, with BUSCO scores exceeding 95% for both. I am not sure whether these results can contribute to the optimization of EGAPx.

I also have a question: since pseudogenes are classified as non-protein-coding genes, what potential functions might they still perform? In other words, what is the significance of annotating pseudogenes?

Another question: How can I distinguish between ab initio genes and evidence-based genes in the current version of EGAPx?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants