Question about the Support of Yeast #13

Open
ypriverol opened this issue Oct 30, 2017 · 3 comments

@ypriverol
Collaborator

ypriverol commented Oct 30, 2017

Hi @cschlaffner :

Can you explain in detail why we cannot do the mapping for taxonomies such as yeast or E. coli? This issue can be used to start the discussion with Ensembl and explain to them the problem we are facing. We have more than 10 yeast projects that we would like to be able to map to Ensembl.

@cschlaffner
Owner

Hi @ypriverol

I have checked the two species you mentioned in your question. We can rewrite the GTF and FASTA parsers, but we need the chromosome/plasmid IDs and the additional specifications described below.

Chromosome IDs (chromosome names such as 1,2,3,4,... in human) are very different in different species. I would need a list of all chromosome names for the primary assembly as well as for patches, haplotypes etc., with the primary assembly highlighted. This is the most important requirement for enabling PoGo to map additional species.

Additional specifications:

GTF:

  • gene line holds gene_id in the description column as - gene_id "gene_id";
  • transcript line holds gene_id and transcript_id in the description column as - gene_id "gene_id"; transcript_id "transcript_id";
  • CDS line holds gene_id, transcript_id, and exon_id in the description column as - gene_id "gene_id"; transcript_id "transcript_id"; exon_id "exon_id";
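The three GTF requirements above can be checked line by line. The following is a minimal sketch, not PoGo's actual parser: it assumes a standard nine-column, tab-separated GTF line in which the description (attribute) column is the last one and every value is double-quoted. The names `REQUIRED_KEYS`, `parse_attributes` and `missing_keys` are hypothetical.

```python
import re

# Required attribute keys per feature type, taken from the list above.
REQUIRED_KEYS = {
    "gene": ("gene_id",),
    "transcript": ("gene_id", "transcript_id"),
    "CDS": ("gene_id", "transcript_id", "exon_id"),
}

# Matches one `key "value";` pair; unquoted values are ignored (assumption).
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(column):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    return dict(ATTRIBUTE_PATTERN.findall(column))


def missing_keys(gtf_line):
    """List the required keys that a gene, transcript or CDS line lacks."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    feature = fields[2]
    attributes = parse_attributes(fields[8])
    # Feature types without a stated requirement (e.g. exon) are not checked.
    return [key for key in REQUIRED_KEYS.get(feature, ()) if key not in attributes]
```

Calling `missing_keys` on a CDS line that carries only `gene_id` and `transcript_id` returns `["exon_id"]`, so a line that deviates from the expected structure is reported rather than silently accepted.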

FASTA:

  • every fasta header contains gene_id and transcript_id as - gene:gene_id transcript:transcript_id
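As an illustration of the FASTA requirement, here is a minimal sketch of how the two IDs could be read from a header. It assumes the `gene:` and `transcript:` fields are separated from their neighbours by whitespace; the function name `parse_header` and the IDs in the usage note are hypothetical, and this is not PoGo's own code.

```python
import re

# Require whitespace (or line start) before the key so that fields such as
# "gene_biotype:" are not mistaken for "gene:".
GENE_PATTERN = re.compile(r"(?:^|\s)gene:(\S+)")
TRANSCRIPT_PATTERN = re.compile(r"(?:^|\s)transcript:(\S+)")


def parse_header(header):
    """Return (gene_id, transcript_id) from a FASTA header line."""
    gene = GENE_PATTERN.search(header)
    transcript = TRANSCRIPT_PATTERN.search(header)
    if gene is None or transcript is None:
        raise ValueError("header lacks gene:/transcript: fields: " + header)
    return gene.group(1), transcript.group(1)
```

For a header such as `>P1 pep gene:G1 transcript:T1` this returns `("G1", "T1")`. A header that does not carry both fields raises `ValueError`, which makes files that do not follow the structure easy to spot.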

Currently PoGo supports the GENCODE annotation. However, GENCODE does not follow the FASTA header structure described above. I will start a discussion with GENCODE to enable novel mapping for annotation purposes.

It would be great if you could ask Ensembl to confirm the above specifications for all species in Ensembl Genomes, Ensembl Bacteria, Ensembl Protists, Ensembl Fungi, Ensembl Plants, Ensembl Metazoa and Ensembl (vertebrates). I also need a full list of all primary assembly and patch/haplotype etc. chromosome names for all species in Ensembl and the Ensembl sub-sites.

@ypriverol
Collaborator Author

Thanks @cschlaffner for your quick reply. To move this forward and understand it better, I have some questions:

Chromosome IDs (chromosome names such as 1,2,3,4,... in human) are very different in different species.

Can you put an example here?

  • gene line holds gene_id in the description column as - gene_id "gene_id";
  • transcript line holds gene_id and transcript_id in the description column as - gene_id "gene_id"; transcript_id "transcript_id";
  • CDS line holds gene_id, transcript_id, and exon_id in the description column as - gene_id "gene_id"; transcript_id "transcript_id"; exon_id "exon_id";

Is this not the case in the GTF files for these species? Can you give an example?

Regards
Yasset

@cschlaffner
Owner

@ypriverol

Chromosome IDs in different species:

  • Human: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, KI270713.1, KI270711.1, GL000195.1, GL000219.1, GL000216.2, ...
  • Yeast: I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, Mito
  • E.coli: Chromosome, pHUSEC2011-1, pHUSEC2011-2, pHUSEC2011-3
  • gene line holds gene_id in the description column as - gene_id "gene_id";
  • transcript line holds gene_id and transcript_id in the description column as - gene_id "gene_id"; transcript_id "transcript_id";
  • CDS line holds gene_id, transcript_id, and exon_id in the description column as - gene_id "gene_id"; transcript_id "transcript_id"; exon_id "exon_id";
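The chromosome IDs in the first three bullets above show why a per-species list is needed: "X", for example, appears in both the human and the yeast list. A minimal sketch of such a lookup follows, using only the two complete lists from this thread (the human list is truncated and therefore left out). The names `CHROMOSOME_NAMES` and `unknown_chromosomes` and the species keys are hypothetical.

```python
# Chromosome names copied from the yeast and E. coli bullets above.
# The human list ends in "..." in the thread, so it is not reproduced here.
CHROMOSOME_NAMES = {
    "yeast": {
        "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
        "XI", "XII", "XIII", "XIV", "XV", "XVI", "Mito",
    },
    "e_coli": {
        "Chromosome", "pHUSEC2011-1", "pHUSEC2011-2", "pHUSEC2011-3",
    },
}


def unknown_chromosomes(species, names):
    """Return the names, in input order, that are not listed for the species."""
    known = CHROMOSOME_NAMES[species]
    return [name for name in names if name not in known]
```

With these lists, `unknown_chromosomes("yeast", ["I", "X", "MT"])` reports only `"MT"`, since the mitochondrial sequence is named `Mito` in the yeast list.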

As for the GTF structure: I have seen that the placement of exon_id "exon_id" varies. It sometimes moves from the CDS line to the exon line and vice versa, specifically between Ensembl and GENCODE GTF files.

Also, I just need confirmation from Ensembl that gene_id and transcript_id are used as described for all species in Ensembl without exception. If Ensembl ensures that structure, e.g. through their internal release code, then I do not have to download all GTF files and parse through all of them.
