
Transcript JSON format

Dave Lawrence edited this page Feb 1, 2022 · 4 revisions

Intro

In bioinformatics, gene/transcript information is traditionally stored in GTF/GFF. These formats work well with genome browsers and can be manipulated with Unix tools, but they are very verbose and awkward to work with.

JSON has become the standard for the web, and Python has very fast implementations, so storing gene/transcript information in JSON seems like a good fit.

Ideally, we'd like to make this a standard, and plan to pitch it to GA4GH.

cdot Transcript JSON format

There are some example transcripts on http://cdot.cc

The JSON.gz files contain a bit of extra metadata, but are mostly just a dictionary with keys being transcript versions and values being what you see on the web API above.
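As a rough sketch of how such a file might be read in Python: the snippet below opens a gzipped JSON file and looks up one transcript by its versioned accession. The top-level "transcripts" key and the helper names are assumptions made for illustration, so check them against the actual file before relying on them.

```python
import gzip
import json


def load_transcripts(path):
    """Load a cdot-style JSON.gz file and return the transcript dictionary.

    Assumption: the transcripts sit under a top-level "transcripts" key,
    next to the extra metadata. Adjust the key if the real file differs.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    return data["transcripts"]


def get_transcript(transcripts, accession):
    """Look up one transcript by versioned accession; None if it is absent."""
    return transcripts.get(accession)
```

Because the keys are transcript versions, a lookup is a plain dictionary access and needs no index or scan.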

The transcript format was originally built for the PyReference project, then adapted for PyHGVS, and finally for HGVS and cdot (REST). Having multiple consumers keeps us honest and the format portable.

An example of the JSON file can be found here:

https://github.com/SACGF/cdot/blob/main/tests/test_data/cdot.refseq.grch37.json

TODO

Exons are sorted in genomic order

Potential changes / issues

  • Our "exons" array keeps track of transcript start/end; we could rebuild this from the alignment gaps.
  • Our "exons" array uses gap=None to indicate a perfect alignment, whereas BioCommons HGVS always provides a value, e.g. "M100" for a perfect match of 100 bases.
  • Our "exons" array contains exon_id; this could be computed at runtime via a reverse/enumerate.
  • Genome build patches may have base changes which alter splice sites. This may make historical GTFs subtly wrong.
  • In cdot_json merge_builds we throw away earlier coordinates if there's a conflict. We should look into this to work out what's going wrong (I assume it's due to build sequence changes altering exons etc., but it's probably worth investigating).
  • We store coordinates separately for each genome build. If we stored them by contig, we'd remove the redundancy for shared contigs (e.g. chrM). This would make it slightly harder for humans to read, though, as they'd have to work out which contig belongs to each build.
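Two of the ideas above (computing exon_id at runtime, and always emitting a match string instead of gap=None) could look roughly like this in Python. This is only a sketch under stated assumptions: exon ids are taken to be 0-based and numbered in transcript order, coordinates are taken to be 0-based half-open so that length is end minus start, and both function names are hypothetical.

```python
def derive_exon_ids(exons, strand):
    """Return an exon id for each exon, given exons sorted in genomic order.

    Assumption: ids are 0-based and follow transcript order, so on the "-"
    strand the last exon in genomic order is exon 0.
    """
    count = len(exons)
    if strand == "+":
        return list(range(count))
    return list(reversed(range(count)))


def gap_or_full_match(start, end, gap):
    """Return the gap string, or "M<length>" when gap is None.

    Assumption: start/end are 0-based half-open, so length is end - start.
    """
    if gap is not None:
        return gap
    return f"M{end - start}"
```

Deriving both values on read would keep the stored arrays smaller, at the cost of every consumer needing to agree on the strand and coordinate conventions.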

Other examples

Ensembl have a lookup endpoint in their REST API:

https://rest.ensembl.org/documentation/info/lookup

It looks to provide enough for us to create records for HGVS, e.g.:

https://rest.ensembl.org/lookup/id/ENSG00000179348?expand=1;content-type=application/json

However, Ensembl do not provide all historical transcript versions.