cdot vs UTA

cdot and the Universal Transcript Archive (UTA) share the goal of providing transcripts for resolving HGVS, but they approach it in different ways:

UTA aligns sequences, then stores coordinates in an SQL database.
cdot converts existing Ensembl/RefSeq GTFs into JSON.
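As a rough illustration of the GTF-to-JSON approach, the sketch below groups exon lines by transcript and serialises the result. The record layout (`contig`, `strand`, `gene`, `exons`), the accession `TX1.1`, and the function name `gtf_to_transcripts` are hypothetical and chosen for this example; they are not cdot's actual schema or code.

```python
import json
import re
from collections import defaultdict

# Two made-up exon lines in GTF's 9-column, tab-separated layout.
GTF_TEXT = "\n".join([
    "\t".join(["chr1", "example", "exon", "100", "200", ".", "+", ".",
               'gene_id "G1"; transcript_id "TX1.1";']),
    "\t".join(["chr1", "example", "exon", "300", "400", ".", "+", ".",
               'gene_id "G1"; transcript_id "TX1.1";']),
])


def gtf_to_transcripts(gtf_text):
    """Group GTF exon lines by transcript_id into plain dicts (hypothetical layout)."""
    transcripts = defaultdict(lambda: {"exons": []})
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # only exon features are used in this sketch
        # Attributes look like: key "value"; key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
        tx = transcripts[attrs["transcript_id"]]
        tx["contig"] = cols[0]
        tx["strand"] = cols[6]
        tx["gene"] = attrs.get("gene_id")
        # GTF coordinates are 1-based inclusive; store 0-based half-open here.
        tx["exons"].append([int(cols[3]) - 1, int(cols[4])])
    for tx in transcripts.values():
        tx["exons"].sort()
    return dict(transcripts)


transcripts = gtf_to_transcripts(GTF_TEXT)
print(json.dumps(transcripts, indent=2))
```

No sequence alignment happens anywhere in this path: the coordinates are taken as published in the annotation file.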

Alignment Gaps

RefSeq transcript sequences can differ from the genome sequence, which means they can align with gaps. Prior to v105 (GRCh37.p13), RefSeq did not provide alignment gap information, so UTA had to perform its own alignments to obtain CIGAR strings and handle these gaps correctly.

From v105 onwards, RefSeq provides these gaps, making it possible to use the GFFs directly.
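To show why gap information matters, here is a minimal sketch of mapping a transcript offset to a genome offset through a gapped alignment. It assumes a GFF3-style gap string such as `M8 D3 M6 I1 M6` and the convention that `M` consumes both sequences, `I` consumes transcript bases only, and `D` consumes genome bases only. The function names are hypothetical, and only the forward strand is handled.

```python
import re


def parse_gap(gap):
    """Parse a gap string such as 'M8 D3 M6' into (operation, length) pairs."""
    return [(m.group(1), int(m.group(2))) for m in re.finditer(r"([MID])(\d+)", gap)]


def transcript_to_genome_offset(gap, tx_offset):
    """Map a 0-based transcript offset to a 0-based genome offset.

    Returns None when the transcript base lies in an insertion, i.e. it has
    no counterpart in the genome. Raises ValueError past the alignment end.
    """
    tx_pos = 0      # transcript bases consumed so far
    genome_pos = 0  # genome bases consumed so far
    for op, length in parse_gap(gap):
        if op == "M":
            if tx_offset < tx_pos + length:
                return genome_pos + (tx_offset - tx_pos)
            tx_pos += length
            genome_pos += length
        elif op == "I":
            if tx_offset < tx_pos + length:
                return None  # base exists only in the transcript
            tx_pos += length
        elif op == "D":
            genome_pos += length  # genome bases absent from the transcript
    raise ValueError("offset is beyond the end of the alignment")


gap = "M8 D3 M6 I1 M6"
print(transcript_to_genome_offset(gap, 7))   # last base before the deletion
print(transcript_to_genome_offset(gap, 8))   # first base after the deletion
print(transcript_to_genome_offset(gap, 14))  # base inside the insertion
```

Ignoring the gaps would place every base after the deletion three positions too early on the genome, which is the kind of error the alignment information prevents.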

Advantages of aligning sequences

UTA can map GRCh37 sequences to GRCh38 and vice-versa
UTA can account for alignment gaps in earlier RefSeq releases (cdot uses these UTA transcripts - thanks!)

Advantages of using existing GTFs

Drastically simpler workflow, meaning we can load more transcripts
Alignments exactly match those in official releases

JSON vs SQL

There's a bit of redundancy in JSON, but:

You can copy flat files around without dealing with Docker/PostgreSQL/database schemas etc.
It's trivial to write a REST server and the client already consumes JSON
It's lightning fast to load into RAM in Python
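The flat-file workflow can be sketched with nothing but the standard library. The file name, record layout, and the helpers `load_transcripts` and `get_transcript` below are hypothetical and used only for illustration; they are not cdot's actual file format or API.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical transcript records keyed by accession.
records = {
    "TX1.1": {"contig": "chr1", "strand": "+", "exons": [[99, 200], [299, 400]]},
    "TX2.1": {"contig": "chr2", "strand": "-", "exons": [[49, 120]]},
}


def load_transcripts(path):
    """Read the whole JSON file into an in-memory dict."""
    with open(path) as f:
        return json.load(f)


def get_transcript(transcripts, accession):
    """Look up a transcript by accession; returns None if it is absent."""
    return transcripts.get(accession)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "transcripts.json"
    # "Deploying" the data is just writing or copying a file.
    path.write_text(json.dumps(records))
    transcripts = load_transcripts(path)

print(get_transcript(transcripts, "TX1.1"))
```

Once loaded, every lookup is a dictionary access, with no database connection or schema to manage.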
