cdot vs UTA
Dave Lawrence edited this page Feb 3, 2022
cdot and the Universal Transcript Archive (UTA) have similar goals of providing transcripts for loading HGVS, but they approach it in different ways:
- UTA aligns sequences, then stores coordinates in an SQL database.
- cdot converts existing Ensembl/RefSeq GTFs into JSON.
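To make the GTF-to-JSON idea concrete, here is a minimal sketch that groups the `exon` rows of a GTF by `transcript_id` and serialises them. The function name `gtf_exons_to_json` and the output layout (`contig`, `strand`, `gene_id`, `exons`) are assumptions made for illustration; they are not cdot's actual converter or schema.

```python
import json
import re
from collections import defaultdict

# GTF attributes look like: gene_id "G1"; transcript_id "NM_TEST.1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gtf_exons_to_json(gtf_text):
    """Group GTF exon rows by transcript_id and return a JSON string.

    The output layout is hypothetical, chosen only for this example.
    """
    transcripts = defaultdict(lambda: {"exons": []})
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        # The nine standard tab-separated GTF columns
        (contig, _source, feature, start, end,
         _score, strand, _frame, attributes) = line.split("\t", 8)
        if feature != "exon":
            continue
        attrs = dict(ATTRIBUTE.findall(attributes))
        record = transcripts[attrs["transcript_id"]]
        record["contig"] = contig
        record["strand"] = strand
        record["gene_id"] = attrs.get("gene_id")
        # Coordinates are kept exactly as written in the GTF (1-based, inclusive)
        record["exons"].append([int(start), int(end)])
    for record in transcripts.values():
        record["exons"].sort()  # order exons by genomic start
    return json.dumps(transcripts, indent=2, sort_keys=True)
```

Because the coordinates are copied straight from the annotation file, no alignment step is involved in this sketch.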
RefSeq transcript sequences can differ from the genome sequence, which means they can align with gaps. Prior to v105 (GRCh37.p13), RefSeq did not provide alignment gap information, so UTA had to perform its own alignments to obtain CIGAR strings and handle these gaps correctly.
From v105 onwards, RefSeq provides this gap information, making it possible to use the GFFs directly.
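The effect of an alignment gap on coordinates can be sketched with a small CIGAR walker. This assumes SAM-style semantics with the genome as the reference (`M`, `=`, `X` consume both sequences, `I` consumes only the transcript, `D` consumes only the genome); the convention used by a particular data source may differ, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MID=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5M1D4M' into (length, op) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not consume
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"Unsupported CIGAR string: {cigar!r}")
    return ops


def transcript_to_genome_offset(cigar, tx_offset):
    """Map a 0-based transcript offset to a 0-based genome offset.

    Both offsets are relative to the start of the aligned block.
    Returns None if the transcript base sits in an insertion and so
    has no genomic counterpart.
    """
    tx_pos = 0
    genome_pos = 0
    for length, op in parse_cigar(cigar):
        if op in "M=X":  # aligned: advances both sequences
            if tx_offset < tx_pos + length:
                return genome_pos + (tx_offset - tx_pos)
            tx_pos += length
            genome_pos += length
        elif op == "I":  # bases present only in the transcript
            if tx_offset < tx_pos + length:
                return None
            tx_pos += length
        else:  # "D": bases present only in the genome
            genome_pos += length
    raise IndexError("offset is beyond the aligned transcript length")
```

With `5M1D4M`, every transcript base after the first five is shifted by one on the genome, which is the kind of offset that is lost when gap information is unavailable.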
Advantages of UTA:
- UTA can map GRCh37 sequences to GRCh38 and vice versa.
- UTA can account for alignment gaps in earlier RefSeq releases (cdot uses these UTA transcripts - thanks!)
Advantages of cdot:
- A drastically simpler workflow, which means we can load more transcripts.
- Alignments exactly match those in official releases.
There's a bit of redundancy in JSON, but:
- You can copy flat files around without dealing with Docker/PostgreSQL/database schemas etc.
- It's trivial to write a REST server and the client already consumes JSON
- It's lightning fast to load into RAM in Python
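As a rough illustration of the flat-file approach, the sketch below loads a (possibly gzip-compressed) JSON file into a plain dictionary using only the standard library. The helper names and the top-level `"transcripts"` key are assumptions for this example, not a description of cdot's client API or file format.

```python
import gzip
import json


def load_transcripts(path):
    """Load a transcript JSON file (optionally gzip-compressed) into a dict."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def get_transcript(data, accession):
    """Return the record for one transcript accession, or None if absent.

    Assumes a hypothetical layout: {"transcripts": {accession: record}}.
    """
    return data.get("transcripts", {}).get(accession)
```

Once loaded, each lookup is an in-memory dictionary access, with no database connection or schema to manage.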