UTA transcripts with gaps #480
Comments
So while I can't find a way to flag it via UTA data, you can look up the length here: https://www.ncbi.nlm.nih.gov/nuccore/NM_001271466.3 If it doesn't match transcript_version.length then there must be a gap, so maybe do that check as well. There are 15k UTA transcripts though. We could perhaps compare each one against its GRCh38 version (if that one is from RefSeq and has no gap), and only contact the server if the lengths are not equal. |
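The two-stage check suggested above could be sketched roughly as follows. This is only an illustration: the function names and arguments are hypothetical and not taken from the project's code, and it assumes transcript lengths are already available as plain integers.

```python
def needs_server_lookup(uta_length, grch38_length, grch38_has_gap):
    """Decide whether the cheap local comparison is enough.

    Hypothetical helper: if there is a gap-free GRCh38 RefSeq version with the
    same length, the UTA transcript is cleared without any API call.
    """
    if grch38_length is None or grch38_has_gap:
        # Nothing trustworthy to compare against locally
        return True
    return uta_length != grch38_length


def has_alignment_gap(local_length, reference_length):
    """A length mismatch against the published sequence is treated as a gap."""
    return local_length != reference_length


# Only transcripts that fail the local check need a remote length lookup
print(needs_server_lookup(3600, 3600, False))  # no lookup needed
print(needs_server_lookup(3599, 3600, False))  # lookup needed
```

The point of the first function is just to reduce the number of remote calls; the second is the comparison that would be run on whatever length the server returns.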
Shariant test will go down so need to move onto laptop:
this gives 37:
and 38:
I will write a script to retrieve these from ClinGen, then put the ones that don't match the length into a data migration. TODO:
Also have a think about whether we want to add a new model that has transcript_version length etc - then we can retrieve them as we need via API and check things are good. Thinking about it more, perhaps TranscriptVersion should be outside genome build, with a related object that holds the coordinates |
Worked out a way to retrieve the transcript lengths via API, so we can pull them down, compare them with ours, and mark any that differ as alignment_gap=True. Still todo:
|
Deployment notes
Run in order (steps 1 and 3 are in the upgrade steps in the right order; step 2 is a hack to save time)
genbank.txt is data I saved from my laptop; copying it around saves about half an hour of API calls. It is also here: sacgf.ersa.edu.au:/data/sacgf/admin/variantgrid_setup_data/genbank.txt.gz
To read them and insert:
They have been copied to vg test /mnt/incoming - could put them in static and provide a link here so that others can use them in different environments.
Testing
|
Not sure if this is the right issue for this comment (so feel free to move if needed)
|
@EmmaTudini now we only go upwards with transcript versions. We inserted the very latest RefSeq into Shariant test and there is no v6 in there. If you go to the RefSeq page for the transcript: https://www.ncbi.nlm.nih.gov/nuccore/NM_007294 The most recent version is 4, like we have. There has never been a 5 or a 6 |
@davmlaw We used to go down in versions though? See text from transcript version change flag for that variant: Admin Bot NM_007294.6 (imported) |
We used to, now we only go up. If someone enters a transcript version that doesn't exist (we check RefSeq via API) something went very wrong and I think it should fail. |
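One reading of the "only go up" rule is sketched below. It is an assumption about the behaviour, not the project's actual resolution code, and `resolve_version` is a hypothetical name: the requested version is used if present, otherwise the closest higher version, and the lookup fails if nothing at or above the requested version exists.

```python
def resolve_version(requested, available):
    """Pick the requested version, or the nearest higher one.

    Hypothetical sketch: never falls back to a lower version, and raises
    if no version >= requested is known.
    """
    candidates = sorted(v for v in available if v >= requested)
    if not candidates:
        raise ValueError(
            f"No transcript version >= {requested} (known: {sorted(available)})"
        )
    return candidates[0]


# NM_007294 has versions up to 4, so a request for .6 fails rather than going down
try:
    resolve_version(6, [1, 2, 3, 4])
except ValueError as e:
    print(e)
```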
@TheMadBug Is there any way to check whether this will cause issues in prod? @davmlaw If we only check refseq for transcript versions, what happens to ensembl transcripts? |
For Ensembl, we check the Ensembl API. To work out the differences in prod, I could export the HGVSs and then run some code to re-match them using the current system, and see if they go somewhere different (or fail) |
Ok so there shouldn't be any way that we're out of date and a new transcript version does exist? I think that would be good. @TheMadBug unless there's another way that is easier? |
Also would a variant be rejected if a transcript version didn't exist at all or didn't exist for a specific genome build? |
If the transcript didn't exist in our DB, we'd call the API then die with a bad transcript does not exist message. If it doesn't exist for a particular genome build, current behavior is to skip it on the way to the ones that are in our genome build. #481 change is to call ClinGen with that version anyway, and try to get coordinates for our build that way |
@davmlaw Just uploaded this example to test and it resulted in an error (it's from prod and will cause a lot of issues - there are a few variants on this transcript) - https://test.shariant.org.au/classification/classification/36847 |
That works ok with ClinGen Allele Registry so should be ok once #481 is done |
@davmlaw Just to confirm - we now go straight to RefSeq or Ensembl via the API for every transcript that we see? So even if a transcript and its version have been seen before, we still check for any updates via the API? |
@EmmaTudini there is no update information for transcript versions - if the sequence changed, the version would change too. We contact the API to see whether or not it exists, and when we do that we also pull the sequence down, which we can use to see if there are any alignment gaps |
Worked this one out, see: #492 - ClinGen Allele Registry supplied non-normalized indel coordinates / bases at one point in time. The solution will be to re-retrieve the ClinGen data and then re-match everything. |
@davmlaw I asked James to pull out the number of transcripts with the alignment gap of true vs those where it was false just to make sure that they lined up as expected. Transcripts with a gap alignment of true represent about 11% of the total transcripts. Is this higher than expected? 85289 gap = true |
Wow, looks like it grew quite a lot since I retrieved all of the sequence lengths and set gaps on anything with a different length. I have no idea how many to expect. It doesn't seem unreasonable that there are that many, though it's annoying |
@davmlaw UCSC now has a RefSeq track called "Differences between NCBI RefSeq Transcripts and the Reference Genome", which makes it super easy to look at alignment gaps. So I looked at a couple of examples where there was an alignment gap of true in Shariant test and found that there are some transcripts that are annotated as true that don't have an alignment gap according to the new UCSC track. E.g. BAP1 NM_004656.4 (https://test.shariant.org.au/genes/view_transcript_version/NM_004656/4). I think this also applies to any of the RefSeq BAP1 transcripts. This is the new track that shows gaps in GRCh37 - it doesn't show any gaps (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr3%3A52435020%2D52444121&hgsid=1170874253_RNevAq5NDinRsAGS3r9sLR61ahda). I wonder if doing a diff of the RefSeq length via the API vs looking at how long it is against the genome is creating issues? |
Moved Shariant test allele issues into: |
@EmmaTudini - If you look at the transcript version: https://test.shariant.org.au/genes/view_transcript_version/NM_004656/4 The sequence is 3600bp long, but the sum of the exons is 3599 for both genome builds. If there's no gap, where did the base go? |
Where are you retrieving the total length? If you look at .3 and .2 the gap is even bigger. I was thinking that maybe the total length that you retrieve via the API might be incorrect? |
The length is given as 3600bp on this page: https://www.ncbi.nlm.nih.gov/nuccore/NM_004656.4 But the way I calculate length is to get the sequence via Fasta or API and then count the bases - it's always matched the page (as it should!) |
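The comparison being discussed (bases counted from the FASTA record versus the sum of exon lengths) can be sketched as below. The helper names are hypothetical and the data is a toy example, not the real NM_004656.4 record; exon coordinates are assumed to be zero-based, half-open.

```python
def fasta_sequence_length(fasta_text):
    """Count bases in a single-record FASTA string, skipping the header line."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )


def exon_length_sum(exons):
    """Sum exon lengths, assuming zero-based half-open (start, end) pairs."""
    return sum(end - start for start, end in exons)


# Toy data: a 6 bp transcript whose exons only cover 5 bp of genome
toy_fasta = ">toy_transcript\nACGT\nAC\n"
toy_exons = [(100, 104), (200, 201)]

sequence_length = fasta_sequence_length(toy_fasta)
genomic_length = exon_length_sum(toy_exons)
if sequence_length != genomic_length:
    print(f"Length mismatch: sequence={sequence_length}, exons={genomic_length}")
```

A mismatch here only says the two numbers disagree; it does not distinguish a real alignment gap from stale or off-by-one exon coordinates.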
@davmlaw I went back to the original file referenced in Shariant test - From having a quick look at NM_004656.4 - it seems like the bases referenced in Shariant test in the JSON dropdown are 1bp off the file. E.g. for exon 1 the coordinates in the file are - 52409842, 52410008 whereas the coordinates in shariant test are - 52409841, 52410008 Do you know why that's the case? |
Coordinates used by computers are generally zero-based, half-open intervals. Coordinates shown to humans are usually one-based, fully closed intervals. http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/ |
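As a small illustration of the two conventions, using the exon 1 coordinates quoted earlier in this thread (the function names are made up for this example):

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert a 1-based closed interval (GFF style) to 0-based half-open (BED style)."""
    return start - 1, end


def zero_based_half_open_length(start, end):
    """Length of a 0-based half-open interval."""
    return end - start


def one_based_closed_length(start, end):
    """Length of a 1-based closed interval."""
    return end - start + 1


# Exon 1 of NM_004656.4 as quoted above: file vs Shariant test
file_start, file_end = 52409842, 52410008
converted = one_based_closed_to_zero_based_half_open(file_start, file_end)
print(converted)  # (52409841, 52410008)

# The exon length is the same in both systems
assert zero_based_half_open_length(*converted) == one_based_closed_length(file_start, file_end)
```

Only the start shifts by one; the end stays the same, and the length of each exon is unchanged by the conversion.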
In this case, it seems to change the overall length of the transcript though as one exon does not get moved. See exon 17 in the file. If you calculate the length using the original coordinates from the GCF file, the length is 3600bp which matches the refseq website. |
Good spotting - yes looks like the alignments can change over time so we should update them - moved to issue #494 |
@davmlaw Have been looking into whether having 8000 transcripts with an alignment gap makes sense and found this comment in the docs for a tool called "hgvs" - https://hgvs.readthedocs.io/en/stable/examples/using-hgvs.html?highlight=gap#projecting-in-the-presence-of-a-genome-transcript-gap "As of Oct 2016, 1033 RefSeq transcripts in 433 genes have gapped alignments. These gaps require special handling in order to maintain the correspondence of positions in an alignment. hgvs uses the precomputed alignments in UTA to correctly project variants in exons containing gapped alignments.". I don't think it's plausible that the number of gapped transcripts has grown by 85x in five years? What are your thoughts? Also it looks like genomic.gff files from RefSeq annotate transcripts with gaps. E.g. for ALMS1 there's a note saying - "Note=The RefSeq transcript has 1 non-frameshifting indel compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=ALMS1;inference=similar to RNA sequence%2C mRNA (same species)". Not sure how consistent these annotations are, but this might be an alternative to comparing the lengths? |
Thanks for the HGVS link, I added it to the wiki guide.
I know, see the 1st sentence of this issue :) UTA = Universal Transcript Archive - which is what's used by that HGVS project. The conversations are too long to work out what's going on - will handle everything in #494 |
FYI, I evaluated that library and it was super-complex to use and set up, and I couldn't get it to work locally - I raised a still open issue from 2018 - and in some environments we can't make a direct call to their SQL database. I really wish it had worked, or that I had spent time trying to fix that issue myself; I had to re-add much of that complexity to the other HGVS library and our transcript loading etc |
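A rough sketch of reading that annotation from the attributes column of a GFF3 line, using only the Python standard library. The helper names are hypothetical, and matching on the word "indel" in the Note is an assumption based on the single ALMS1 example above; other note wordings may exist.

```python
import re
from urllib.parse import unquote


def parse_gff3_attributes(column):
    """Split a GFF3 attributes column into a dict, decoding %-escapes."""
    attrs = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def note_reports_indel(attrs):
    """True if the Note mentions an indel relative to the genomic sequence."""
    note = attrs.get("Note", "")
    return bool(re.search(r"\bindels?\b", note)) and "genomic sequence" in note


# Attribute text quoted in the comment above
alms1 = (
    "Note=The RefSeq transcript has 1 non-frameshifting indel compared to this "
    "genomic sequence;exception=annotated by transcript or proteomic data;"
    "gbkey=mRNA;gene=ALMS1;inference=similar to RNA sequence%2C mRNA (same species)"
)
attrs = parse_gff3_attributes(alms1)
print(attrs["gene"], note_reports_indel(attrs))
```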
Re-opened this as we need to handle these
And put this in a file /data/incoming I think TODO:
|
I had run this by hand on test, first setting the version to be + 1000, then eventually setting it to error like below. This was manually run on prod:
|
Before - I rematched every classification that had a transcript marked by UTA as having gaps (which was a lot). I think it set a lot of classifications to have no variant before I rematched anyway, but here's the script if we ever need to run it again:
|
When writing this, and setting errors:
|
This was patched by hand in upgrade on Nov 11. Running the data migration over the manual fix shouldn't change what's used for local HGVS resolution (assuming we haven't uploaded fixed UTA transcripts yet), but it should add the message in "alignment gap" on the transcript version page. |
The data migration makes a slight change over what was done in prod manually. If you go to a UCSC/UTA transcript with a gap, it should say in "alignment gap" what the error is, e.g. https://test.shariant.org.au/genes/view_transcript_version/NM_032932/5 Not only does it not work, but the "alignment gap" field says "Incorrectly converted UTA/UCSC transcript is missing alignment info" |
Think this has already been done in prod? |
Does anything else need to be done? Assuming that any new UTA transcripts will auto be assigned the gap |
The original fix was deployed to prod, though there was a change to give slightly more info in test that isn't in prod yet, see: |
@davmlaw Looks good in test, but not sure I understand what you mean by "Incorrectly converted UTA/UCSC transcript is missing alignment info". Why is it incorrectly converted? I thought that we just didn't convert UTA/UCSC transcripts? |
UTA has exon coordinates and alignment gaps (in their own format). When I converted UTA transcripts to our format long ago, I didn't take the alignments, so it doesn't handle gaps properly. So it was correct on their end, but I converted it incorrectly |
I went through the RefSeq GFFs and searched for any that had an alignment gap. I didn't do this for UTA transcripts.
An example is NM_001271466.3, which for GRCh37 is from UTA and has a different length from the GRCh38 one - which doesn't have a gap. It's likely that the UTA version has a gap but just wasn't flagged.
If we can query UTA to look for affected transcripts (or they had a forum post with affected ones, I think) then we could set them via a DB migration