You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
RefSeq transcript sequences can be different from the reference sequence (even if they agree with 1 build they can be different across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.
This alignment can sometimes produce insertions / deletions (5-10% of transcripts), eg in the GFF file there is a “cDNA match” string that records the alignment, and has a “Gap” entry:
NM_015120.4 has cDNA_match Gap=M185 I3 M250 - meaning there was 185 bases matched, 3 bases inserted then back to matching. You can see how this affects PyHGVS conversion downstream from the gaps:
To resolve the gaps, pyHGVS transcripts now need to contain the cDNA_match information from the GFF files. I've made this data available for download (or you can produce your own) see #26 (comment)
Instead of adding a separate code path to handle alignment gaps, I treat exons without gaps as cDNA matches with 100% alignment. This keeps all code running through the same path so should minimise future work.
The work was done on top of a lot of my existing changes (fixing other bugs etc) but if the project starts accepting pull requests again, please let me know and I can make a patch against master.
RefSeq transcript sequences can be different from the reference sequence (even if they agree with 1 build they can be different across builds). These sequences are aligned against the genome to produce exon coordinates in GFF releases.
This alignment can sometimes produce insertions / deletions (5-10% of transcripts), eg in the GFF file there is a “cDNA match” string that records the alignment, and has a “Gap” entry:
NM_015120.4 has cDNA_match Gap=M185 I3 M250 - meaning there was 185 bases matched, 3 bases inserted then back to matching. You can see how this affects PyHGVS conversion downstream from the gaps:
2:73385942 A>T: NM_015120.4(ALMS1):c.74A>T (correct)
2:73385943 A>T: NM_015120.4(ALMS1):c.75A>T (off by 3, VEP gives NM_015120.4:c.78A>T)
2:73385944 G>C: NM_015120.4(ALMS1):c.76G>C (off by 3, VEP gives NM_015120.4:c.79G>C)
The text was updated successfully, but these errors were encountered: