CDS phase (offset for eg ribo slippage) #76
Comments
Also
The following transcripts are protein coding but don't have a CDS length of a multiple of 3. Some are also marked as slippage. However they don't have the overlap in CDS exons as far as I could see.
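A minimal sketch of the "multiple of 3" check described above. It assumes the CDS is available as a list of half-open `(start, end)` segments. The function names and coordinates are made up for illustration and are not taken from cdot or UTA.

```python
from typing import List, Tuple

def cds_length(cds_segments: List[Tuple[int, int]]) -> int:
    """Sum the lengths of half-open (start, end) CDS segments."""
    return sum(end - start for start, end in cds_segments)

def frame_remainder(cds_segments: List[Tuple[int, int]]) -> int:
    """Return CDS length modulo 3; 0 means the CDS is a whole number of codons."""
    return cds_length(cds_segments) % 3

# Made-up coordinates: 60 + 50 + 31 = 141 bases (in frame).
in_frame = [(100, 160), (200, 250), (300, 331)]
# One extra base in the last segment: 142 bases (not a multiple of 3).
out_of_frame = [(100, 160), (200, 250), (300, 332)]

for name, segments in [("in_frame", in_frame), ("out_of_frame", out_of_frame)]:
    print(name, cds_length(segments), frame_remainder(segments))
```

A non-zero remainder only flags the transcript. It does not say where the missing or extra base is.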
Here is the UTA output for
and some more
Yes, this is a problem, and the correct thing to do would be to adjust for it.

The GFF3 spec has column 8, "phase", which I think you need to use for correct calculations. I haven't looked into it deeply, but I don't think you can handle it just by altering the start/end coordinates of a CDS (otherwise they would do that in the GFF). Instead, the HGVS conversion code has to take the phase into account.

Given that our transcript from GFF3 is the same as the UTA transcript, it looks like they have the same trouble, and they probably don't have a way to adjust for the phase here. So I think you need to raise it as a BioCommons HGVS issue. They will probably have to alter the UTA schema as well.

CDOT

We can add the GFF3 phase value (0/1/2) to our cdot transcripts in anticipation of it being fixed in HGVS, so that historical data can be used by a data provider. The most obvious place would be a new field in "exons", but this would break existing code:
So we'd have to bump the major version of the data version (note to future self: use an asterisk to collect remaining fields so you can add ones in future, or avoid arrays with positional info and use dicts with keys).

A non-breaking way to add this would be a new array parallel to "exons", e.g. "phasing", that contains the offsets.
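To make the compatibility concern concrete, here is a small sketch. The exon layout `[start, end, exon_id, cdna_start, cdna_end]` is an assumption for illustration and may not match cdot's real field order; `read_fixed`, `read_tolerant`, and `fixed_reader_breaks` are hypothetical names.

```python
# Assumed exon layout for illustration only: [start, end, exon_id, cdna_start, cdna_end].
old_exon = [1000, 1100, 1, 1, 100]
new_exon = old_exon + [2]  # the same exon with a phase value appended

def read_fixed(exon):
    # Fixed-length unpacking: raises ValueError as soon as a field is added.
    start, end, exon_id, cdna_start, cdna_end = exon
    return start, end

def read_tolerant(exon):
    # Star-unpacking collects any trailing fields, so added ones are ignored.
    start, end, *_rest = exon
    return start, end

def fixed_reader_breaks(exon):
    """Return True if the fixed-length reader cannot handle this exon."""
    try:
        read_fixed(exon)
    except ValueError:
        return True
    return False

# Non-breaking alternative: leave "exons" untouched and add a parallel array.
transcript = {
    "exons": [old_exon, [1200, 1300, 2, 101, 200]],
    "phasing": [0, 2],  # one entry per exon, in the same order
}
exons_with_phase = list(zip(transcript["exons"], transcript["phasing"]))
```

With the parallel array, readers that unpack a fixed number of exon fields keep working, and only code that knows about "phasing" needs to look at it.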
Interested to hear thoughts, especially about how we basically can't do anything useful without HGVS.

I think we should add a "phasing" array to our JSON (if it's not all zeros) so it doesn't break backwards compatibility with the exon fields. Until biocommons adds support, this just makes our files bigger for no benefit, but at least we'll be correct, and maybe tools in other languages can use it. Then, if biocommons HGVS adds support, we can modify our data provider classes to pass this through to it.

I also made an issue about breaking changes, so we can learn from our mistakes here about using fixed-length arrays where we may want to add info in the future.
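The "only if it's not all zeros" rule could look roughly like this. It is a sketch under assumptions: `set_phasing` and `get_phasing` are hypothetical helpers, not existing cdot functions, and the transcript dict is reduced to the two keys discussed here.

```python
import json

def set_phasing(transcript, phases):
    """Store per-exon phases, omitting the key when every phase is zero."""
    if len(phases) != len(transcript["exons"]):
        raise ValueError("need exactly one phase per exon")
    if any(phases):
        transcript["phasing"] = list(phases)
    else:
        # All zeros carries no information, so keep the file small.
        transcript.pop("phasing", None)
    return transcript

def get_phasing(transcript):
    """Read phases, treating a missing key as all zeros."""
    return transcript.get("phasing", [0] * len(transcript["exons"]))

# Made-up exon coordinates, two fields per exon to keep the example short.
plain = set_phasing({"exons": [[1000, 1100], [1200, 1300]]}, [0, 0])
shifted = set_phasing({"exons": [[1000, 1100], [1200, 1300]]}, [0, 2])

print(json.dumps(plain))
print(json.dumps(shifted))
```

Readers that default a missing key to zeros treat old and new files the same way, which is what keeps the change backwards compatible.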
I don't get why phase is 0 in GRCh38?
So are we supposed to ignore the phase column here and instead get the info from "Note=protein translation is dependent on -1 ribosomal frameshift"?
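For reference, both pieces of information can be read from the same GFF3 line. This sketch relies only on the GFF3 layout of nine tab-separated columns, with phase in column 8 and URL-encoded `key=value` attributes in column 9. The example line is made up (only the Note text comes from this thread), and `parse_gff3_feature` and `has_frameshift_note` are hypothetical helper names.

```python
from urllib.parse import unquote

def parse_gff3_feature(line):
    """Parse one GFF3 feature line into a dict with phase and Note."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    _seqid, _source, ftype, start, end, _score, strand, phase, attrs = cols
    attributes = {}
    for item in attrs.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = unquote(value)  # attribute values are URL-encoded
    return {
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "phase": None if phase == "." else int(phase),  # "." means no phase
        "note": attributes.get("Note", ""),
    }

def has_frameshift_note(feature):
    """True if the Note attribute mentions a ribosomal frameshift."""
    return "ribosomal frameshift" in feature["note"]

# Made-up line for illustration; not copied from a real annotation file.
example = "\t".join([
    "chrN", "Example", "CDS", "1000", "1100", ".", "+", "0",
    "ID=cds-example;Note=protein translation is dependent on -1 ribosomal frameshift",
])
feature = parse_gff3_feature(example)
print(feature["phase"], has_frameshift_note(feature))
```

A line like this one shows the situation asked about: a phase of 0 alongside a Note that flags the frameshift.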
I'm puzzled and need help understanding. Consider NM_015068.3. I think that the exons given in the CDOT JSON skip one base that needs to be in the CDS because it is read twice (slippage). What do you think?
Here is how it looks in cdot-0.2.24.refseq.grch37.json.
NCBI Gene says
The original GFF3 says