Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDS phase (offset for eg ribo slippage) #76

Open
holtgrewe opened this issue Mar 8, 2024 · 5 comments
Open

CDS phase (offset for eg ribo slippage) #76

holtgrewe opened this issue Mar 8, 2024 · 5 comments

Comments

@holtgrewe
Copy link
Contributor

holtgrewe commented Mar 8, 2024

I'm puzzled and need help understanding. Consider NM_015068.3. I think that the exons given in the CDOT JSON skip one base that needs to be in the CDS as it's read double (slippage). What do you think?

Here is how it looks like in cdot-0.2.24.refseq.grch37.json.

# jq '.transcripts | to_entries[] | (select(.key == "NM_015068.3"))' cdot-0.2.24.refseq.grch37.json
{
  "key": "NM_015068.3",
  "value": {
    "biotype": [
      "mRNA"
    ],
    "gene_name": "PEG10",
    "gene_version": "23089",
    "genome_builds": {
      "GRCh37": {
        "cds_end": 94294994,
        "cds_start": 94292868,
        "contig": "NC_000007.13",
        "exons": [
          [
            94285636,
            94285892,
            0,
            1,
            256,
            null
          ],
          [
            94292645,
            94299007,
            1,
            257,
            6618,
            null
          ]
        ],
        "note": "protein translation is dependent on -1 ribosomal frameshift%3B isoform 1 is encoded by transcript variant 1",
        "strand": "+",
        "url": "https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"
      }
    },
    "hgnc": "14005",
    "id": "NM_015068.3",
    "protein": "NP_055883.2",
    "start_codon": 479,
    "stop_codon": 2605
  }
}

NCBI Gene says

[...]
     CDS             join(480..1436,1436..2605)
[...]

The original GFF3 says

 egrep -w 'ID=gene14272|Parent=rna16954|NM_015068.3' ref_GRCh37.p10_top_level.gff3
NC_000007.13    RefSeq  gene    94285637        94299007        .       +       .       ID=gene14272;Name=PEG10;Dbxref=GeneID:23089,HGNC:14005,MIM:609810;description=paternally expressed 10;gbkey=Gene;gene=PEG10;gene_synonym=EDR,HB-1,Mar2,Mart2,MEF3L,RGAG3
NC_000007.13    RefSeq  mRNA    94285637        94299007        .       +       .       ID=rna16953;Name=NM_015068.3;Parent=gene14272;Dbxref=GeneID:23089,Genbank:NM_015068.3,HGNC:14005,MIM:609810;gbkey=mRNA;gene=PEG10;product=paternally expressed 10%2C transcript variant 1;transcript_id=NM_015068.3
NC_000007.13    RefSeq  exon    94285682        94285903        .       +       .       ID=id161180;Parent=rna16954;Dbxref=GeneID:23089,Genbank:NM_001172437.1,HGNC:14005,MIM:609810;gbkey=mRNA;gene=PEG10;product=paternally expressed 10%2C transcript variant 2;transcript_id=NM_001172437.1
NC_000007.13    RefSeq  exon    94292646        94299007        .       +       .       ID=id161181;Parent=rna16954;Dbxref=GeneID:23089,Genbank:NM_001172437.1,HGNC:14005,MIM:609810;gbkey=mRNA;gene=PEG10;product=paternally expressed 10%2C transcript variant 2;transcript_id=NM_001172437.1
NC_000007.13    RefSeq  CDS     94285899        94285903        .       +       0       ID=cds13063;Name=NP_001165908.1;Parent=rna16954;Note=isoform 3 is encoded by transcript variant 2;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;exception=ribosomal slippage;gbkey=CDS;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
NC_000007.13    RefSeq  CDS     94292646        94293825        .       +       1       ID=cds13063;Name=NP_001165908.1;Parent=rna16954;Note=isoform 3 is encoded by transcript variant 2;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;exception=ribosomal slippage;gbkey=CDS;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
NC_000007.13    RefSeq  CDS     94293825        94294994        .       +       0       ID=cds13063;Name=NP_001165908.1;Parent=rna16954;Note=isoform 3 is encoded by transcript variant 2;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;exception=ribosomal slippage;gbkey=CDS;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
@holtgrewe
Copy link
Contributor Author

Also

NM_001172437.2
NM_001184961.1

The following transcripts are protein coding but don't have a CDS length of a multiple of 3. Some are also marked as slippage. However they don't have the overlap in CDS exons as far as I could see.

NM_004152.3
NM_001301020.1
NM_002537.3
NM_001301302.1

@holtgrewe
Copy link
Contributor Author

Here is the UTA output for NM_015068.3.

uta=> select * from uta_20210129b.tx_def_summary_v where tx_def_summary_v.tx_ac = 'NM_015068.3';
-[ RECORD 1 ]-------+---------------------------------
exon_set_id         | 349795
tx_ac               | NM_015068.3
alt_ac              | NM_015068.3
alt_aln_method      | transcript
alt_strand          | 1
hgnc                | PEG10
cds_md5             | b87df1644c80a908da9b7d8a30468084
es_fingerprint      | c0183212a58ec4219c045248df36b55b
cds_es_fp           | 95d8b77d6220eb8d205b5eb927ec114c
cds_exon_lengths_fp | 3b92d18aa7a6176dd37d372bc2f1eb71
n_exons             | 2
se_i                | 0,256;256,6618
cds_se_i            | 479,2605
starts_i            | {0,256}
ends_i              | {256,6618}
lengths             | {256,6362}
cds_start_i         | 479
cds_end_i           | 2605
cds_start_exon      | 1
cds_end_exon        | 1

and some more

uta=> select * from uta_20210129b.tx_def_summary_v where tx_def_summary_v.tx_ac = 'NM_004152.3';
-[ RECORD 1 ]-------+----------------------------------------
exon_set_id         | 349259
tx_ac               | NM_004152.3
alt_ac              | NM_004152.3
alt_aln_method      | transcript
alt_strand          | 1
hgnc                | OAZ1
cds_md5             | f5cafd85bcfead7e657f0c554125a2d2
es_fingerprint      | 35edd88f9b7d944cbda18c665b52aea7
cds_es_fp           | a6994ac96b3ecd845ff44a94fed47c18
cds_exon_lengths_fp | 85b85525c647eae35eb4deb4c26b2776
n_exons             | 5
se_i                | 0,259;259,405;405,578;578,664;664,1181
cds_se_i            | 113,259;259,405;405,578;578,664;664,801
starts_i            | {0,259,405,578,664}
ends_i              | {259,405,578,664,1181}
lengths             | {259,146,173,86,517}
cds_start_i         | 113
cds_end_i           | 801
cds_start_exon      | 0
cds_end_exon        | 4

uta=> select * from uta_20210129b.tx_def_summary_v where tx_def_summary_v.tx_ac = 'NM_001301020.1';
-[ RECORD 1 ]-------+----------------------------------------
exon_set_id         | 347618
tx_ac               | NM_001301020.1
alt_ac              | NM_001301020.1
alt_aln_method      | transcript
alt_strand          | 1
hgnc                | OAZ1
cds_md5             | 4129a9f2c8b8a5df9c8fffd30208024e
es_fingerprint      | 5f4f21d019bb1d152ef32baf9e5ec92f
cds_es_fp           | 5eab802d6b244bb49be259b5fabc424a
cds_exon_lengths_fp | 88059a870673c88c567d3f2d674fc403
n_exons             | 5
se_i                | 0,259;259,399;399,572;572,658;658,1175
cds_se_i            | 113,259;259,399;399,572;572,658;658,795
starts_i            | {0,259,399,572,658}
ends_i              | {259,399,572,658,1175}
lengths             | {259,140,173,86,517}
cds_start_i         | 113
cds_end_i           | 795
cds_start_exon      | 0
cds_end_exon        | 4

uta=> select * from uta_20210129b.tx_def_summary_v where tx_def_summary_v.tx_ac = 'NM_002537.3';
-[ RECORD 1 ]-------+----------------------------------------
exon_set_id         | 349097
tx_ac               | NM_002537.3
alt_ac              | NM_002537.3
alt_aln_method      | transcript
alt_strand          | 1
hgnc                | OAZ2
cds_md5             | e9892dbf85b12e83cfce811c6f2d79f0
es_fingerprint      | 759d543e96a31ee1bdd0e169c690d875
cds_es_fp           | 4c7bf02afed7f1a3fffaf1281a327e23
cds_exon_lengths_fp | c41e27ecfeae67b9debdffb74dea4743
n_exons             | 5
se_i                | 0,253;253,411;411,587;587,673;673,1934
cds_se_i            | 233,253;253,411;411,587;587,673;673,804
starts_i            | {0,253,411,587,673}
ends_i              | {253,411,587,673,1934}
lengths             | {253,158,176,86,1261}
cds_start_i         | 233
cds_end_i           | 804
cds_start_exon      | 0
cds_end_exon        | 4

uta=> select * from uta_20210129b.tx_def_summary_v where tx_def_summary_v.tx_ac = 'NM_001301302.1';
-[ RECORD 1 ]-------+----------------------------------------
exon_set_id         | 347760
tx_ac               | NM_001301302.1
alt_ac              | NM_001301302.1
alt_aln_method      | transcript
alt_strand          | 1
hgnc                | OAZ2
cds_md5             | b37122a9fdad16364189a05ccdec2b12
es_fingerprint      | 741e13a02e0991e512c61ceb6d652c92
cds_es_fp           | b2620ac572e16cb98b51298f39b9fe96
cds_exon_lengths_fp | 9fe058e4b9ff38d9db4bea4f13152963
n_exons             | 5
se_i                | 0,253;253,408;408,584;584,670;670,1931
cds_se_i            | 233,253;253,408;408,584;584,670;670,801
starts_i            | {0,253,408,584,670}
ends_i              | {253,408,584,670,1931}
lengths             | {253,155,176,86,1261}
cds_start_i         | 233
cds_end_i           | 801
cds_start_exon      | 0
cds_end_exon        | 4

@davmlaw
Copy link
Contributor

davmlaw commented Mar 12, 2024

Yes, this is a problem and the correct thing to do would be adjust for it. The GFF3 spec has Column 8: "phase" - which I think you need to use for correct calculations.

I haven't looked into it deeply, but I don't think you can just handle it by altering the start/end coordinates of a CDS (otherwise they would do that in the GFF) but rather you have to take into account this phasing in the HGVS conversion code.

Given that our transcript from GFF3 is the same as the UTA transcript, it looks like they have the same trouble. They probably don't have a way to adjust for the phasing here.

So - I think you need to raise it as a BioCommons HGVS issue - they will probably have to alter the UTA schema as well.

CDOT

We can add the GFF3 phasing score of 0/1/2 to our cdot transcripts, in anticipation for it being fixed in HGVS, so that historical data can be used for a data provider....

The most obvious place would be adding it to a field in "exons" but this would break existing code:

for (alt_start_i, alt_end_i, exon_id, cds_start_i, cds_end_i, gap) in exons:

So we'd have to bump the major version of the data version (note to future self: use an asterisk to collect remaining fields so you can add ones in future, or avoid arrays with positional info and use dicts with keys)

A non-breaking way to add this would be to have a new array that is parallel with exons, eg "phasing" that contains the offsets

@davmlaw
Copy link
Contributor

davmlaw commented Mar 12, 2024

Interested to hear thoughts especially about how we basically can't do anything useful w/o HGVS

I think we should add a "phasing" array to our JSON (if it's not all zeros) so it doesn't break backwards compatibility with exon fields.

Until biocommons add support this is just making our files bigger for no benefit, but at least we'll be correct and maybe tools from other languages etc can use it.

Then, if biocommons HGVS adds support, we can modify our data provider classes to get this into biocommons HGVS

I also made an issue about breaking changes (so we can learn from our mistakes here - about using fixed length arrays where we may want to add info in the future)

@davmlaw
Copy link
Contributor

davmlaw commented Mar 12, 2024

I don't get why phase is 0 in GRCh38?

$ zgrep "NM_015068.3" GCF_000001405.40_GRCh38.p14_genomic.110.gff.gz
NC_000007.14	BestRefSeq	mRNA	94656325	94669695	.	+	.	ID=rna-NM_015068.3;Parent=gene-PEG10;Dbxref=GeneID:23089,Genbank:NM_015068.3,HGNC:HGNC:14005,MIM:609810;Name=NM_015068.3;gbkey=mRNA;gene=PEG10;product=paternally expressed 10%2C transcript variant 1;transcript_id=NM_015068.3
NC_000007.14	BestRefSeq	exon	94656325	94656580	.	+	.	ID=exon-NM_015068.3-1;Parent=rna-NM_015068.3;Dbxref=GeneID:23089,Genbank:NM_015068.3,HGNC:HGNC:14005,MIM:609810;gbkey=mRNA;gene=PEG10;product=paternally expressed 10%2C transcript variant 1;transcript_id=NM_015068.3
NC_000007.14	BestRefSeq	exon	94663334	94669695	.	+	.	ID=exon-NM_015068.3-2;Parent=rna-NM_015068.3;Dbxref=GeneID:23089,Genbank:NM_015068.3,HGNC:HGNC:14005,MIM:609810;gbkey=mRNA;gene=PEG10;product=paternally expressed 10%2C transcript variant 1;transcript_id=NM_015068.3
NC_000007.14	BestRefSeq	CDS	94663557	94664513	.	+	0	ID=cds-NP_055883.2;Parent=rna-NM_015068.3;Dbxref=GeneID:23089,Genbank:NP_055883.2,HGNC:HGNC:14005,MIM:609810;Name=NP_055883.2;Note=protein translation is dependent on -1 ribosomal frameshift%3B isoform 1 is encoded by transcript variant 1;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 1;protein_id=NP_055883.2
NC_000007.14	BestRefSeq	CDS	94664513	94665682	.	+	0	ID=cds-NP_055883.2;Parent=rna-NM_015068.3;Dbxref=GeneID:23089,Genbank:NP_055883.2,HGNC:HGNC:14005,MIM:609810;Name=NP_055883.2;Note=protein translation is dependent on -1 ribosomal frameshift%3B isoform 1 is encoded by transcript variant 1;exception=ribosomal slippage;gbkey=CDS;gene=PEG10;product=retrotransposon-derived protein PEG10 isoform 1;protein_id=NP_055883.2

So are we supposed to ignore phase column here and instead get the info from "Note=protein translation is dependent on -1 ribosomal frameshift"

@davmlaw davmlaw changed the title Question about NM_015068.3 CDS phase (offset for eg ribo slippage) Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants