Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transcript aliases - HGVS using gene names (canonical transcripts) or LRG #36

Open
davmlaw opened this issue Mar 6, 2023 · 2 comments

Comments

@davmlaw
Copy link
Contributor

davmlaw commented Mar 6, 2023

Often someone will want to resolve a gene symbol, ie: BRCA2:c.36del

Data providers could take an argument eg resolve_genes_with_canonical_transcripts=True which would look up the MANE select transcript, then substitute it for eg NM_000059.4:c.36del

The same goes with LRG, eg Variant Validator gives "LRG_1:g.8638G>T automapped to equivalent RefSeq record
NG_007400.1:g.8638G>T"

PROBLEM

The way the biocommons HGVS code current works (and our data provider) is to retrieve the transcript (build independent) then get coordinates out later. So switching the transcripts out before knowing what build it is could cause troubles, as eg Eg NUDT4B - has transcript NM_001355407 which is MANE select (v2) but no present at all in 37

Example: OR8G1 has NM_001002905.1 as the "37 RefSeq select" and NM_001002905.2 as "38 MANE Select" (NM_001002905.2 does not exist in 37) - this is pretty rare though (out of 19k MANE transcripts - 20 don't have anything for 37, 22 don't have the version)

This would be seemingly straight forward to do inside the data provider by just switching out transcripts using an aliases (eg LRG or gene names for canonical transcript, eg:

# cdot.hgvs.dataproviders.json_data_provider.JSONDataProvider

self.aliases = {
    "LRG_1t1": "NM_000088.3",
}

    def _resolve_transcript_alias(self, tx_ac):
        """ Returns an alias if we have one, otherwise tx_ac unmodified """
        return self.aliases.get(tx_ac, tx_ac)

    def _get_transcript(self, tx_ac):
        tx_ac = self._resolve_transcript_alias(tx_ac)
        return self.transcripts.get(tx_ac)

    def get_seq(self, ac, start_i=None, end_i=None):
        ac = self._resolve_transcript_alias(ac)
        return super().get_seq(ac, start_i=start_i, end_i=end_i)

LRG

wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

Records look like:

9606	1277	COL1A1	NG_007400.1	LRG_1	NM_000088.3	t1	NP_000079.2	p1	reference standard

Seemingly build independent

@davmlaw
Copy link
Contributor Author

davmlaw commented Mar 6, 2023

You could possibly do it this way then just throw an exception the few times it fails (ie returns a MANE transcript that isn't in 37 - note only if you have a data file with both genome builds too) the downside is there is no way to return a message, eg that you substituted out a transcript in there, so this may be slightly dodgy

@davmlaw
Copy link
Contributor Author

davmlaw commented Mar 6, 2023

Few thoughts while looking through code:

Interval tree and gene stuff should be separated as interval takes way longer. Can move the canonical into that loop too (or just always do it)

You may want to indicate a sorting mechanism to pick what gets swapped out eg refseq vs ensembl etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant