ENSEMBL ID version conversion #24

Closed
grabear opened this issue Aug 21, 2020 · 9 comments

@grabear

grabear commented Aug 21, 2020

Consider adding functionality to EnrichmentBrowser::idMap so that it automatically validates/converts ENSEMBL ids from id.version to id (e.g. ENSG00000002919.14 to ENSG00000002919). Try to conserve id.version by adding another column to rowData. This is really more of an issue with AnnotationDbi, but it couldn't hurt.

gsub("\\..*", "", row.names(ens_table))

Originally posted by @grabearummc in #23 (comment)
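
As a rough illustration of the idea above (strip the version but keep the original id.version in a separate column), here is a minimal base R sketch. The table contents and the column name `ENSEMBL.id.version` are made up for the example; only the `gsub` call comes from the comment itself.

```r
# Hypothetical example table; row names are versioned ENSEMBL gene IDs.
# The counts and the second and third IDs are illustrative only.
ens_table <- data.frame(
  count = c(10, 25, 3),
  row.names = c("ENSG00000002919.14", "ENSG00000000003.15", "ENSG00000000005.6")
)

versioned <- row.names(ens_table)

# Keep the original id.version in its own column before touching the row names.
ens_table$ENSEMBL.id.version <- versioned

# Drop everything from the first dot onwards.
unversioned <- gsub("\\..*", "", versioned)

# Row names must be unique, so fail early if stripping created duplicates.
stopifnot(!anyDuplicated(unversioned))
row.names(ens_table) <- unversioned

print(ens_table)
```

The duplicate check is a design choice for the sketch: two versions of the same gene in one table would collapse to the same row name, and it seems safer to stop than to pick one silently.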

@grabear
Author

grabear commented Aug 21, 2020

@lgeistlinger Good idea on the new issue.

AnnotationDbi::mapIds is used in 3 internal functions in mapIds.R, but it looks like it might only be relevant here:

.mapStats <- function(sgenes, org, from, to, multi.to)
{
    orgpkg <- .getAnnoPkg(org)
    sgenes <- sgenes[!is.na(sgenes)]
    sgenes <- sgenes[sgenes != ""]
    suppressMessages(
        sgenes <- AnnotationDbi::mapIds(orgpkg, keys = sgenes,
                                        column = to,
                                        keytype = from,
                                        multiVals = "list")
    )
    sgenes <- .resolveMultiTo(sgenes, orgpkg, from, to, multi.to)
    nr.na <- sum(is.na(sgenes))
    if(nr.na) message(paste("Excluded", nr.na,
                            "from.IDs without a corresponding to.ID"))
    nr.mf <- sum(table(sgenes) > 1)
    if(nr.mf) message(paste("Encountered", nr.mf, "to.IDs with >1 from.ID"))
    return(sgenes)
}

.idmap <- function(ids, anno, from, to,
                   excl.na=TRUE, multi.to="first", resolve.multiFrom=TRUE)
{
    anno.pkg <- .getAnnoPkg(anno)
    suppressMessages(
        x <- AnnotationDbi::mapIds(anno.pkg,
                keys=ids, keytype=from, column=to, multiVals="list")
    )
    # case 1: multiple to.IDs (1:n) -> select one
    x <- .resolveMultiTo(x, anno.pkg, from, to, multi.to)
    # case 2: no to.ID -> exclude
    if(excl.na) x <- .exclNaIds(x)
    # case 3: multiple from.IDs (n:1) -> select one
    if(resolve.multiFrom) x <- .getFirstToId(x)
    return(x)
}

@grabear
Author

grabear commented Aug 21, 2020

@lgeistlinger

grabearummc@dbe316a

Here's my fix. If you are happy with it, then I will create a PR. Other solutions might involve:

  • detecting ENSEMBL ids with version info and then making a change.
  • same as the previous, but also conserving the original ids in another column in the original object.
    • for SE objects that might look like this in the idMap function:
    rowData(SE)[["ENSEMBL.id"]] <- names(SE)
    names(SE) <- gsub("\\..*", "", names(SE))
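
The first option in the list above (detect versioned ENSEMBL ids, then change only those) could be sketched in base R roughly as follows. The helper names `has_version` and `strip_version` are hypothetical, and the regular expression is a simplified assumption about the ID layout ("ENS", some capital letters, 11 digits, a dot, a version number), not a validated specification.

```r
# Hypothetical helper: TRUE for IDs that look like a versioned ENSEMBL stable ID.
# Assumed pattern (a simplification): ENS + letters + 11 digits + "." + version.
has_version <- function(ids) {
  grepl("^ENS[A-Z]+[0-9]{11}\\.[0-9]+$", ids)
}

# Hypothetical helper: remove the version suffix only where one was detected,
# leaving every other ID untouched.
strip_version <- function(ids) {
  versioned <- has_version(ids)
  ids[versioned] <- sub("\\.[0-9]+$", "", ids[versioned])
  ids
}

ids <- c("ENSG00000002919.14", "ENSG00000002919", "ENSMUSG00000000001.4", "SNX11")
print(has_version(ids))
print(strip_version(ids))
```

Compared with a blanket `gsub("\\..*", "", ids)`, this leaves non-ENSEMBL identifiers that happen to contain a dot alone, at the cost of depending on the assumed pattern.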

@lgeistlinger
Owner

Thanks. Can you provide an example where the mapping results in such versioned ENSEMBL gene ids? If that's caused by outdated mappings in the corresponding org.db package, then it is worth fixing it directly there instead of working around it downstream.

@grabearummc

I was removing the ENSEMBL versioning information in my commit before doing the mapping with AnnotationDbi. AnnotationDbi::mapIds will break if your keys/ids are ENSEMBL and have versioning.

I don't think that the org.db packages use the versioning information (which is the issue), but I could be wrong. Is that what you mean?

For me, the version info is introduced way before my R pipeline. For this instance specifically, I was using salmon/gencode for quantification.

@lgeistlinger
Owner

I see we are talking here about providing versioned IDs to the ID mapping. Well, although I can see that this might be handy to have, I think in this case it's best to leave it up to the user to provide valid (here: unversioned) gene IDs that are compatible with mapping via AnnotationDbi::mapIds. Good thing is, here it seems to be just a gsub command to have the IDs ready for the mapping.
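
For readers who want to do this on the user side, a minimal sketch might look like the following. It assumes the AnnotationDbi and org.Hs.eg.db packages are installed (the mapping step is skipped otherwise), and the wrapper name `map_ensembl` is hypothetical; the `mapIds` arguments mirror the ones used in the internal functions quoted earlier in this thread.

```r
# Hypothetical wrapper: strip ENSEMBL versions, map with AnnotationDbi::mapIds,
# and report the result under the original (possibly versioned) IDs.
map_ensembl <- function(ids, orgdb, to = "SYMBOL") {
  keys <- gsub("\\..*", "", ids)
  mapped <- AnnotationDbi::mapIds(orgdb, keys = keys, column = to,
                                  keytype = "ENSEMBL", multiVals = "first")
  # Index by key name so the result lines up with the input order.
  stats::setNames(unname(mapped[keys]), ids)
}

# Only run the mapping when both packages are available (an assumption here).
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(map_ensembl("ENSG00000002919.14", org.Hs.eg.db::org.Hs.eg.db))
}
```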

@grabearummc

Ok, thanks for the response @lgeistlinger. When I have some extra time, I will get some feedback from the AnnotationDbi repository, and link back to this issue.

@lgeistlinger
Owner

It might even be worth understanding why your GENCODE reference would include versioned gene IDs in the first place.

@grabearummc

You got me curious @lgeistlinger. I definitely had to google some of this, so let me know if you have some insight.

ENSEMBL ids contain a version (ENS***.Version), so that when things change...

  • Genes: increments when the set of transcripts linked to a gene changes
  • Transcripts: increments when there is a change in a transcript's splicing pattern, chromosome location or a sequence change in the cDNA
  • Proteins: increments when there is a sequence change in the peptide sequence
  • Exons: increments when there is a sequence change in the exon genomic sequence

...the older references can be preserved.
https://m.ensembl.org/Help/Faq?id=488
http://uswest.ensembl.org/info/genome/stable_ids/index.html

GENCODE is a project to create super accurate mouse/human genetic data from ENSEMBL. So they should have the versioning info.
http://uswest.ensembl.org/Help/Faq?id=303
https://www.gencodegenes.org/pages/faq.html

My question is why don't the OrgDbs contain the versioning information? Is it just because OrgDbs primarily map to the Entrez IDs?

@lgeistlinger
Owner

lgeistlinger commented Aug 24, 2020

I think it reflects the scope of the two different applications (read mapping vs gene ID mapping).

For read mapping, different versions of a gene ID can result in updates to the genomic coordinates / chromosomal location of the gene (e.g. when a novel transcript is annotated to the gene). This, in turn, can also result in a different read count for that gene, e.g. with more reads falling onto the updated coordinates.

For gene ID mapping, however, the version does not matter: when mapping e.g. from ENSEMBL IDs to gene symbols, ENSG00000002919 maps to SNX11, and thus so do ENSG00000002919.1, ENSG00000002919.2, ..., ENSG00000002919.14. Therefore AnnotationDbi also doesn't care about the versions. At least this is how I understand it.
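
The point about gene ID mapping can be illustrated with a toy lookup in base R. The named vector below is only a stand-in for an OrgDb, holding the single ENSG00000002919 to SNX11 pair mentioned above; it is an assumption for illustration and not how AnnotationDbi stores its mappings.

```r
# Toy stand-in for an OrgDb: unversioned ENSEMBL gene ID -> gene symbol.
# Only this single pair is taken from the discussion above.
symbol_by_gene <- c(ENSG00000002919 = "SNX11")

# Three versions of the same gene ID.
versioned <- paste0("ENSG00000002919.", c(1, 2, 14))

# Looking up the versioned keys directly finds nothing,
# because the table is keyed by unversioned IDs.
direct <- unname(symbol_by_gene[versioned])

# After stripping the version, every version resolves to the same symbol.
stripped <- unname(symbol_by_gene[gsub("\\..*", "", versioned)])

print(direct)
print(stripped)
```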
