ENSEMBL ID version conversion #24

grabear · 2020-08-21T16:26:27Z

Consider adding functionality to EnrichmentBrowser::idMap so that it automatically validates/converts ENSEMBL IDs from id.version to id (e.g. ENSG00000002919.14 to ENSG00000002919). Try to preserve id.version by adding another column to rowData. This is really more of an issue with AnnotationDbi, but it couldn't hurt.

gsub("\\..*", "", row.names(ens_table))

Originally posted by @grabearummc in #23 (comment)


grabear · 2020-08-21T16:34:24Z

@lgeistlinger Good idea on the new issue.

AnnotationDbi::mapIds is used in 3 internal functions in mapIds.R, but it looks like it might only be relevant here:

EnrichmentBrowser/R/mapIds.R

Lines 201 to 219 in 4357b80

    .mapStats <- function(sgenes, org, from, to, multi.to)
    {
        orgpkg <- .getAnnoPkg(org)
        sgenes <- sgenes[!is.na(sgenes)]
        sgenes <- sgenes[sgenes != ""]
        suppressMessages(
            sgenes <- AnnotationDbi::mapIds(orgpkg, keys = sgenes,
                                            column = to,
                                            keytype = from,
                                            multiVals = "list")
        )
        sgenes <- .resolveMultiTo(sgenes, orgpkg, from, to, multi.to)
        nr.na <- sum(is.na(sgenes))
        if(nr.na) message(paste("Excluded", nr.na,
                                "from.IDs without a corresponding to.ID"))
        nr.mf <- sum(table(sgenes) > 1)
        if(nr.mf) message(paste("Encountered", nr.mf, "to.IDs with >1 from.ID"))
        return(sgenes)
    }

EnrichmentBrowser/R/mapIds.R

Lines 272 to 290 in 4357b80

    .idmap <- function(ids, anno, from, to,
        excl.na=TRUE, multi.to="first", resolve.multiFrom=TRUE)
    {
        anno.pkg <- .getAnnoPkg(anno)
        suppressMessages(
            x <- AnnotationDbi::mapIds(anno.pkg,
                    keys=ids, keytype=from, column=to, multiVals="list")
        )
        # case 1: multiple to.IDs (1:n) -> select one
        x <- .resolveMultiTo(x, anno.pkg, from, to, multi.to)
        # case 2: no to.ID -> exclude
        if(excl.na) x <- .exclNaIds(x)
        # case 3: multiple from.IDs (n:1) -> select one
        if(resolve.multiFrom) x <- .getFirstToId(x)
        return(x)
    }

grabear · 2020-08-21T17:00:31Z

@lgeistlinger

grabearummc@dbe316a

Here's my fix. If you are happy with it, then I will create a PR. Other solutions might involve:

- detecting ENSEMBL IDs with version info and then stripping the version.
- same as the previous, but also preserving the original IDs in another column of the original object. For SE objects, that might look like this in the idMap function:
```
rowData(SE)[["ENSEMBL.id"]] <- names(SE)
names(SE) <- gsub("\\..*", "", names(SE))
```
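The two options above can be sketched in base R without any Bioconductor dependency. This is only an illustration: `strip_ensembl_version` is a hypothetical helper, not part of EnrichmentBrowser or the linked commit, and the regular expression is an assumed description of Ensembl-style IDs ("ENS", optional letters, 11 digits, then a ".version" suffix). Unlike `gsub("\\..*", "", ...)`, it only touches IDs that match that pattern and returns the original IDs alongside the stripped ones.

```r
# Hypothetical helper (not part of EnrichmentBrowser): strip a trailing
# ".<version>" from Ensembl-style IDs and keep the original IDs alongside.
strip_ensembl_version <- function(ids) {
    # Assumed pattern: "ENS", optional species/feature letters, 11 digits,
    # then ".<digits>" as the version suffix.
    versioned <- grepl("^ENS[A-Z]*[0-9]{11}\\.[0-9]+$", ids)
    stripped <- ids
    stripped[versioned] <- sub("\\.[0-9]+$", "", ids[versioned])
    data.frame(original = ids, id = stripped, was.versioned = versioned,
               stringsAsFactors = FALSE)
}

ids <- c("ENSG00000002919.14", "ENSG00000141510", "SNX11")
res <- strip_ensembl_version(ids)
print(res)
```

For an SE object, `res$original` would be the column stored in rowData and `res$id` the new row names, following the same idea as the snippet above.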

lgeistlinger · 2020-08-21T23:33:27Z

Thanks. Can you provide an example where the mapping results in such versioned ENSEMBL gene ids? If that's caused by outdated mappings in the corresponding org.db package, then it is worth fixing it directly there instead of working around it downstream.

grabearummc · 2020-08-22T00:26:08Z

In my commit, I remove the ENSEMBL versioning information before doing the mapping with AnnotationDbi. AnnotationDbi::mapIds will break if your keys/ids are ENSEMBL IDs that carry a version.

I don't think that the org.db packages use the versioning information (which is the issue), but I could be wrong. Is that what you mean?

For me, the version info is introduced way before my R pipeline. For this instance specifically, I was using salmon/gencode for quantification.

lgeistlinger · 2020-08-22T01:43:46Z

I see, we are talking here about providing versioned IDs to the ID mapping. Well, although I can see that this might be handy to have, I think in this case it's best to leave it up to the user to provide valid (here: unversioned) gene IDs that are compatible with mapping via AnnotationDbi::mapIds. The good thing is that it seems to take just a gsub command to have the IDs ready for the mapping.

grabearummc · 2020-08-24T17:08:52Z

Ok, thanks for the response @lgeistlinger. When I have some extra time, I will get some feedback from the AnnotationDbi repository, and link back to this issue.

lgeistlinger · 2020-08-24T17:11:10Z

It might even be worth understanding why your GENCODE reference includes versioned gene IDs in the first place.

grabearummc · 2020-08-24T21:44:35Z

You got me curious @lgeistlinger . I definitely had to google some of this so let me know if you have some insight.

ENSEMBL IDs contain a version (ENS***.Version), so that when things change......

- Genes: increments when the set of transcripts linked to a gene changes
- Transcripts: increments when there is a change in a transcript's splicing pattern, chromosome location or a sequence change in the cDNA
- Proteins: increments when there is a sequence change in the peptide sequence
- Exons: increments when there is a sequence change in the exon genomic sequence

......the older references can be preserved.
https://m.ensembl.org/Help/Faq?id=488
http://uswest.ensembl.org/info/genome/stable_ids/index.html
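To make the ID.Version layout concrete, here is a small base R sketch that splits such an ID into its stable part and its integer version. `split_ensembl_id` is a hypothetical name used only for illustration, and the code assumes the version is whatever follows the last dot.

```r
# Hypothetical helper: split "ENSG00000002919.14" into stable ID and version.
split_ensembl_id <- function(ids) {
    has.version <- grepl("\\.[0-9]+$", ids)
    stable <- sub("\\.[0-9]+$", "", ids)
    # IDs without a version suffix get NA as their version
    version <- rep(NA_integer_, length(ids))
    version[has.version] <- as.integer(sub("^.*\\.", "", ids[has.version]))
    data.frame(stable.id = stable, version = version,
               stringsAsFactors = FALSE)
}

parts <- split_ensembl_id(c("ENSG00000002919.14", "ENSG00000002919"))
print(parts)
```

Keeping the version as its own column means it can be carried along for provenance while only the stable ID is used as a mapping key.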

GENCODE is a project that produces high-quality reference gene annotation for human and mouse based on ENSEMBL, so its files should carry the versioning info.
http://uswest.ensembl.org/Help/Faq?id=303
https://www.gencodegenes.org/pages/faq.html

My question is: why don't the OrgDbs contain the versioning information? Is it just because OrgDbs primarily map to Entrez IDs?

lgeistlinger · 2020-08-24T23:55:53Z

I think it reflects the scope of the two different applications (read mapping vs gene ID mapping).

For read mapping, different versions of a gene ID can come with updates to the genomic coordinates / chromosomal location of the gene (e.g. when a novel transcript is annotated to the gene). This, in turn, can also result in a different read count for that gene, with e.g. more reads falling onto the updated coordinates.

For gene ID mapping, however, the version does not matter: when mapping e.g. from ENSEMBL IDs to gene symbols, ENSG00000002919 maps to SNX11, and thus so do ENSG00000002919.1, ENSG00000002919.2, ..., ENSG00000002919.14. Therefore AnnotationDbi doesn't care about the versions either. At least this is how I understand it.
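The point about versions being irrelevant for ID mapping can be illustrated with a toy lookup table. This is a sketch only: `toy.map` is a one-entry named vector standing in for an OrgDb, under the assumption (from this thread) that the real annotation is keyed by unversioned IDs. A plain vector lookup returns NA for unknown keys, which is not necessarily how AnnotationDbi::mapIds reacts to them.

```r
# Toy lookup standing in for an OrgDb (assumption: keyed by unversioned
# Ensembl gene IDs, as discussed in this thread).
toy.map <- c(ENSG00000002919 = "SNX11")

versioned <- paste0("ENSG00000002919.", c(1, 2, 14))

# Versioned keys are not in the table, so a direct lookup finds nothing
direct <- unname(toy.map[versioned])

# After stripping the version, every key maps to the same symbol
stripped <- unname(toy.map[sub("\\.[0-9]+$", "", versioned)])

print(direct)
print(stripped)
```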

lgeistlinger closed this as completed Sep 1, 2020
