Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importing DGE data into EnrichmentBrowser #23

Closed
grabearummc opened this issue Aug 21, 2020 · 14 comments
Closed

Importing DGE data into EnrichmentBrowser #23

grabearummc opened this issue Aug 21, 2020 · 14 comments
Labels

Comments

@grabearummc
Copy link

Hi guys. I'm starting to work with EnrichmentBrowser, and I'm running into some issues. I'm looking at porting in DGE data from different sources (DESeq2, limma, and edgeR). My biggest problem is getting these objects into the proper format (SummarizedExperiment).

After exploring this repo some more, I noticed your changelog has a bullet for new import functions for deseq/limma/edgeR. This would be super helpful, so I don't have to write my own functions. So, how stable are they right now? Is the master branch "ok" to use in its current form?

@lgeistlinger
Copy link
Owner

lgeistlinger commented Aug 21, 2020

Thanks for contacting us. I'd say you are good to go with the master branch, and you'll find documentation and working examples when using ?import after installation. See also the documentation of the import function here: https://bioconductor.org/packages/devel/bioc/manuals/EnrichmentBrowser/man/EnrichmentBrowser.pdf

I conducted a number of tests that were looking good, but you would be def among the first users, so I'd be interested in how this works for you and whether we need to extend the functionality further.

@grabearummc
Copy link
Author

Alright, thank you that's great! I'll use this issue as a forum to discuss.

@grabearummc
Copy link
Author

grabearummc commented Aug 21, 2020

Consider adding in functionality for EnrichmentBrowser::idMap so that it automatically validates/converts ENSEMBL ids from id.version to id (e.g. ENSG00000002919.14 to ENSG00000002919). Try to conserve id.version by adding another column to rowData. This is really more of an issue with AnnotationDBI, but it couldn't hurt.

gsub("\\..*", "", row.names(ens_table))

@lgeistlinger
Copy link
Owner

lgeistlinger commented Aug 21, 2020

let's make this another issue ("ENSEMBL ID version conversion"), and let's use the current issue for focusing on functionality of EnrichmentBrowser::import -- renaming the issue for clarity

@lgeistlinger lgeistlinger changed the title How stable is the master branch? Importing DGE data into EnrichmentBrowser Aug 21, 2020
@vivek-verma202
Copy link

@lgeistlinger , thank-you for the update! (esp. import).

# I used apeglm for the res:
res <- lfcShrink(dds, coef="FM_1_vs_0", type="apeglm")
se <- import(dds, res, from = "DESeq2")
Error in .importFromDESeq2(obj, res) : 
  all(rnames %in% colnames(res)) is not TRUE
colnames(res)
[1] "baseMean"       "log2FoldChange" "lfcSE"         
[4] "pvalue"         "padj"  

# repeated without "apeglm"
res <- results(dds)
se <- import(dds, res, from = "DESeq2")
Error in .importFromDESeq2(obj, res) : 
Supported experimental designs include binary group comparisons 
with an optional blocking variable for paired samples / sample batches

design(dds)
~ age + batch + condition

I have 4 questions for you:

  1. Is LFC shrinkage needed / recommended with EnrichmentBrowser?

  2. I have to use age (scaled, continuous) and batch (binary) as covariates to analyze for my condition of interest (binary)? How can I use this information with EnrichmentBrowser to avoid any false positives?

  3. To circumvent covariate problem, I was thinking of using ranked gene list with pi scores res$pi <- res$log2FoldChange*(-log(res$pvalue)), can I do this for downstream topology-based methods in EnrichmentBrowser, if yes, how?

  4. Could you suggest a better way to rank / score genes? I am not sure how should the tie broken in case of non-unique scores.

@lgeistlinger
Copy link
Owner

Thanks for testing + raising these issues @vivek-verma202!
I am currently sprinting towards a deadline for a grant proposal on Sunday, but I am looking into this + will get back to you on Monday.

@vivek-verma202
Copy link

Thanks, good-luck with your grant application!

@grabearummc
Copy link
Author

Alright here is an update for you @lgeistlinger.

I've been using import and I have a few suggestions. For importing limma data I think there needs to be a way for using data from the limma-trend approach. Check page 72 on the limma user guide.

Instead of using voom:

#' # (3) import from voom/limma (RNA-seq count data)
#' # (3a) create the expression data object
#' library(limma)
#' keep <- filterByExpr(counts, rdesign)
#' el <- voom(counts[keep,], rdesign)
#'
#' # (3b) obtain differential expression results
#' fit <- lmFit(el, rdesign)
#' fit <- eBayes(fit, robust = TRUE)
#' res <- topTable(fit, coef = 2, number = nrow(counts), sort.by = "none")
#'
#' # (3c) import
#' se <- import(el, res)

You would use the limma-trend method:

#'   # (5) import from trend/limma (RNA-seq count data)
#'   # (5a) create the expression data object
#'   library(limma)
#'   keep <- filterByExpr(counts, rdesign)
#'   
> logCPM <- edgeR::cpm(counts[keep,], log=TRUE, prior.count=3)
#'   # (5b) obtain differential expression results 
> fit <- lmFit(logCPM, design)
> fit <- eBayes(fit, trend=TRUE)
> res <- topTable(fit, coef=ncol(design))

#'
#'   # (5c) import???
#'   se <- import(el, res)

@vivek-verma202
Copy link

Hi @lgeistlinger, hope your grant application went well.

I was wondering if you had a chance to look into the "limitations" of current import()?
As far as my question 1 is concerned, some reading made me realize that LFC shrinkage (apeglm) is not necessary esp. if I have already filtered out poorly expressed genes. Hence, was planning to ditch it (unless you've something to add to it).

Hope to hear from you, soon.

@lgeistlinger
Copy link
Owner

lgeistlinger commented Sep 1, 2020

Hi @vivek-verma202,

Thanks for your patience.

The application has been a sprint involving daily 12 h shifts for the last 7 days
straight - but I am pretty happy with the product. Thanks for asking.

Concerning your questions:

  1. lfcShrinkage:
    According to the DESeq2 vignette only relevant for visualization and ranking of genes when having not filtered on not sufficiently expressed genes first.
    In the EnrichmentBrowser, I typically use edgeR::filterByExpr (but you can easily apply customized thresholds as well) for that as the very first step of processing of the count matrix, ie before normalization, DE analysis, and GS analysis.
    Thus: if you've filtered before - yes, shrinkage is not needed.
    That said, I think the error that you encounter in your first post is mainly a
    result of your result object (res) not having the same number of rows (genes) as your dds. I suspect rows/genes been removed by lfcShrink?

  2. covariates:
    design ~ batch + condition should work right out of the box (can you confirm?).
    Supported experimental designs currently only include binary group comparisons with an optional blocking variable for paired samples / sample batches.
    Adding additional covariates is currently not supported and I would take this as a feature request and will invest some time to implement that.
    Note, however that only regression GS methods such as camera and roast support extended experimental designs and are able to make use of this information.
    Something that you could do right away:

design ~ age + condition if you would form here a number of discrete age groups.

(inspecting resulting DE measures against design ~ condition only would btw
give you an indication whether there are indeed age-specific effects).

  1. Downstream topology-based methods:

If you are interested in those, or in general methods that only work on the DE
results, such as eg also ora, then you can actually ignore my remarks on
experimental design above and

i) make use of the sig.stat argument that can be supplied to EnrichmentBrowser::sbea
and EnrichmentBrowser::nbea.

• beta: Log2 fold change significance level. Defaults to 1 (2-fold).

• sig.stat: decides which statistic is used for determining
significant DE genes. Options are:

        • 'p' (Default): genes with adjusted p-value below
              alpha.

        • 'fc': genes with abs(log2(fold change)) above beta

        • '&': p & fc (logical AND)

        • '|': p | fc (logical OR)

        • 'xxp': top xx % of genes sorted by adjusted p-value

        • 'xxfc' top xx % of genes sorted by absolute log2 fold
                 change.

Example: if you wanted to conduct a SPIA analysis with genes rendered differentially expressed (DE genes) that have an absolute log2 fold change > 1 and an adjusted p-value < 0.1, you would call:

spia.res <- nbea("spia", se, gs, grn, alpha = 0.1, beta = 1, sig.stat = '&')

ii) If not otherwise specified (eg via the sig.stat argument), the default point
of reference for the EnrichmentBrowser functions when it comes to gene DE measures is the ADJ.PVAL column in the rowData slot of your SummarizedExperiment (ie. your dds after using import).

se <- import(dds, res)

You can always override / abuse this column by eg setting it

rowData(se)[my.DE.genes.of.interest, "ADJ.PVAL"] <- 0.04
rowData(se)[notDE.notInteresting.genes, "ADJ.PVAL"] <- 0.06

as per default, sbea and nbea methods that work only on a list of DE genes
will take genes with ADJ.PVAL < alpha (default alpha=0.05).

Note the: xxp or xxfc options of the sig.stat argument above are
an alternative to that, allowing to select a certain percentage of genes
that you find interesting based on log2 fold change or p-value (and can
also be accordingly tweaked as above).

  1. When it comes to ranking, the raw p-value typically has better granularity than the adjusted p-value.
    Accordingly:

    res <- DESeq2::results(dds)
    sorting.df <- res[,c("pvalue","log2FoldChange")]
    sorting.df[,2] <- -abs(sorting.df[,2])
    ind <- do.call(order, as.data.frame(sorting.df)))
    

    would give you a very fine-grained order by nominal p-value, and for genes
    with equal nominal p-value it would sort by absolute fold change in decreasing order.
    You could eg use that ordering after importing via:

    se <- se[ind,] 
    

    (if that still doesn't result in a satisfying ranking, you could incorporate
    additional measures such as the testing statistic by modifying the above line to:

    sorting.df <- res[,c("pvalue","log2FoldChange", "stat")]
    

Hope that helps and don't hesitate to inquire further if some things are unclear.

@vivek-verma202
Copy link

Thank you so much, @lgeistlinger !
Here are the updates:

  1. design = ~ batch + condition worked!
    (Results were not very different with or without age, thanks to you, age is no more covariate of interest in my analysis.)

  2. I could not proceed without normalization:

res <- sbea(method = "gsea", se = se, gs = go.bp.gs, browse = T)
Error in .reorderAssays(se, assay) : 
  Expression dataset (se) does not contain an assay named "norm"

Normalization worked but excluded ~91 % of genes:

se <- normalize(se, norm.method = "vst")
Excluding 8118 genes not satisfying min.cpm threshold

The results were from DESeq2 and not from raw read counts. Considering the vignette, I am not sure if normalization is needed. Also, DESeq2 output already has normalization factors:

image

How can I make use of DESeq2's normalization factors to normalize the count slot and create "norm" slot in an SE?
(I think, it's an important consideration for the import() function.)

@vivek-verma202
Copy link

I came with a work around, not sure if it's correct, please let me know opinion:

se@assays@data@listData[["norm"]] <- se@assays@data@listData[["counts"]]/se@assays@data@listData[["normalizationFactors"]]

Looks that counts were normalized:
hist(log(se@assays@data@listData[["counts"]]))
image
hist(log(se@assays@data@listData[["counts"]]))
image

@lgeistlinger
Copy link
Owner

lgeistlinger commented Sep 1, 2020

Hi @vivek-verma202 -

se <- normalize(se, norm.method = "vst")

is the right thing todo here. I'll add an argument to normalize to switch off filtering.

There are some helpful accessor functions such as assay, colData, and rowData for a SummarizedExperiment.
See also the graphical overview of a SummarizedExperiment and how to access its main parts here.

A fast approximation of what EnrichmentBrowser::normalize() with method = "vst" does:

assay(se, "norm") <- edgeR::cpm(assay(se, "counts"))

But I agree, that's something that import should support. I'll make a dedicted issue.

@lgeistlinger
Copy link
Owner

Closing this and will continue working on the specific issues derived from the discussion here (#28 and #29). Feel free to reopen + thanks a lot for testing and feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants