
scLANE


The scLANE package enables users to accurately determine differential expression of genes over pseudotime or latent time, and to characterize genes’ dynamics using interpretable model coefficients. scLANE builds upon the marge modeling framework, allowing users to characterize their trajectory’s effects on mRNA expression using negative binomial GLMs, GEEs, or GLMMs, depending on the experimental design & biological questions of interest.

A quickstart guide on how to use scLANE with simulated data follows below, and a more detailed vignette showcasing its performance on real data can be found here.

Installation

You can install the most recent version of scLANE with:

remotes::install_github("jr-leary7/scLANE")

Usage

Our method relies on a relatively simple test to determine whether a given gene is differentially expressed (or “dynamic”) over the provided trajectory. While the exact structure of the test differs by model backend, the concept is the same: the spline-based NB GLM / GEE / GLMM is treated as the alternate model, and a null model is fit using the corresponding model backend. If the GLM backend is used, the null model is simply an intercept-only NB GLM; the GEE backend fits an intercept-only model with the same working correlation structure as the alternate model; and if the GLMM backend is used, the null model is an intercept-only model with random intercepts for each subject. The alternate hypothesis is thus that at least one of the estimated coefficients is significantly different from zero. We predict a given gene to be dynamic if the adjusted p-value of the test is less than an a priori threshold; the default threshold is 0.01, and the default adjustment method is the Holm correction.
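Below is a minimal conceptual sketch of that null-versus-alternate comparison for the GLM backend; it is not scLANE’s internal code, and the pseudotime vector, counts, and spline degrees of freedom are invented purely for illustration.

# Conceptual sketch only -- not scLANE internals; simulated pseudotime & counts.
library(MASS)     # glm.nb()
library(splines)  # ns()
set.seed(312)
pt <- runif(300)                                   # pseudotime on [0, 1]
y <- rnbinom(300, mu = exp(1 + 2 * pt), size = 2)  # counts that trend with pseudotime
null_mod <- glm.nb(y ~ 1)                          # null: intercept-only NB GLM
alt_mod <- glm.nb(y ~ ns(pt, df = 3))              # alternate: spline-based NB GLM
lrt_stat <- as.numeric(2 * (logLik(alt_mod) - logLik(null_mod)))
df_diff <- attr(logLik(alt_mod), "df") - attr(logLik(null_mod), "df")
p_val <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)  # asymptotically chi-squared
p.adjust(p_val, method = "holm")                   # Holm correction, compared to the 0.01 default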

Input Data

scLANE has a few basic inputs with standardized formats, though different modeling frameworks require different inputs; for example, the GEE & GLMM backends require a vector of subject IDs corresponding to each cell. We’ll start by simulating some scRNA-seq counts data with a subset of genes being dynamic over a trajectory using the scaffold R package. As part of our simulation studies I wrote the function sourced below to generate multi-subject scRNA-seq datasets with trajectories partially conserved between subjects.

source("https://raw.githubusercontent.com/jr-leary7/scLANE-Sims/main/R/functions_simulation.R")

We’ll also need to load a couple of dependencies & resolve a function conflict.

library(dplyr)
library(scLANE)
library(ggplot2)
library(scaffold)
library(SingleCellExperiment)
select <- dplyr::select

We’ll use the scRNAseq R package to pull the human pancreas data from Baron et al. (2016) to serve as our reference dataset. We then simulate a dataset with 3 subjects, 1200 cells allocated equally across subjects, and 10% of the genes dynamic over the simulated trajectory. We label a gene dynamic at the population level if (in this case) it’s dynamic in at least 2 of the 3 subjects.

set.seed(312)
panc <- scRNAseq::BaronPancreasData(which = "human")
sim_data <- simulate_multi_subject(ref.dataset = panc, 
                                   perc.dyn.genes = 0.1, 
                                   n.cells = 1200, 
                                   perc.allocation = rep(1/3, 3), 
                                   n.subjects = 3, 
                                   gene.dyn.threshold = 2)

The PCA embeddings show us a pretty simple trajectory that’s strongly correlated with the first principal component, and we can also see that the trajectory is conserved cleanly across subjects.

scater::plotPCA(sim_data, colour_by = "subject")

scater::plotPCA(sim_data, colour_by = "cell_time_normed")

Differential Expression Testing

Since we have multi-subject data, we can use any of the three model backends to run our DE testing. We’ll start with the simplest model, the GLM, then work our way through the other options in order of increasing complexity. We first prepare our inputs - a dense matrix with cells as rows & genes as columns (i.e., transposed from the way it’s stored in SingleCellExperiment & Seurat objects), a dataframe containing our cell ordering, and a set of genes to build models for. In practice it’s rarely necessary to fit a model for every gene in a dataset, since trajectories are generally estimated using only a subset of genes (often the few thousand most highly variable ones). For the purpose of demonstration, we’ll select 50 genes each from the dynamic and non-dynamic populations. Note: in this case we’re working with a single pseudotime lineage, though in real datasets several lineages often exist; to fit models for a subset of lineages, simply remove the corresponding columns from the cell ordering dataframe passed as input to testDynamic() (a short sketch of this follows the code below).

set.seed(312)
gene_sample <- c(sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "Dynamic"], size = 50), 
                 sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "NotDynamic"], size = 50))
counts_mat <- as.matrix(t(counts(sim_data)[gene_sample, ]))
order_df <- data.frame(X = sim_data$cell_time_normed)
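As a hypothetical illustration of the multiple-lineage note above (the column names & second lineage below are invented for the example), the cell ordering dataframe simply holds one column per lineage, and dropping a column excludes that lineage from model fitting:

# Hypothetical two-lineage ordering table; Lineage_B is a made-up stand-in.
order_df_multi <- data.frame(Lineage_A = sim_data$cell_time_normed, 
                             Lineage_B = runif(ncol(sim_data)))
order_df_A_only <- order_df_multi[, "Lineage_A", drop = FALSE]  # fit models for lineage A only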

GLM Backend

Running testDynamic() provides us with a nested list containing model output & DE test results for each gene over each pseudotime / latent time lineage. In this case, since we have a true cell ordering, we only have one lineage. Parallel processing is turned on by default, and we use 2 cores here to reduce runtime.

de_test_glm <- testDynamic(expr.mat = counts_mat, 
                           pt = order_df, 
                           genes = gene_sample, 
                           n.potential.basis.fns = 3, 
                           n.cores = 2, 
                           track.time = TRUE)
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 13.941 secs"

After the function finishes running, we use getResultsDE() to generate a sorted table of DE test results, with one row for each gene & lineage. The GLM backend uses a simple likelihood ratio test to compare the null & alternate models, with the test statistic assumed to be asymptotically chi-squared distributed.

de_res_glm <- getResultsDE(de_test_glm)
select(de_res_glm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
  slice_head(n = 5) %>% 
  knitr::kable(format = "pipe", 
               digits = 3, 
               col.names = c("Gene", "Lineage", "LRT Stat.", "P-value", "Adj. P-value", "Predicted Dynamic Status"))
| Gene | Lineage | LRT Stat. | P-value | Adj. P-value | Predicted Dynamic Status |
|--------|---------|-----------|---------|--------------|--------------------------|
| WAPAL  | A | 261.890 | 0 | 0 | 1 |
| JARID2 | A | 187.261 | 0 | 0 | 1 |
| CBX6   | A | 168.482 | 0 | 0 | 1 |
| ISOC2  | A | 156.899 | 0 | 0 | 1 |
| IDH3G  | A | 133.469 | 0 | 0 | 1 |

After creating a reference table of the ground truth status for each gene - 1 denotes a dynamic gene and 0 a non-dynamic one - and adding that binary indicator to the DE results table, we can generate some classification metrics using a confusion matrix.

gene_status_df <- data.frame(gene = gene_sample, 
                             True_Gene_Status = ifelse(rowData(sim_data)[gene_sample, ]$geneStatus_overall == "Dynamic", 1, 0))
de_res_glm <- inner_join(de_res_glm, 
                         gene_status_df, 
                         by = c("Gene" = "gene"))
caret::confusionMatrix(factor(de_res_glm$Gene_Dynamic_Overall, levels = c(0, 1)), 
                       factor(de_res_glm$True_Gene_Status, levels = c(0, 1)), 
                       positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 50  8
#>          1  0 42
#>                                           
#>                Accuracy : 0.92            
#>                  95% CI : (0.8484, 0.9648)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.84            
#>                                           
#>  Mcnemar's Test P-Value : 0.01333         
#>                                           
#>             Sensitivity : 0.8400          
#>             Specificity : 1.0000          
#>          Pos Pred Value : 1.0000          
#>          Neg Pred Value : 0.8621          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.4200          
#>    Detection Prevalence : 0.4200          
#>       Balanced Accuracy : 0.9200          
#>                                           
#>        'Positive' Class : 1               
#> 

GEE Backend

The function call is essentially the same as it was with the GLM backend, with the exception of needing to provide a sorted vector of subject IDs & a desired correlation structure. We also need to set the is.gee flag to TRUE in order to indicate that we’d like to fit estimating equations models (instead of mixed models). Since fitting GEEs is a fair bit more complex than fitting GLMs, DE testing with the GEE backend takes a bit longer. Using more cores and / or running the tests on an HPC cluster speeds things up considerably (this document was compiled on a 2020 MacBook Pro).

de_test_gee <- testDynamic(expr.mat = counts_mat, 
                           pt = order_df, 
                           genes = gene_sample, 
                           n.potential.basis.fns = 3, 
                           is.gee = TRUE, 
                           id.vec = sim_data$subject, 
                           cor.structure = "exchangeable", 
                           n.cores = 2, 
                           track.time = TRUE)
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 4.509 mins"

We again generate the table of DE test results. The variance of the estimated coefficients is determined using the sandwich estimator, and a Wald test is used to compare the null & alternate models.

de_res_gee <- getResultsDE(de_test_gee)
select(de_res_gee, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
  slice_head(n = 5) %>% 
  knitr::kable("pipe", 
               digits = 3, 
               col.names = c("Gene", "Lineage", "Wald Stat.", "P-value", "Adj. P-value", "Predicted Dynamic Status"))
| Gene | Lineage | Wald Stat. | P-value | Adj. P-value | Predicted Dynamic Status |
|--------|---------|------------|---------|--------------|--------------------------|
| JARID2 | A | 261.382 | 0 | 0 | 1 |
| RAB1B  | A | 620.454 | 0 | 0 | 1 |
| IDH3G  | A | 281.062 | 0 | 0 | 1 |
| MFSD2B | A | 2330.770 | 0 | 0 | 1 |
| PFDN2  | A | 605.955 | 0 | 0 | 1 |
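As a conceptual aside - this is not scLANE’s internal code - the sandwich-variance Wald test can be illustrated with the sandwich & lmtest packages on a simple Poisson GLM; the simulated data and spline basis below are assumptions made purely for the example.

# Conceptual sketch only: robust (sandwich) covariance + Wald test on nested models.
library(splines)   # ns()
library(sandwich)  # vcovHC()
library(lmtest)    # waldtest()
set.seed(312)
pt <- runif(300)
y <- rpois(300, lambda = exp(1 + 2 * pt))
fit_null <- glm(y ~ 1, family = poisson)               # intercept-only null model
fit_alt <- glm(y ~ ns(pt, df = 3), family = poisson)   # spline-based alternate model
waldtest(fit_null, fit_alt, vcov = vcovHC, test = "Chisq")  # Wald test using sandwich covariance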

We create the same confusion matrix for the GEEs as we did before. Empirically speaking, when the true trend doesn’t differ much between subjects, the GEEs tend to be more conservative (and thus they perform slightly worse) than the GLMs. This is shown below, where the GEEs perform fairly well, but the false negative rate is higher than that of the GLMs.

de_res_gee <- inner_join(de_res_gee, 
                         gene_status_df, 
                         by = c("Gene" = "gene"))
caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1)), 
                       factor(de_res_gee$True_Gene_Status, levels = c(0, 1)), 
                       positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 50 23
#>          1  0 27
#>                                           
#>                Accuracy : 0.77            
#>                  95% CI : (0.6751, 0.8483)
#>     No Information Rate : 0.5             
#>     P-Value [Acc > NIR] : 2.757e-08       
#>                                           
#>                   Kappa : 0.54            
#>                                           
#>  Mcnemar's Test P-Value : 4.490e-06       
#>                                           
#>             Sensitivity : 0.5400          
#>             Specificity : 1.0000          
#>          Pos Pred Value : 1.0000          
#>          Neg Pred Value : 0.6849          
#>              Prevalence : 0.5000          
#>          Detection Rate : 0.2700          
#>    Detection Prevalence : 0.2700          
#>       Balanced Accuracy : 0.7700          
#>                                           
#>        'Positive' Class : 1               
#> 

GLMM Backend

We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and it is the trickiest to interpret. We recommend using it when you’re most interested in how a trajectory differs between subjects, e.g., if the subjects can be stratified into groups such as Treatment & Control and you expect the Treatment group to progress differently through the biological process. Executing the function with the GLMM backend differs only in that we switch the is.glmm flag to TRUE and no longer need to specify a working correlation structure. Note: the GLMM backend is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly.

de_test_glmm <- testDynamic(expr.mat = counts_mat, 
                            pt = order_df, 
                            genes = gene_sample, 
                            n.potential.basis.fns = 3, 
                            is.glmm = TRUE, 
                            id.vec = sim_data$subject, 
                            n.cores = 2, 
                            track.time = TRUE)
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 5.648 mins"

Like the GLM backend, the GLMMs use a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood instead of REML in order to perform this test; the null model in this case is a negative binomial GLMM with a random intercept for each subject.
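A minimal conceptual sketch of that nested-model comparison is shown below; it is not scLANE’s internal code, and glmmTMB, the simulated subjects, and the spline basis are assumptions made purely for illustration.

# Conceptual sketch only: NB GLMM null vs. alternate fit by ML, compared with an LRT.
library(glmmTMB)  # glmmTMB(), nbinom2()
library(splines)  # ns()
set.seed(312)
sim_df <- data.frame(subject = rep(c("P1", "P2", "P3"), each = 100), 
                     pt = runif(300))
sim_df$y <- rnbinom(300, mu = exp(1 + 2 * sim_df$pt), size = 2)
null_glmm <- glmmTMB(y ~ 1 + (1 | subject), data = sim_df, family = nbinom2(), REML = FALSE)
alt_glmm <- glmmTMB(y ~ ns(pt, df = 3) + (1 | subject), data = sim_df, family = nbinom2(), REML = FALSE)
anova(null_glmm, alt_glmm)  # likelihood ratio test of the spline terms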

de_res_glmm <- getResultsDE(de_test_glmm)
select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Overall) %>% 
  slice_head(n = 5) %>% 
  knitr::kable("pipe", 
               digits = 3, 
               col.names = c("Gene", "Lineage", "LRT Stat.", "P-value", "Adj. P-value", "Predicted Dynamic Status"))
| Gene | Lineage | LRT Stat. | P-value | Adj. P-value | Predicted Dynamic Status |
|--------|---------|-----------|---------|--------------|--------------------------|
| LY6G5C | A | 7143.734 | 0 | 0 | 1 |
| MPG    | A | 380.186 | 0 | 0 | 1 |
| WAPAL  | A | 360.886 | 0 | 0 | 1 |
| FLOT2  | A | 281.259 | 0 | 0 | 1 |
| SPCS3  | A | 270.637 | 0 | 0 | 1 |

The GLMMs perform better than the GEEs, and almost as well as the GLMs. As with the GEEs, it’s more appropriate to use these more complex models when you expect expression dynamics to differ between subjects. Since the dynamics in our simulated data are strongly conserved across subjects, it follows that the simpler GLMs perform best.

de_res_glmm <- inner_join(de_res_glmm, 
                          gene_status_df, 
                          by = c("Gene" = "gene"))
caret::confusionMatrix(factor(de_res_glmm$Gene_Dynamic_Overall, levels = c(0, 1)), 
                       factor(de_res_glmm$True_Gene_Status, levels = c(0, 1)), 
                       positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 49 12
#>          1  1 38
#>                                          
#>                Accuracy : 0.87           
#>                  95% CI : (0.788, 0.9289)
#>     No Information Rate : 0.5            
#>     P-Value [Acc > NIR] : 6.565e-15      
#>                                          
#>                   Kappa : 0.74           
#>                                          
#>  Mcnemar's Test P-Value : 0.005546       
#>                                          
#>             Sensitivity : 0.7600         
#>             Specificity : 0.9800         
#>          Pos Pred Value : 0.9744         
#>          Neg Pred Value : 0.8033         
#>              Prevalence : 0.5000         
#>          Detection Rate : 0.3800         
#>    Detection Prevalence : 0.3900         
#>       Balanced Accuracy : 0.8700         
#>                                          
#>        'Positive' Class : 1              
#> 

Downstream Analysis & Visualization

The main visualization function is plotModels(). It takes as input the results from testDynamic(), as well as a few specifications for which models & lineages should be plotted. While more complex visualizations can be created from our model output, this function gives us a good first glance at which models fit the underlying trend the best. Here we show the output generated using the GLM backend, split by model type & colored by pseudotime lineage:

plotModels(de_test_glm, 
           gene = gene_sample[4], 
           pt = order_df, 
           gene.counts = counts_mat)

When plotting the models generated using the GLMM backend, we split by lineage & color the points by subject ID instead of by lineage. The GLMMs perform well here since the gene’s dynamics differ somewhat between subjects.

plotModels(de_test_glmm, 
           gene = gene_sample[2], 
           pt = order_df, 
           is.glmm = TRUE, 
           plot.null = FALSE, 
           id.vec = sim_data$subject, 
           gene.counts = counts_mat)

After generating a suitable set of models, we can cluster the genes in a semi-supervised fashion using clusterGenes(). This function uses the Leiden algorithm (the default), hierarchical clustering, or k-means, and selects the best set of clustering hyperparameters using the silhouette score. If desired, PCA can be run prior to clustering, but the default is to cluster on the fitted values themselves. We then use the results as input to plotClusteredGenes(), which generates a table of fitted values per gene, per lineage over pseudotime, along with the accompanying cluster labels.

gene_clusters <- clusterGenes(de_test_glm, clust.algo = "leiden")
gene_clust_table <- plotClusteredGenes(de_test_glm, 
                                       gene.clusters = gene_clusters, 
                                       pt = order_df, 
                                       n.cores = 2)
slice_sample(gene_clust_table, n = 5) %>% 
  knitr::kable("pipe", 
               digits = 3, 
               row.names = FALSE, 
               col.names = c("Gene", "Lineage", "Cell", "Fitted (link)", "Fitted (response)", "Pseudotime", "Cluster"))
| Gene | Lineage | Cell | Fitted (link) | Fitted (response) | Pseudotime | Cluster |
|--------|---------|------|---------------|-------------------|------------|---------|
| ELOVL2 | A | 133  | -3.999 | 0.018  | 0.332 | 3 |
| AMH    | A | 278  | -3.401 | 0.033  | 0.695 | 3 |
| TIMP1  | A | 976  | 3.113  | 22.492 | 0.440 | 2 |
| TMC6   | A | 1158 | -2.239 | 0.107  | 0.895 | 3 |
| PFDN2  | A | 814  | 0.974  | 2.649  | 0.035 | 2 |

The results can then be plotted as desired using ggplot2 or another visualization package. Upon visual inspection, the genes seem to cluster based on whether they are dynamic or static over pseudotime.

ggplot(gene_clust_table, aes(x = PT, y = FITTED, color = CLUSTER, group = GENE)) + 
  facet_wrap(~paste0("Cluster ", CLUSTER)) + 
  geom_line(alpha = 0.75) + 
  scale_y_continuous(labels = scales::label_number(accuracy = 1)) + 
  scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) + 
  labs(x = "Pseudotime", 
       y = "Fitted Values", 
       color = "Leiden\nCluster", 
       title = "Unsupervised Clustering of Gene Patterns") + 
  theme_classic(base_size = 14) + 
  guides(color = guide_legend(override.aes = list(size = 3, alpha = 1)))

Checking the true frequency of dynamic genes in each cluster seems to confirm that hypothesis:

distinct(gene_clust_table, GENE, CLUSTER) %>% 
  inner_join(gene_status_df, by = c("GENE" = "gene")) %>% 
  with_groups(CLUSTER, 
              summarise, 
              P_DYNAMIC = mean(True_Gene_Status)) %>% 
  knitr::kable("pipe", 
               digits = 3, 
               col.names = c("Leiden Cluster", "Dynamic Gene Frequency"))
| Leiden Cluster | Dynamic Gene Frequency |
|----------------|------------------------|
| 1 | 0.862 |
| 2 | 0.926 |
| 3 | 0.000 |

Conclusions & Best Practices

In general, starting with the GLM backend is probably your best bet unless you have a strong prior belief that expression trends will differ significantly between subjects. If that is the case, use the GEE backend when you’re interested in population-level estimates but worried about wrongly predicting differential expression when apparent differences are actually driven by inter-subject variation. If you’re interested in generating subject-specific estimates, then the GLMM backend should be used; take care when interpreting the fixed vs. random effects though, and consult a biostatistician if necessary.

If you have a large dataset (10,000+ cells), you should start with the GLM backend, since estimates don’t differ much between modeling methods given a high enough n. In addition, running the tests on an HPC cluster with 4+ CPUs and 64+ GB of RAM will help your computations complete swiftly. Datasets with smaller numbers of cells or fewer genes of interest can easily be analyzed in an R session on a local machine.

Contact Information

This package is developed & maintained by Jack Leary. Feel free to reach out by opening an issue or by email ([email protected]) if more detailed assistance is needed.

References

  1. Bacher, R. et al. Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization. Nucleic Acids Research (2021).

  2. Warton, D. & J. Stoklosa. A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics (2018).

  3. Nelder, J. & R. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society (1972).

  4. Liang, K. & S. Zeger. Longitudinal data analysis using generalized linear models. Biometrika (1986).

  5. Laird, N. & J. Ware. Random-effects models for longitudinal data. Biometrics (1982).
