Skip to content

Commit

Permalink
updated README & more docs tweaks
Browse files Browse the repository at this point in the history
  • Loading branch information
jr-leary7 committed Jul 15, 2023
1 parent a9efb88 commit ccaac09
Show file tree
Hide file tree
Showing 6 changed files with 38 additions and 41 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Package: scLANE
Type: Package
Title: Model gene expression dynamics with spline-based NB GLMs, GEEs, & GLMMs
Version: 0.7.0
Version: 0.7.1
Authors@R: c(person(given = "Jack", family = "Leary", email = "[email protected]", role = c("aut", "cre")),
person(given = "Rhonda", family = "Bacher", email = "[email protected]", role = c("ctb", "fnd")))
Description: This package uses truncated power basis spline models to build flexible, interpretable models of single cell gene expression over pseudotime or latent time.
Expand Down
5 changes: 5 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
# scLANE 0.7.1

* Changed input format of all functions to allow counts matrices formatted as `SingleCellExperiment` or `Seurat` objects, sparse matrices, or dense matrices.
* Updated visualization functions to reflect changes made in `ggplot2` v3.4 (mostly changing the `size` parameter in line-based geoms to be `linewidth` instead).

# scLANE 0.6.3

* Added a `NEWS.md` file to track changes to the package.
23 changes: 10 additions & 13 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -95,13 +95,12 @@ plotPCA(sim_data, colour_by = "cell_time_normed")

## Trajectory DE Testing

Since we have multi-subject data, we can use any of the three model backends to run our DE testing. We'll start with the simplest model, the GLM, then work our way through the other options in order of increasing complexity. We first prepare our inputs - a dense matrix with cells as rows & genes as columns (i.e., transposed from the way it's stored in `SingleCellExperiment` & `Seurat` objects), a dataframe containing our cell ordering, a set of genes to build models for, and a vector of per-cell size factors to be used as offsets during estimation. In reality, it's usually unnecessary to fit a model for every single gene in a dataset, as trajectories are usually estimated using a subset of the entire set of genes (usually a few thousand most highly variable genes). For the purpose of demonstration, we'll select 50 genes each from the dynamic and non-dynamic populations. **Note**: in this case we're working with a single pseudotime lineage, though in real datasets several lineages often exist; in order to fit models for a subset of lineages simply remove the corresponding columns from the cell ordering dataframe passed as input to `testDynamic()`.
Since we have multi-subject data, we can use any of the three model backends to run our DE testing. We'll start with the simplest model, the GLM, then work our way through the other options in order of increasing complexity. We first prepare our inputs - a dataframe containing our cell ordering, a set of genes to build models for, and a vector of per-cell size factors to be used as offsets during estimation. In reality, it's usually unnecessary to fit a model for every single gene in a dataset, as trajectories are usually estimated using a subset of the entire set of genes (usually a few thousand most highly variable genes). For the purpose of demonstration, we'll select 50 genes each from the dynamic and non-dynamic populations. **Note**: in this case we're working with a single pseudotime lineage, though in real datasets several lineages often exist; in order to fit models for a subset of lineages simply remove the corresponding columns from the cell ordering dataframe passed as input to `testDynamic()`.

```{r prep-data}
set.seed(312)
gene_sample <- c(sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "Dynamic"], size = 50),
sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "NotDynamic"], size = 50))
counts_mat <- as.matrix(t(counts(sim_data)[gene_sample, ]))
order_df <- data.frame(X = sim_data$cell_time_normed)
cell_offset <- createCellOffset(sim_data)
```
Expand All @@ -111,11 +110,10 @@ cell_offset <- createCellOffset(sim_data)
Running `testDynamic()` provides us with a nested list containing model output & DE test results for each gene over each pseudotime / latent time lineage. In this case, since we have a true cell ordering we only have one lineage. Parallel processing is turned on by default, and we use 2 cores here to speed up runtime.

```{r glm-models, results='hold'}
de_test_glm <- testDynamic(expr.mat = counts_mat,
de_test_glm <- testDynamic(sim_data,
pt = order_df,
genes = gene_sample,
size.factor.offset = cell_offset,
n.potential.basis.fns = 5,
n.cores = 2,
track.time = TRUE)
```
Expand Down Expand Up @@ -149,11 +147,10 @@ caret::confusionMatrix(factor(de_res_glm$Gene_Dynamic_Overall, levels = c(0, 1))
The function call is essentially the same when using the GLM backend, with the exception of needing to provide a sorted vector of subject IDs & a desired correlation structure. We also need to flip the `is.gee` flag in order to indicate that we'd like to fit estimating equations models (instead of mixed models). Since fitting GEEs is a fair bit more computationally complex than fitting GLMs, DE testing with the GEE backend takes a bit longer. Using more cores and / or running the tests on an HPC cluster (as this document is being compiled on a 2020 MacBook Pro) speeds things up considerably. Here we specify an [AR1 correlation structure](https://rdrr.io/cran/nlme/man/corAR1.html), which is the default for the GEE backend.

```{r gee-models, results='hold'}
de_test_gee <- testDynamic(expr.mat = counts_mat,
de_test_gee <- testDynamic(sim_data,
pt = order_df,
genes = gene_sample,
size.factor.offset = cell_offset,
n.potential.basis.fns = 5,
is.gee = TRUE,
id.vec = sim_data$subject,
cor.structure = "ar1",
Expand Down Expand Up @@ -185,10 +182,10 @@ caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1))

### GLMM Backend

We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you're most interested in how a trajectory differs between subjects e.g., if the subjects can be stratified by groups such as Treatment & Control and you expect the Treatment group to have a different progression through the biological process. Executing the function with the GLMM backend differs only in that we switch the `is.glmm` flag to `TRUE` and no longer need to specify a working correlation structure. **Note**: the GLMM backend is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly.
We re-run the DE tests a final time using the GLMM backend. This is the most complex model architecture we support, and is the trickiest to interpret. We recommend using it when you're most interested in how a trajectory differs between subjects e.g., if the subjects belong to groups like Treatment & Control, and you expect the Treatment group to experience a different progression through the biological process. Executing the function with the GLMM backend differs only in that we switch the `is.glmm` flag to `TRUE` and no longer need to specify a working correlation structure. **Note**: the GLMM backend is still under development, as we are working on further reducing runtime and increasing the odds of the underlying optimization process converging successfully. As such, updates will be frequent and functionality / results may shift slightly.

```{r glmm-models}
de_test_glmm <- testDynamic(expr.mat = counts_mat,
de_test_glmm <- testDynamic(sim_data,
pt = order_df,
genes = gene_sample,
size.factor.offset = cell_offset,
Expand All @@ -200,7 +197,7 @@ de_test_glmm <- testDynamic(expr.mat = counts_mat,
track.time = TRUE)
```

Like the GLM backend, the GLMM backend use a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood estimation instead of [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in order to perform this test; the null model in this case is a negative binomial GLMM with a random intercept for each subject.
Like the GLM backend, the GLMM backend uses a likelihood ratio test to compare the null & alternate models. We fit the two nested models using maximum likelihood estimation instead of [REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in order to perform this test; the null model in this case is a negative binomial GLMM with a random intercept for each subject.

```{r glmm-results}
de_res_glmm <- getResultsDE(de_test_glmm)
Expand All @@ -211,7 +208,7 @@ select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Ove
col.names = c("Gene", "Lineage", "LRT Stat.", "P-value", "Adj. P-value", "Predicted Dynamic Status"))
```

The GLMM backend performs about as well as the GEE backend. Like with the GEE backend, it's more appropriate to use these more complex models when you expect expression dynamics to differ between subjects, with the difference being that you should use the GEE backend if you're interested in population-level trends & the GLMM backend if you're interested in per-subject trends. Since the dynamics in our simulated data are strongly conserved across subjects, it follows that the simpler GLMs perform the best.
The GLMM backend performs about as well as the GEE backend. Like with the GEE backend, it's more appropriate to use these more complex models if expression dynamics might differ between subjects, with the difference being that you should use the GEE backend if you're interested in population-level trends & the GLMM backend if you're interested in per-subject trends. Since the dynamics in our simulated data are strongly conserved across subjects, it follows that the simpler GLMs perform the best.

```{r eval-glmm}
de_res_glmm <- inner_join(de_res_glmm,
Expand All @@ -232,7 +229,7 @@ We can use the `plotModels()` to visually compare different types of modeling ba
plotModels(de_test_glm,
gene = "JARID2",
pt = order_df,
expr.mat = counts_mat,
expr.mat = sim_data,
size.factor.offset = cell_offset)
```

Expand All @@ -242,7 +239,7 @@ When plotting the models generated using the GLMM backend, we split by lineage &
plotModels(de_test_glmm,
gene = "WAPAL",
pt = order_df,
expr.mat = counts_mat,
expr.mat = sim_data,
size.factor.offset = cell_offset,
is.glmm = TRUE,
plot.null = FALSE,
Expand Down Expand Up @@ -283,7 +280,7 @@ ggplot(gene_clust_table, aes(x = PT, y = FITTED, color = CLUSTER, group = GENE))
color = "Leiden\nCluster",
title = "Unsupervised Clustering of Gene Patterns") +
theme_classic(base_size = 14) +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1)))
guides(color = guide_legend(override.aes = list(linewidth = 3, alpha = 1)))
```

Checking the true frequency of dynamic genes in each cluster seems to somewhat confirm that hypothesis:
Expand Down
49 changes: 22 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,9 +160,7 @@ plotPCA(sim_data, colour_by = "cell_time_normed")
Since we have multi-subject data, we can use any of the three model
backends to run our DE testing. We’ll start with the simplest model, the
GLM, then work our way through the other options in order of increasing
complexity. We first prepare our inputs - a dense matrix with cells as
rows & genes as columns (i.e., transposed from the way it’s stored in
`SingleCellExperiment` & `Seurat` objects), a dataframe containing our
complexity. We first prepare our inputs - a dataframe containing our
cell ordering, a set of genes to build models for, and a vector of
per-cell size factors to be used as offsets during estimation. In
reality, it’s usually unnecessary to fit a model for every single gene
Expand All @@ -179,7 +177,6 @@ dataframe passed as input to `testDynamic()`.
set.seed(312)
gene_sample <- c(sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "Dynamic"], size = 50),
sample(rownames(sim_data)[rowData(sim_data)$geneStatus_overall == "NotDynamic"], size = 50))
counts_mat <- as.matrix(t(counts(sim_data)[gene_sample, ]))
order_df <- data.frame(X = sim_data$cell_time_normed)
cell_offset <- createCellOffset(sim_data)
```
Expand All @@ -193,14 +190,13 @@ have one lineage. Parallel processing is turned on by default, and we
use 2 cores here to speed up runtime.

``` r
de_test_glm <- testDynamic(expr.mat = counts_mat,
de_test_glm <- testDynamic(sim_data,
pt = order_df,
genes = gene_sample,
size.factor.offset = cell_offset,
n.potential.basis.fns = 5,
n.cores = 2,
track.time = TRUE)
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 26.588 secs"
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 26.572 secs"
```

After the function finishes running, we use `getResultsDE()` to generate
Expand Down Expand Up @@ -286,17 +282,16 @@ structure](https://rdrr.io/cran/nlme/man/corAR1.html), which is the
default for the GEE backend.

``` r
de_test_gee <- testDynamic(expr.mat = counts_mat,
de_test_gee <- testDynamic(sim_data,
pt = order_df,
genes = gene_sample,
size.factor.offset = cell_offset,
n.potential.basis.fns = 5,
is.gee = TRUE,
id.vec = sim_data$subject,
cor.structure = "ar1",
n.cores = 2,
track.time = TRUE)
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 3.849 mins"
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 3.129 mins"
```

We again generate the table of DE test results. The variance of the
Expand Down Expand Up @@ -368,19 +363,19 @@ caret::confusionMatrix(factor(de_res_gee$Gene_Dynamic_Overall, levels = c(0, 1))
We re-run the DE tests a final time using the GLMM backend. This is the
most complex model architecture we support, and is the trickiest to
interpret. We recommend using it when you’re most interested in how a
trajectory differs between subjects e.g., if the subjects can be
stratified by groups such as Treatment & Control and you expect the
Treatment group to have a different progression through the biological
process. Executing the function with the GLMM backend differs only in
that we switch the `is.glmm` flag to `TRUE` and no longer need to
specify a working correlation structure. **Note**: the GLMM backend is
still under development, as we are working on further reducing runtime
and increasing the odds of the underlying optimization process
converging successfully. As such, updates will be frequent and
functionality / results may shift slightly.
trajectory differs between subjects e.g., if the subjects belong to
groups like Treatment & Control, and you expect the Treatment group to
experience a different progression through the biological process.
Executing the function with the GLMM backend differs only in that we
switch the `is.glmm` flag to `TRUE` and no longer need to specify a
working correlation structure. **Note**: the GLMM backend is still under
development, as we are working on further reducing runtime and
increasing the odds of the underlying optimization process converging
successfully. As such, updates will be frequent and functionality /
results may shift slightly.

``` r
de_test_glmm <- testDynamic(expr.mat = counts_mat,
de_test_glmm <- testDynamic(sim_data,
pt = order_df,
genes = gene_sample,
size.factor.offset = cell_offset,
Expand All @@ -390,10 +385,10 @@ de_test_glmm <- testDynamic(expr.mat = counts_mat,
id.vec = sim_data$subject,
n.cores = 2,
track.time = TRUE)
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 5.306 mins"
#> [1] "testDynamic evaluated 100 genes with 1 lineage apiece in 3.603 mins"
```

Like the GLM backend, the GLMM backend use a likelihood ratio test to
Like the GLM backend, the GLMM backend uses a likelihood ratio test to
compare the null & alternate models. We fit the two nested models using
maximum likelihood estimation instead of
[REML](https://en.wikipedia.org/wiki/Restricted_maximum_likelihood) in
Expand All @@ -419,7 +414,7 @@ select(de_res_glmm, Gene, Lineage, Test_Stat, P_Val, P_Val_Adj, Gene_Dynamic_Ove

The GLMM backend performs about as well as the GEE backend. Like with
the GEE backend, it’s more appropriate to use these more complex models
when you expect expression dynamics to differ between subjects, with the
if expression dynamics might differ between subjects, with the
difference being that you should use the GEE backend if you’re
interested in population-level trends & the GLMM backend if you’re
interested in per-subject trends. Since the dynamics in our simulated
Expand Down Expand Up @@ -483,7 +478,7 @@ the same trend as the `scLANE` model, though the fitted trend from
plotModels(de_test_glm,
gene = "JARID2",
pt = order_df,
expr.mat = counts_mat,
expr.mat = sim_data,
size.factor.offset = cell_offset)
```

Expand All @@ -496,7 +491,7 @@ lineage & color the points by subject ID instead of by lineage.
plotModels(de_test_glmm,
gene = "WAPAL",
pt = order_df,
expr.mat = counts_mat,
expr.mat = sim_data,
size.factor.offset = cell_offset,
is.glmm = TRUE,
plot.null = FALSE,
Expand Down Expand Up @@ -557,7 +552,7 @@ ggplot(gene_clust_table, aes(x = PT, y = FITTED, color = CLUSTER, group = GENE))
color = "Leiden\nCluster",
title = "Unsupervised Clustering of Gene Patterns") +
theme_classic(base_size = 14) +
guides(color = guide_legend(override.aes = list(size = 3, alpha = 1)))
guides(color = guide_legend(override.aes = list(linewidth = 3, alpha = 1)))
```

<img src="man/figures/README-plot-clust-1.png" width="100%" />
Expand Down
Binary file modified man/figures/README-plot-models-glm-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file modified man/figures/README-plot-models-glmm-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit ccaac09

Please sign in to comment.