From 9c71305cfbf7fac6e666d68a3ecb3ec9cd09d409 Mon Sep 17 00:00:00 2001 From: stemangiola Date: Sat, 18 May 2024 16:30:42 +1000 Subject: [PATCH] drop vignettes --- vignettes/Introduction.Rmd | 92 ++ vignettes/Session_1_sequencing_assays.Rmd | 1018 +++++++++++++++++ vignettes/Session_2_Tidy_spatial_analyses.Rmd | 661 +++++++++++ vignettes/Session_3_imaging_assays.Rmd | 427 +++++++ vignettes/tidyGenomicsTranscriptomics.Rmd | 778 ------------- 5 files changed, 2198 insertions(+), 778 deletions(-) create mode 100644 vignettes/Introduction.Rmd create mode 100644 vignettes/Session_1_sequencing_assays.Rmd create mode 100644 vignettes/Session_2_Tidy_spatial_analyses.Rmd create mode 100644 vignettes/Session_3_imaging_assays.Rmd delete mode 100644 vignettes/tidyGenomicsTranscriptomics.Rmd diff --git a/vignettes/Introduction.Rmd b/vignettes/Introduction.Rmd new file mode 100644 index 0000000..6bc2b17 --- /dev/null +++ b/vignettes/Introduction.Rmd @@ -0,0 +1,92 @@ +--- +title: "Introduction to Spatial omic analyses" +author: + - Stefano Mangiola, South Australian immunoGENomics Cancer Institute^[], Walter and Eliza Hall Institute^[] + - Luciano Martellotto, Adelaide Centre for Epigenetics, South Australian immunoGENomics Cancer Institute^[] +output: rmarkdown::html_vignette +# bibliography: "`r file.path(system.file(package='spatial_omics_workshop_2024', 'bibliography'), 'bibliography.bib')`" +vignette: > + %\VignetteIndexEntry{Introduction to Spatial omic analyses} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +editor_options: + markdown: + wrap: 72 +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, cache = FALSE) +``` + +## Instructors + +**Dr. Stefano Mangiola** is leading the Computational Cancer immunology group at the South Australian immunoGENomics Cancer Institute (SAiGENCI). He uses single-cell and spatial technologies to investigate the tumor microenvironment and the immune system. Beyong data production, his focus in on the integration and modelling of large-scale single-cell data resources. He is the author of `tidytranscriptiomics` and co-leads the `tidyomics` endevour. + + +**Dr. Luciano Martelotto** is a key figure in the field of spatial omics technology. He was previously the Scientific Director at the Single Cell Core facility at Harvard University, where he demonstrated his extensive expertise and significant contributions to the fields of single cell and spatial omics technology. Currently, he heads the Martelotto Lab located at the Adelaide Centre for Epigenetics and the South Australian immunoGENomics Cancer Institute (SAiGENCI). His lab is dedicated to the development and evaluation of new tools and methodologies for single cell and spatial omics, serving as an important incubator for innovation in these technologies. The lab engages in global collaborations and provides comprehensive support from the conceptual stage through to the analysis and delivery of results. + +## Workshop goals and objectives + +### What you will learn + +- The basics of spatial profiling technologies +- Analysis and manipulation of sequencing-based spatial data. +- The basics of tidy R analyses of biological data with `tidyomics` +- How to interface `SpatialExperiment` with tidy R manipulation and visualisation +- Analysis and manipulation of imaging-based spatial data. + +## Getting started + +### Local + +You can view the material at the workshop webpage + +[here](https://tidyomics.github.io/spatialOmicsWorkshop2024/articles/main.html). + +### Installation + +```{r, eval=FALSE} + +``` + +# Session 1: Introduction to Spatial Omics + +### Objective + +Provide a foundational understanding of spatial omics, covering different technologies and the distinctions between imaging and +sequencing in experimental and analytical contexts. + +### Workshop Structure + +#### 1. Welcome and Introduction + +- Overview of the workshop. +- Goals for Day 1. + +#### 2. What is Spatial Omics? + +- Definition and significance in modern biology. +- Key applications and impact. + +#### 3. Technologies in Spatial Omics + +- Overview of different spatial omics technologies. +- Comparison of imaging-based vs sequencing-based approaches. + +#### 4. Sequencing Spatial Omics + +- Detailed comparison of methodologies. +- Experimental design considerations. +- Data analysis challenges and solutions. + +#### 5. Overview of Analysis Frameworks + +- Introduction to various analysis frameworks. +- Brief mention of 'tidy' data principles in spatial omics. + +#### 6. Wrap-Up and Q&A + +- Summarize key takeaways. +- Open floor for questions and discussions. + + diff --git a/vignettes/Session_1_sequencing_assays.Rmd b/vignettes/Session_1_sequencing_assays.Rmd new file mode 100644 index 0000000..85ba610 --- /dev/null +++ b/vignettes/Session_1_sequencing_assays.Rmd @@ -0,0 +1,1018 @@ +--- +title: "Sequencing assays" +author: + - Stefano Mangiola, South Australian immunoGENomics Cancer Institute^[], Walter and Eliza Hall Institute^[] +output: rmarkdown::html_vignette +# bibliography: "`r file.path(system.file(package='spatialOmicsWorkshop2024', 'vignettes'), 'tidyomics.bib')`" +vignette: > + %\VignetteIndexEntry{Sequencing assays} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, cache = FALSE) + +``` + + + +# Session 1: Spatial analyses of sequencing data + +### Objective + +Introduce the Bioconductor framework with a focus on the +`SpatialExperiment` package. Guide participants through installation, +data acquisition, and basic data manipulation and visualization. + +### Session Structure + +**1. Introduction to Bioconductor** + +What is Bioconductor? +Its role in spatial omics analysis. + +**Getting Started with SpatialExperiment** + +Installation of `SpatialExperiment` and dependencies. +Overview of the package's capabilities. + +**Downloading Example Dataset** + +Step-by-step guide on acquiring a sample dataset. +Explanation of dataset characteristics. + +**Data Visualization and Manipulation** + +Basic visualization techniques. +Manipulating spatial omics data using `SpatialExperiment`. + +**Excercises and Q&A** + +Q&A session for clarifications and deeper insights. + +### 1. Introduction to Bioconductor + +In this section of our R Markdown workshop, we delve into the world of Bioconductor and its significance in spatial omics analysis. Understanding Bioconductor is fundamental for anyone venturing into the complex and fascinating field of bioinformatics, especially in the analysis of spatially resolved data. + +**What is Bioconductor?** + +Bioconductor is a pioneering software project that specializes in the analysis and comprehension of genomic data. It is an open-source, open-development platform primarily based on the R statistical programming language. + +**The key features of Bioconductor include:** + +- **Extensive Range of Packages:** Bioconductor offers a vast collection of packages that are specifically designed for the analysis of high-throughput genomic data. This includes genomics, transcriptomics, proteomics and other high-throughput data domains + +- **Community-Driven Development:** Bioconductor is developed by a global community of scientists and programmers, ensuring that it stays at the forefront of genomic research and bioinformatics. + +- **Reproducible Research:** It emphasizes reproducibility and provides tools that facilitate the sharing of data, code, and research workflows. + +**Bioconductor's role in Spatial Omics Analysis** + +This is an emerging field that combines traditional omics data (like genomics, transcriptomics) with spatial context. This means understanding how the expression of genes varies across different physical locations within a tissue or cell. Here's how Bioconductor contributes to this field: + +- **Specialized Packages:** Bioconductor offers packages tailored for spatial omics analysis. These packages can handle spatially resolved transcriptomics data, enabling researchers to analyze and visualizec the spatial distribution of gene expression. + +- **Data Management:** It provides robust solutions for managing the complex and large datasets typical in spatial omics, ensuring efficient data handling and processing. + +- **Statistical Analysis Tools:** Bioconductor contains a range of statistical tools specifically designed for the unique challenges of spatial data analysis, such as accounting for spatial autocorrelation and integrating spatial and non-spatial data. + +- **Visualization Capabilities:** With its advanced graphical functionalities, Bioconductor aids in creating insightful visual representations of spatial omics data, which is crucial for understanding spatial patterns in gene expression. + +In the following sections of this workshop, we will explore practical examples and dive deeper into how Bioconductor is used in spatial omics +analyses, including hands-on coding examples. Stay tuned for an engaging journey through spatial omics with Bioconductor! + +### 2. Getting Started with SpatialExperiment + +To begin working with `SpatialExperiment`, it's essential to install the necessary packages from Bioconductor, using the `BiocManager` package. This setup ensures that you have the required tools to manage and analyze spatial omics data effectively. + +`SpatialExperiment` defines an S4 class designed to hold data from spatial-omics experiments. This class builds on the `SingleCellExperiment` by adding functionality to store and access extra information from both spot-based and cell-based platforms, such as spatial coordinates, images, and their metadata. + +Righelli et al. doi: [10.1093/bioinformatics/btac299](https://academic.oup.com/bioinformatics/article/38/11/3128/6575443?login=false) + + +```{r, echo=FALSE, out.width="700px"} +library(here) + +knitr::include_graphics(here("inst/images/spatialExperimentClass.png")) +``` + +```{r, eval=FALSE} +# Install BiocManager +if (!requireNamespace("BiocManager", quietly = TRUE)) + install.packages("BiocManager") + +# Install SpatialExperiment package +BiocManager::install("SpatialExperiment") + +``` + +### 3. Downloading Example Dataset + +Once the packages are installed, we will guide you through the process of working with spatial omics data using the `SpatialExperiment` package. + + +```{r, message=FALSE, warning=FALSE} + + +# Load the SpatialExperiment package + +# For SpatialExperiment the following might be needed +# 1. bash: module load ImageMagick/6.9.11-22 +# 2. R: devtools::install_github("ropensci/magick") + +library(SpatialExperiment) +``` + +Visualization functions for spatial transcriptomics data. + +```{r, message=FALSE, warning=FALSE} +library(ggspavis) +``` + +Utility packages for single-cell data. + +```{r, message=FALSE, warning=FALSE} +library(scuttle) +library(scater) +library(scran) +``` + + +In this section, we explore the handling and processing of spatial transcriptomics data using the `spatialLIBD` and `ExperimentHub` packages. The following R code block retrieves a specific dataset from the `ExperimentHub`, a Bioconductor project designed to manage and distribute large biological data sets. The code efficiently fetches the data, removes any existing dimensional reductions, and filters the dataset to include only selected samples. This approach is essential for analysing spatial patterns in gene expression across multiple samples, and the code below exemplifies how to manipulate these datasets in preparation for further analysis. This process is adapted from the `Banksy` package's vignette, which provides advanced methods for multi-sample spatial transcriptomics. + +Maynard and Torres et al., doi: [10.1038/s41593-020-00787-0](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8095368/) + +```{r, message=FALSE, warning=FALSE} +library(spatialLIBD) +library(ExperimentHub) + +spatial_data <- + ExperimentHub::ExperimentHub() |> + spatialLIBD::fetch_data( eh = _, type = "spe") + +# Clear the reductions +reducedDims(spatial_data) = NULL + +# Select only 3 samples +spatial_data = spatial_data[,spatial_data$sample_id %in% c("151673", "151675", "151676")] + +# Display the object +spatial_data + +``` + +From: https://bookdown.org/sjcockell/ismb-tutorial-2023/practical-session-2.html + +We shows metadata for each cell, helping understand the dataset's structure. + +```{r} +col_data = colData(spatial_data) + +knitr::kable( + head(col_data), + format = "html" +) +``` + +We access and display feature-related information from the dataset. + +```{r} +row_data = rowData(spatial_data) + +knitr::kable( + head(row_data), + format = "html" +) +``` + +Here, we perform a preliminary examination of the assay data contained within the spatial dataset. + +```{r} +assay(spatial_data)[1:20, 75:100] +``` + +### 4. Data Visualization and Manipulation + +Understanding the spatial arrangement of your data is fundamental. We'll demonstrate straightforward methods to visualize spatial data, helping you appreciate the underlying spatial patterns and distributions. + +We will use `ggpavis` package to visualise the data. + +```{r, fig.width=7, fig.height=8} +# image data +imgData(spatial_data) + +# Simple visualization of spatial data +ggspavis::plotSpots(spatial_data) + +``` + +Enhance the visualization by annotating the plots to provide more context within the spatial data. + +Layers = L1-6, white matter = WM + +```{r, fig.width=7, fig.height=8} + +# plot spots with annotation +ggspavis::plotSpots( + spatial_data, + annotate = "spatialLIBD" +) +``` + +Explore additional visualization features offered by the Visium platform. + +```{r, fig.width=7, fig.height=8} +ggspavis::plotVisium(spatial_data) +``` + +This visualization focuses on specific tissue features within the dataset, emphasizing areas of interest. + +```{r, fig.width=7, fig.height=8} +ggspavis::plotVisium( + spatial_data, + fill = "spatialLIBD", + highlight = "in_tissue" +) +``` + +### 5. Quality control and filtering + +We will use the `scater` package [McCarthy et al. 2017](https://academic.oup.com/bioinformatics/article/33/8/1179/2907823?login=true) to compute the three primary QC metrics we discussed earlier. Using the scater Package for QC Metrics: We'll apply the `scater` package to compute three primary quality control metrics. We'll also use `ggspavis` for visualization along with some custom plotting techniques. + +Previously, we visualized both on- and off-tissue spots. Moving forward, we focus on on-tissue spots for more relevant analyses. This block shows how to filter out off-tissue spots to refine the dataset. + +Source [OSTAWorkshopBioc2021](https://lmweber.org/OSTAWorkshopBioc2021/articles/Vignette03_Analysis_workflow.html) + +```{r} +## Dataset dimensions before the filtering +dim(spatial_data) +``` + +Filtering Dataset to Retain Only On-Tissue Spots: We then refine our dataset by keeping only those spots that are on-tissue, aligning with our focus for subsequent analysis. The dimensions of the dataset after filtering are displayed to confirm the changes. + +```{r} +## Subset to keep only on-tissue spots +spatial_data <- spatial_data[, colData(spatial_data)$in_tissue == 1] +dim(spatial_data) +``` + +#### Mitochondrial + +Next, we identify mitochondrial genes, as they are indicative of cell health. Cells with high mitochondrial gene expression typically indicate poor health or dying cells, which we aim to exclude. + +```{r} +## Classify genes as "mitochondrial" (is_mito == TRUE) +## or not (is_mito == FALSE) +is_gene_mitochondrial <- grepl("(^MT-)|(^mt-)", rowData(spatial_data)$gene_name) +rowData(spatial_data)$gene_name[is_gene_mitochondrial] +``` + +After identifying mitochondrial genes, we apply quality control metrics to further clean the dataset. This involves adding per-cell QC measures and setting a threshold to exclude cells with high mitochondrial transcription. + +```{r} +## Add per-cell QC metrics to spatial data using identified mitochondrial genes +spatial_data <- scater::addPerCellQC( + spatial_data, + subsets = list(mito = is_gene_mitochondrial) +) + +## Select expressed genes threshold +qc_mitochondrial_transcription <- colData(spatial_data)$subsets_mito_percent > 30 + +## Check how many spots are filtered out +table(qc_mitochondrial_transcription) +``` + +After applying the QC metrics, it’s crucial to visually assess their impact. This step involves checking the spatial pattern of the spots removed based on high mitochondrial transcription, helping us understand the distribution and quality of the remaining dataset. + +```{r, fig.width=7, fig.height=8} +## Add threshold in colData +colData(spatial_data)$qc_mitochondrial_transcription <- qc_mitochondrial_transcription + +## Check for putative spatial pattern of removed spots +plotQC( + spatial_data, + type = "spots", + discard = "qc_mitochondrial_transcription", +) + + facet_wrap(~sample_id) + +``` + +This analysis focuses on examining the distribution of library sizes across different spots. It uses a histogram and density plot to visualize the range and commonality of library sizes in the dataset. + +```{r, fig.width=7, fig.height=8} +## Density and histogram of library sizes +data.frame(colData(spatial_data)) |> + ggplot(aes(x = sum)) + + geom_histogram(aes(y = after_stat(density)), bins = 60) + + geom_density() + + scale_x_log10() + + xlab("Library size") + + ylab("Density") + + theme_classic() +``` + +#### Library size + +Setting Library Size Threshold: After examining the library sizes, a threshold is applied to identify spots with library sizes below 700, which are considered for potential exclusion from further analysis. + +```{r} +## Select library size threshold +qc_total_counts <- colData(spatial_data)$sum < 700 + +## Check how many spots are filtered out +table(qc_total_counts) +``` + +Incorporating Library Size Threshold in Dataset: This step involves adding the library size threshold to the dataset’s metadata and examining the spatial pattern of the spots that have been removed based on this criterion. + +```{r, fig.width=7, fig.height=8} + +## Add threshold in colData +colData(spatial_data)$qc_total_counts <- qc_total_counts + +## Check for putative spatial pattern of removed spots +plotQC( + spatial_data, + type = "spots", + discard = "qc_total_counts", +) + + facet_wrap(~sample_id) + + +``` + +Analyzing Gene Expression Per Spot: This analysis examines how many genes are expressed per spot, using a histogram and density plot to visualize the distribution of gene counts across the dataset. + +```{r, fig.width=7, fig.height=8} +## Density and histogram of library sizes +data.frame(colData(spatial_data) ) |> + ggplot(aes(x = detected)) + + geom_histogram(aes(y = after_stat(density)), bins = 60) + + geom_density() + + scale_x_log10() + + xlab("Number of genes with > 0 counts") + + ylab("Density") + + theme_classic() +``` + +#### Detected genes + +Setting Gene Expression Threshold: This block applies a threshold to identify spots with fewer than 500 detected genes, considering these for exclusion to ensure data quality. + +```{r} +## Select expressed genes threshold +qc_detected_genes <- colData(spatial_data)$detected < 500 +## Check how many spots are filtered out +table(qc_detected_genes) +``` + +Incorporating Gene Expression Threshold in Dataset: After setting the gene expression threshold, it is added to the dataset's metadata. The spatial pattern of spots removed based on this threshold is then examined. + +```{r, fig.width=7, fig.height=8} +## Add threshold in colData +colData(spatial_data)$qc_detected_genes <- qc_detected_genes + +## Check for putative spatial pattern of removed spots +plotQC( + spatial_data, + type = "spots", + discard = "qc_detected_genes", +) + + facet_wrap(~sample_id) + +``` + +Exploring the Relationship Between Library Size and Number of Genes Detected: This analysis explores the correlation between library size and the number of genes detected in each spot, providing insights into data quality and sequencing depth. + +```{r, fig.width=7, fig.height=8} +## Density and histogram of library sizes +data.frame(colData(spatial_data)) |> + ggplot(aes(sum, detected)) + + geom_point(shape=".") + + scale_x_log10() + + scale_y_log10() + + xlab("Library size") + + ylab("Number of genes with > 0 counts") + + theme_classic() +``` + +#### Combined filtering + +After applying all QC filters, this block combines them and stores the results in the dataset. The spatial patterns of all discarded spots are then reviewed to ensure comprehensive quality control. + +```{r, fig.width=7, fig.height=8} + +## Store the set in the object +colData(spatial_data)$discard <- qc_total_counts | qc_detected_genes | qc_mitochondrial_transcription + +## Check the spatial pattern of combined set of discarded spots +plotQC( + spatial_data, + type = "spots", + discard = "discard", +) + + facet_wrap(~sample_id) + +``` + +The final step in data preprocessing involves removing all spots identified as low-quality based on the previously applied thresholds, refining the dataset for subsequent analyses. + +```{r} +spatial_data = spatial_data[,!colData(spatial_data)$discard ] +``` + +### 6. Dimensionality reduction + +Dimensionality reduction is essential in spatial transcriptomics due to the high-dimensional nature of the data, which includes vast gene expression profiles across various spatial locations. Techniques such as PCA (Principal Component Analysis) and UMAP (Uniform Manifold Approximation and Projection) are particularly valuable. PCA helps to reduce noise and highlight the most significant variance in the data, making it simpler to uncover underlying patterns and correlations. UMAP, ofen calculated from principal components (and not directly from features) preserves both global and local data structures, enabling more nuanced visualizations of complex cellular landscapes. Together, these methods facilitate a deeper understanding of spatial gene expression, helping to reveal biological insights such as cellular heterogeneity and tissue structure, which are crucial for both basic biological research and clinical applications. + +#### Variable gene identification + +```{r, message=FALSE, warning=FALSE, fig.width=7, fig.height=8} + +genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(spatial_data)) +dec = scran::modelGeneVar(spatial_data, subset.row = genes) + + +# Visualisation +plot(dec$mean, dec$total, xlab = "Mean log-expression", ylab = "Variance") +curve(metadata(dec)$trend(x), col = "blue", add = TRUE) + +# Get top variable genes +dec = scran::modelGeneVar(spatial_data, subset.row = genes, block = spatial_data$sample_id) +hvg = scran::getTopHVGs(dec, n = 1000) + +``` + +#### PCA + +With the highly variable genes, we perform PCA to reduce dimensionality, followed by UMAP to visualize the data in a lower-dimensional space, enhancing our ability to observe clustering and patterns in the data. + +```{r, fig.width=7, fig.height=8} +spatial_data <- + spatial_data |> + scuttle::logNormCounts() |> + scater::runPCA(subset_row = hvg) + +## Check correctness - names +reducedDimNames(spatial_data) + +reducedDim(spatial_data, "PCA")[1:5, 1:5] +``` + +#### UMAP + +You can appreciate that, in this case, selecting within-sample variable genes, we do not see major batch effects across samples. We see two major pixel clusters. + +```{r, fig.width=7, fig.height=8} +set.seed(42) +spatial_data <- scater::runUMAP(spatial_data, dimred = "PCA") + +scater::plotUMAP(spatial_data, colour_by = "sample_id", point_size = 0.2) +``` + +:::: {.note} +**Exercise** + +Visualise where the two macro clusters are located spatially. We will take a very pragmatic approach and get cluster label from splitting the UMAP coordinated in two (`colData()` and `reducedDim()` will help us, see above), and then visualise it with `ggspavis`. +:::: + +```{r, fig.width=7, fig.height=8} +# Label +colData(spatial_data)$macro_cluster = reducedDim(spatial_data, "UMAP")[,"UMAP1"] > -2.5 + +# Verify +scater::plotUMAP(spatial_data, colour_by = "macro_cluster", point_size = 0.2) + +ggspavis::plotVisium( + spatial_data, + fill = "macro_cluster", + highlight = "in_tissue" +) +``` + +### 7. Clustering + +Clustering in spatial transcriptomics is crucial for understanding the intricate cellular composition of tissues. This technique groups cells/pixels based on similar gene expression profiles, enabling the identification of distinct cell types and states within a tissue's spatial context. Clustering reveals patterns of cellular organization and differentiation, and interactions in the microenvironment. + +First, we establish the number of nearest neighbors to use in the k-NN graph. This graph forms the basis for clustering, using the Walktrap algorithm to detect community structures that suggest natural groupings or clusters in the data. `buildSNNGraph` is from the `scan` package. + + +```{r} + +## Set number of Nearest-Neighbours (NNs) +k <- 10 + +## Build the k-NN graph +g_walk <- + spatial_data |> + scran::buildSNNGraph( + k = 10, + use.dimred = "PCA" + ) |> + igraph::cluster_walktrap() + +clus <- g_walk$membership +## Check how many +table(clus) +``` + +Applying Clustering Labels and Visualizing Results: After determining clusters, we apply these labels back to the spatial data and visualize the results using UMAP. This allows us to observe how the data clusters in a reduced dimension space, and further visualize how these clusters map onto the physical tissue sections for context. + +We can appreciate here that we get two main pixel clusters. + +```{r, fig.width=7, fig.height=8} +colLabels(spatial_data) <- factor(clus) + +scater::plotUMAP(spatial_data, colour_by = "label") + scale_color_brewer(palette = "Paired") +``` + +Those two clusters group the white matter from the rest of the layers. + +```{r, fig.width=7, fig.height=8} +## Plot in tissue map +ggspavis::plotSpots(spatial_data, annotate = "label") + scale_color_brewer(palette = "Paired") +``` + +As for comparison, we show the manually annotated regions. We can see that while the single cell style clustering catchers, the overall tissue, architecture, a lot of details are not retrieved. We clusters cannot faithfully recapitulate the tissue morphology. However, they might represent specific cell types within morphological regions. + +```{r, fig.width=7, fig.height=8} +## Plot ground truth in tissue map +ggspavis::plotSpots(spatial_data, annotate = "spatialLIBD", + palette = "libd_layer_colors") + +``` + +This setup ensures that the reasoning and methodology behind each step of the clustering process are clear and the visualization steps demonstrate the practical application of clustering results to spatial data analysis. + +#### Spatially-aware clustering + +This step involves initializing the computation of BANKSY neighborhood feature matrices for each sample independently. We define k_geom as 6, which corresponds to the number of first-order neighbors in a 10x Visium assay. This process is crucial for capturing local neighborhood information within the spatial data. + +[Singhal et al., 2024](https://www.nature.com/articles/s41588-024-01664-3) + +[Material source](https://bioconductor.org/packages/release/bioc/vignettes/Banksy/inst/doc/multi-sample.html) + +```{r, eval=TRUE, message=FALSE, warning=FALSE} +library(Banksy) + +# scale the counts, without log transformation +spatial_data = spatial_data |> logNormCounts(log=FALSE, name = "normcounts") +``` + +##### Highly-variable genes + +The Banksy documentation, suggest the use of `Seurat` for the detection of highly variable genes. + +```{r, message=FALSE, warning=FALSE} +library(Seurat) + +# Convert to list +spatial_data_list_for_seurat <- lapply(unique(spatial_data$sample_id), function(x) + spatial_data[, spatial_data$sample_id == x ] +) + +seu_list <- lapply(spatial_data_list_for_seurat, function(x) { + x <- as.Seurat(x, data = NULL) + NormalizeData(x, scale.factor = 5000, normalization.method = 'RC') +}) + +# Compute HVGs +hvgs <- lapply(seu_list, function(x) { + VariableFeatures(FindVariableFeatures(x, nfeatures = 2000)) +}) +hvgs <- Reduce(union, hvgs) +rm(seu_list, spatial_data_list_for_seurat) +``` + +```{r, message=FALSE, warning=FALSE} +# Convert to list +spatial_data_list <- lapply(unique(spatial_data$sample_id), function(x) + spatial_data[ + hvgs, + spatial_data$sample_id == x + ] +) + +spatial_data_list <- lapply( + spatial_data_list, + computeBanksy, # Compute the component neighborhood matrices + assay_name = "normcounts" +) + +spe_joint <- do.call(cbind, spatial_data_list) +``` + +Here, we perform PCA using the BANKSY algorithm on the joint dataset. The group argument specifies how to treat different samples, ensuring that features are scaled separately per sample group to account for variations among them. + +:::: {.note} +Note: this step takes long time +:::: + +```{r, eval=TRUE, message=FALSE, warning=FALSE} +spe_joint <- runBanksyPCA( # Run PCA on the Banskly matrix + spe_joint, + lambda = 0.2, # spatial weighting parameter. Larger values (e.g. 0.8) incorporate more spatial neighborhood + group = "sample_id", # Features belonging to the grouping variable will be z-scaled separately. + seed = 42 +) +``` + +Once the dimensional reduction is complete, we cluster the spots across all samples and use `connectClusters` to visually compare these new BANKSY clusters against manual annotations. + +```{r, eval=TRUE, message=FALSE, warning=FALSE} +spe_joint <- clusterBanksy( # clustering on the principal components computed on the BANKSY matrix + spe_joint, + lambda = 0.2, # spatial weighting parameter. Larger values (e.g. 0.8) incorporate more spatial neighborhood + resolution = 0.7, # numeric vector specifying resolution used for clustering (louvain / leiden). + seed = 42 +) +colData(spe_joint)$clust_annotation = colData(spe_joint)$Cluster + +spe_joint <- connectClusters(spe_joint) +``` + + +As an optional step, we smooth the cluster labels for each sample independently, which can enhance the visual coherence of clusters, especially in heterogeneous biological samples. + +From SpiceMix paper [Chidester et al., 2023](https://www.nature.com/articles/s41588-022-01256-z) + +```{r, eval=TRUE, message=FALSE, warning=FALSE} +spatial_data_list <- lapply( + unique(spe_joint$sample_id), + function(x) + spe_joint[, spe_joint$sample_id == x] +) + +spatial_data_list <- lapply( + spatial_data_list, + smoothLabels, + cluster_names = "clust_M0_lam0.2_k50_res0.7", + k = 6L, + verbose = FALSE +) +names(spatial_data_list) <- paste0("sample_", unique(spe_joint$sample_id)) +``` + +The raw and smoothed cluster labels are stored in the `colData` slot of each +`SingleCellExperiment` or `SpatialExperiment` object. + +```{r, eval=TRUE} +cluster_metadata = + colData(spatial_data_list$sample_151673)[, c("clust_M0_lam0.2_k50_res0.7", "clust_M0_lam0.2_k50_res0.7_smooth")] + + +knitr::kable(head(cluster_metadata), format = "html") +``` + +Using cluster comparison metrics like the adjusted Rand index (ARI) and normalized mutual information (NMI), we evaluate the performance of our clustering approach. This statistical analysis helps validate the clustering results against known labels or pathologies. + +```{r, eval=TRUE} +compareClusters(spatial_data_list$sample_151673, func = 'ARI') +``` + +Calculating and Displaying Clustering Metrics for Each Sample: We calculate the ARI and NMI for each sample to assess the consistency and accuracy of our clustering across different samples. + +```{r, eval=TRUE} +ari <- sapply(spatial_data_list, function(x) as.numeric(tail(compareClusters(x, func = "ARI")[, 1], n = 1))) +ari +``` + +```{r, eval=TRUE} +nmi <- sapply(spatial_data_list, function(x) as.numeric(tail(compareClusters(x, func = "NMI")[, 1], n = 1))) +nmi +``` + +Visualizing Clusters and Annotations on Spatial Coordinates: We utilize the ggspavis package to visually map BANKSY clusters and manual annotations onto the spatial coordinates of the dataset, providing a comprehensive visual overview of spatial and clustering data relationships. + +```{r multi-sample-spatial, eval=TRUE, fig.width=7, fig.height=8} +# Use scater:::.get_palette('tableau10medium') +library(cowplot) + +pal <- c( + "#1abc9c", "#3498db", "#9b59b6", "#e74c3c", "#34495e", + "#f39c12", "#d35400", "#7f8c8d", "#2ecc71", "#e67e22" +) + +plot_bank_smooth <- lapply(spatial_data_list, function(x) { + plotSpots(x, annotate = sprintf("%s_smooth", "clust_M0_lam0.2_k50_res0.7"), pal = pal) + + theme(legend.position = "none") + + labs(title = "BANKSY clusters") +}) + + +plot_grid(plotlist = plot_bank_smooth, ncol = 3, byrow = TRUE) + +plotSpots(spatial_data, annotate = "spatialLIBD", + palette = "libd_layer_colors") + + theme(legend.position = "none") + + labs(title = "spatialLIBD regions") +``` + + +:::: {.note} +**Exercise** + +We have applied cluster smoothing using `smoothLabels`. How much do you think this operation has affected the cluster labels. To find out, + +- Plot the non smoothed cluster +- identify the pixel that have been smoothed, and +- visualise them using `plotQC` that we have used above. + +:::: + +```{r, fig.width=7, fig.height=8} + +spe_joint <- do.call(cbind, spatial_data_list) + + + plotSpots(spe_joint, annotate = sprintf("%s", "clust_M0_lam0.2_k50_res0.7"), size = 0.8, pal = pal) + + theme(legend.position = "none") + + labs(title = "BANKSY clusters") + +``` + + + +```{r, fig.width=7, fig.height=8} + +spe_joint$has_changed = !spe_joint$clust_M0_lam0.2_k50_res0.7 == spe_joint$clust_M0_lam0.2_k50_res0.7_smooth + + + +plotQC( + spe_joint, + type = "spots", + discard = "has_changed" +) + + facet_wrap(~sample_id) + +``` + +### 8. Deconvolution of pixel-based spatial data + +https://bioconductor.org/packages/devel/bioc/vignettes/SPOTlight/inst/doc/SPOTlight_kidney.html + +#### Producing the reference for single-cell databases + +[CuratedAtlasQuery](https://stemangiola.github.io/CuratedAtlasQueryR/) is a query interface that allow the programmatic exploration and retrieval of the harmonised, curated and reannotated CELLxGENE single-cell human cell atlas. Data can be retrieved at cell, sample, or dataset levels based on filtering criteria. + +Harmonised data is stored in the ARDC Nectar Research Cloud, and most CuratedAtlasQuery functions interact with Nectar via web requests, so a network connection is required for most functionality. + +Mangiola et al., 2024 doi [doi.org/10.1101/2023.06.08.542671](https://www.biorxiv.org/content/10.1101/2023.06.08.542671v3) + +```{r, echo=FALSE, out.width="700px"} +knitr::include_graphics(here("inst/images/curated_atlas_query.png")) +``` + +```{r, message=FALSE, warning=FALSE, fig.width=7, fig.height=8} +# Get reference +library(CuratedAtlasQueryR) +library(HDF5Array) + +tmp_file_path = tempfile() + +brain_reference = + + # Query metadata across 30M cells + get_metadata() |> + + # Filter your data of interest + dplyr::filter(tissue_harmonised=="brain", disease == "normal", organism == "Mus musculus") |> + + # Collect pseudobulk as SummarizedExperiment + get_pseudobulk(cache_directory = "/vast/projects/cellxgene_curated") |> + + # Normalise for Spotlight + scuttle::logNormCounts() |> + + # Save for fast reading + HDF5Array::saveHDF5SummarizedExperiment(tmp_file_path, replace = TRUE) +``` + +```{r, message=FALSE} +library(HDF5Array) + +brain_reference = HDF5Array::loadHDF5SummarizedExperiment(tmp_file_path) + +my_metadata = colData(brain_reference) + +knitr::kable(head(my_metadata), format = "html") +``` + +These are the cell types included in our reference, and the number of pseudobulk samples we have for each cell type. + +```{r} + +table(brain_reference$cell_type_harmonised) + +``` + +These are the number of samples we have for each of the three data sets. + +```{r} + +table(brain_reference$dataset_id) +``` + +The `collection_id` can be used to gather information on the cell database. e.g. https://cellxgene.cziscience.com/collections/ + +```{r} +table(brain_reference$collection_id) +``` + + +Now, we identify the variable genes within each dataset, to not capture technical effects, and identify the union of variable genes for further analysis. + +```{r, warning=FALSE} +genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(brain_reference)) + +# Convert to list +brain_reference_list <- lapply(unique(brain_reference$dataset_id), function(x) brain_reference[, brain_reference$dataset_id == x]) + +dec = scran::modelGeneVar(brain_reference, subset.row = genes, block = brain_reference$sample_id) +hvg_CAQ = scran::getTopHVGs(dec, n = 1000) + + + +hvg_CAQ = unique( unlist(hvg_CAQ)) + +head(hvg_CAQ) +``` + +Initially, the code prepares the spatial data by setting gene names as row identifiers. + +```{r} +spatial_data_gene_name = spatial_data +rownames(spatial_data_gene_name) <- rowData(spatial_data_gene_name)$gene_name +spatial_data_gene_name = logNormCounts(spatial_data_gene_name) +``` + +We then identify and score relevant marker genes based on their expression and significance across different cell types. The result is a curated list of high-potential marker genes, organized and ready for deeper analysis and interpretation in the context of spatial gene expression patterns. + +```{r} +mgs <- scran::scoreMarkers( + brain_reference, + groups = brain_reference$cell_type_harmonised, + + # Omit mitochondrial genes and keep all the genes in spatial + subset.row = + grep("(^MT-)|(^mt-)|(\\.)|(-)", rownames(brain_reference), value=TRUE, invert=TRUE) |> + intersect(rownames(spatial_data_gene_name)) +) + +# Select the most informative markers +mgs_df <- lapply(names(mgs), function(i) { + x <- mgs[[i]] + + # Filter and keep relevant marker genes, those with AUC > 0.8 + x <- x[x$mean.AUC > 0.8, ] + + # Sort the genes from highest to lowest weight + x <- x[order(x$mean.AUC, decreasing = TRUE), ] + + # Add gene and cluster id to the dataframe + x$gene <- rownames(x) + x$cluster <- i + data.frame(x) +}) +mgs_df <- do.call(rbind, mgs_df) + +head(mgs_df) +``` + +We now utilise `SPOTlight` to deconvolve tissue composition from our independent mouse brain reference. The result is visualized through a scatter pie plot, which provides a graphical representation of the spatial distribution of cell types within a specific sample. This visualization aids in understanding the spatial heterogeneity. + +```{r, fig.width=7, fig.height=8, message=FALSE, warning=FALSE} +library(SPOTlight) + +res <- SPOTlight( + x = brain_reference |> assay("logcounts"), + y = spatial_data_gene_name |> assay("logcounts"), + groups = brain_reference$cell_type_harmonised, + group_id = "cluster", + mgs = mgs_df, + hvg = intersect(hvg_CAQ, rownames(spatial_data_gene_name)), + weight_id = "mean.AUC", + gene_id = "gene" +) + +cell_first_sample = colData(spatial_data_gene_name)$sample_id=="151673" + +plotSpatialScatterpie( + x = spatial_data_gene_name[,cell_first_sample], + y = res$mat[cell_first_sample,], + cell_types = colnames(res$mat[cell_first_sample,]), + img = FALSE, + scatterpie_alpha = 1, + pie_scale = 0.4 +) + +``` + +Let's visualise without pericyte_cell and endothelial cells, which occupy a lot of the visual spectrum. + +```{r, fig.width=7, fig.height=8} + +plotSpatialScatterpie( + x = spatial_data_gene_name[,cell_first_sample], + y = res$mat[cell_first_sample,-c(2,9)], + cell_types = colnames(res$mat[cell_first_sample,-c(2,9)]), + img = FALSE, + scatterpie_alpha = 1, + pie_scale = 0.4 +) + +``` +Let's also exclude without muscle_cell, which occupy a lot of the visual spectrum. + +```{r, fig.width=7, fig.height=8} + +plotSpatialScatterpie( + x = spatial_data_gene_name[,cell_first_sample], + y = res$mat[cell_first_sample,-c(2, 9, 5)], + cell_types = colnames(res$mat[cell_first_sample,-c(2, 9, 5)]), + img = FALSE, + scatterpie_alpha = 1, + pie_scale = 0.4 +) + +``` + +No, let's look at the correlation matrices to see which cell type are most often occurring rather than mutually exclusive within our data set. + +```{r, fig.width=7, fig.height=8} + +plotCorrelationMatrix(res$mat) +``` + +#### Excercise + +Rather than looking at the correlation matrix, overall, let's observe whether the correlation structure amongst cell types is consistent across samples. Do you think it's consistent or noticably different? + +```{r, fig.width=7, fig.height=8} + +res_spatialLIBD = split(data.frame(res$mat), colData(spatial_data_gene_name)$sample_id ) + +lapply(res_spatialLIBD, function(x) plotCorrelationMatrix(as.matrix(x[,-10]))) + +``` + +Now let's observe whether the correlation structure is consistent across spatial regions, irrespectively of the sample of origin. Do you think they are consistent or noticably different? + +```{r, fig.width=7, fig.height=8} + +res_spatialLIBD = split(data.frame(res$mat), colData(spatial_data_gene_name)$spatialLIBD ) + +lapply(res_spatialLIBD, function(x) plotCorrelationMatrix(as.matrix(x[,-10]))) +``` + +Some of the most positive correlations involve the end of cells with Oligodendrocytes and Leptomeningeal cells. + +Leptomeningeal cells refer to the cells that make up the leptomeninges, which consist of two of the three layers of the meninges surrounding the brain and spinal cord: the arachnoid mater and the pia mater. These layers play a critical role in protecting the central nervous system and assisting in various physiological processes. + +Oligodendrocytes are a type of glial cell in the central nervous system (CNS) of vertebrates, including humans and mouse. These cells are crucial for the formation and maintenance of the myelin sheath, a fatty layer that encases the axons of many neurons. + +Hello, let's try to visualise the pixel where these cell types most occur. + +```{r, fig.width=7, fig.height=8} + +mat_df = as.data.frame(res$mat) + +is_endothelial_leptomeningeal = mat_df$endothelial_cell >0.1 & mat_df$leptomeningeal_cell>0.1 & mat_df$endothelial_cell + mat_df$leptomeningeal_cell > 0.4 +is_endothelial_oligodendrocytes = mat_df$endothelial_cell >0.1 & mat_df$oligodendrocyte>0.05 & mat_df$endothelial_cell + mat_df$oligodendrocyte > 0.4 + +spatial_data$is_endothelial_leptomeningeal = is_endothelial_leptomeningeal +spatial_data$is_endothelial_oligodendrocyte = is_endothelial_oligodendrocytes + +plotSpots(spatial_data, annotate = "is_endothelial_leptomeningeal") + + scale_color_manual(values = c("TRUE"= "red", "FALSE" = "grey")) +theme(legend.position = "none") + + labs(title = "endothelial + leptomeningeal") + +plotSpots(spatial_data, annotate = "is_endothelial_oligodendrocyte") + + scale_color_manual(values = c("TRUE"= "blue", "FALSE" = "grey")) +theme(legend.position = "none") + + labs(title = "endothelial + oligodendrocyte") + +``` + +**Session Information** + +```{r} +sessionInfo() +``` + +**References** + +```{css echo=FALSE} +.note { + margin: 30px; + padding: 1em; + background: #FFF8F0; + border: 1px solid #EFE8E0; + border-radius: 10px; +} +``` \ No newline at end of file diff --git a/vignettes/Session_2_Tidy_spatial_analyses.Rmd b/vignettes/Session_2_Tidy_spatial_analyses.Rmd new file mode 100644 index 0000000..8bd736a --- /dev/null +++ b/vignettes/Session_2_Tidy_spatial_analyses.Rmd @@ -0,0 +1,661 @@ +--- +title: "Tidy spatial analyses" +author: + - Stefano Mangiola, South Australian immunoGENomics Cancer Institute^[], Walter and Eliza Hall Institute^[] +output: rmarkdown::html_vignette +# bibliography: "`r file.path(system.file(package='spatialOmicsWorkshop2024', 'vignettes'), 'tidyomics.bib')`" +vignette: > + %\VignetteIndexEntry{Tidy spatial analyses} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +editor_options: + markdown: + wrap: 72 +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, cache = FALSE) +library(here) +``` + + + +# Session 2: Tidying spatial data + +A good introduction of `tidyomics` can be found here + +[tidyomicsWorkshopBioc2023](https://github.com/tidyomics/tidyomicsWorkshopBioc2023) +[tidy transcriptomic manifesto](https://tidyomics.github.io/tidyomicsBlog/post/2021-07-07-tidy-transcriptomics-manifesto/) + +`tidyomics` is an interoperable software ecosystem that bridges Bioconductor and the tidyverse. `tidyomics` is installable with a single homonymous meta-package. This ecosystem includes three new packages: tidySummarizedExperiment, tidySingleCellExperiment, and tidySpatialExperiment, and five publicly available R packages: `plyranges`, `nullranges`, `tidyseurat`, `tidybulk`, `tidytof`. Importantly, `tidyomics` leaves the original data containers and methods unaltered, ensuring compatibility with existing software, maintainability and long-term Bioconductor support. + +`tidyomics` is presented in "The tidyomics ecosystem: Enhancing omic data analyses" [Hutchison and Keyes et al., 2024](https://www.biorxiv.org/content/10.1101/2023.09.10.557072v1) + +```{r, echo=FALSE, out.width="700px"} +knitr::include_graphics(here("inst/images/tidyomics.png")) +``` + +[Slides](https://docs.google.com/gview?url=https://raw.githubusercontent.com/tidytranscriptomics-workshops/LoveMangiola2022_tidytranscriptomics/master/inst/LoveMangiola2022_tidytranscriptomics.pdf) + + + +Let's load the libraries needed for this session + +```{r, message = FALSE} +library(SpatialExperiment) + +# Tidyverse +library(ggplot2) +library(plotly) +library(dplyr) +library(tidyr) +library(purrr) +library(glue) +library(stringr) + +# Plotting +library(colorspace) +library(dittoSeq) +library(ggspavis) + +# Analysis +library(scuttle) +library(scater) +library(scran) + +``` + +Similarly to **Section 2**, this section uses `spatialLIBD` and `ExperimentHub` packages to gather spatial transcriptomics data. + +doi: [10.1038/s41593-020-00787-0](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8095368/) + + +```{r, message = FALSE} +# From https://bioconductor.org/packages/devel/bioc/vignettes/Banksy/inst/doc/multi-sample.html +library(spatialLIBD) +library(ExperimentHub) + +spatial_data <- + ExperimentHub::ExperimentHub() |> + spatialLIBD::fetch_data( eh = _, type = "spe") + +# Clear the reductions +reducedDims(spatial_data) = NULL + +# Display the object +spatial_data +``` + +### 1. tidySpatialExperiment package + +`tidySpatialExperiment` provides a bridge between the `SpatialExperiment` single-cell package and the tidyverse [@wickham2019welcome]. It creates an invisible layer that enables viewing the `SpatialExperiment` object as a tidyverse tibble, and provides `SpatialExperiment`-compatible `dplyr`, `tidyr`, `ggplot` +and `plotly` functions. + +If we load the `tidySpatialExperiment` package and then view the single cell data, it now displays as a tibble. + +```{r message = FALSE} +library(tidySpatialExperiment) + +spatial_data +``` + +#### Data interface, display + +If we want to revert to the standard SpatialExperiment view we can do that. + +```{r} +options("restore_SpatialExperiment_show" = TRUE) +spatial_data +``` + +If we want to revert back to tidy SpatialExperiment view we can. + +```{r} +options("restore_SpatialExperiment_show" = FALSE) +spatial_data +``` + +#### Original behaviour is preserved + +The tidy representation behaves exactly as a native `SpatialExperiment`. It can be interacted with using [SpatialExperiment commands](https://www.bioconductor.org/packages/release/bioc/vignettes/SpatialExperiment/inst/doc/SpatialExperiment.html) +such as `assays`. + +```{r} +assays(spatial_data) +``` + +### 2. Tidyverse commands + +We can also interact with our object as we do with any tidyverse tibble. We can use `tidyverse` commands, such as `filter`, `select` and `mutate` to explore the `tidySpatialExperiment` object. Some examples are shown below and more can be seen at the `tidySpatialExperiment` [website](https://stemangiola.github.io/tidySpatialExperiment/articles/introduction.html#tidyverse-commands-1). + +#### Filter + +We can use `filter` to choose rows, for example, to select our three samples we are going to work with. + + +```{r} +spatial_data = + spatial_data |> + filter(sample_id %in% c("151673", "151675", "151676")) + +spatial_data +``` + +In comparison the base-R method recalls the variable multiple times + +```{r, eval=FALSE} +spatial_data = spatial_data[,spatial_data$sample_id %in% c("151673", "151675", "151676")] +``` + +Or for example, to see just the rows for the cells in G1 cell-cycle stage. + +```{r} +spatial_data |> dplyr::filter(spatialLIBD == "L1") +``` + +:::: {.note} +Note that **rows** in this context refers to rows of the abstraction, not **rows** of the SpatialExperiment which correspond to genes **tidySpatialExperiment** prioritizes cells as the units of observation in the abstraction, while the full dataset, including measurements of expression of all genes, is still available "in the background". +:::: + +#### Select + +We can use `select` to view columns, for example, to see the filename, total cellular RNA abundance and cell phase. + +If we use `select` we will also get any view-only columns returned, such as the UMAP columns generated during the preprocessing. + +```{r} +spatial_data |> select(.cell, sample_id, in_tissue, spatialLIBD) +``` + +#### Mutate + +We can use `mutate` to create a column. For example, we could create a new `Phase_l` column that contains a lower-case version of `Phase`. + +In this case, three columns that are view only (`sample_id`, `pxl_col_in_fullres`, `pxl_row_in_fullres`, `PC*`) will be always included in the tidy representation because they cannot be omitted from the data container (is opposed to metadata) + +```{r message=FALSE} +spatial_data |> + mutate(spatialLIBD_lower = tolower(spatialLIBD)) |> + select(.cell, spatialLIBD, spatialLIBD_lower) +``` + +We can use tidyverse commands to polish an annotation column. We will extract the sample, and group information from the file name column into separate columns. + +```{r message=FALSE} + +# Simulate file path +spatial_data = spatial_data |> mutate(file_path = glue("../data/single_cell/{sample_id}/outs/raw_feature_bc_matrix/")) + + +# First take a look at the file column +spatial_data |> select(.cell, file_path) +``` + +#### Extract + +Extract specific identifiers from complex data paths, simplifying the dataset by isolating crucial metadata. This process allows for clearer identification of samples based on their file paths, improving data organization. + +```{r} +# Create column for sample +spatial_data <- spatial_data |> + # Extract sample ID from file path and display the updated data + tidyr::extract(file_path, "sample_id_from_file_path", "data/spatial/([a-zA-Z0-9_-]+)/outs.+", remove = FALSE) + +# Take a look +spatial_data |> select(.cell, sample_id_from_file_path, everything()) +``` + +#### Unite + +We could use tidyverse `unite` to combine columns, for example to create a new column for sample id combining the sample and subject id +(BCB) columns. + +```{r message=FALSE} +spatial_data <- spatial_data |> unite("sample_subject", sample_id, subject, remove = FALSE) + +# Take a look +spatial_data |> select(.cell, sample_id, sample_subject, subject) +``` + + +### 3. Advanced filtering/gating and pseudobulk + +`tidySpatialExperiment` provide a interactive advanced tool for gating region of interest for streamlined exploratory analyses. + +This capability is powered by `tidygate`. We show how you can visualise your data and manually drawing gates to select one or more regions of interest using an intuitive tidy grammar. From https://bioconductor.org/packages/devel/bioc/vignettes/tidySpatialExperiment/inst/doc/overview.html + +Let's draw an arbitrary gate interactively + +```{r, eval=FALSE} + +spatial_data = + spatial_data |> + + # Filter one sample + filter(in_tissue, sample_id=="151673") |> + + # Gate based on tissue morphology + tidySpatialExperiment::gate_spatial(opacity = 0.2, image_index = 1) +``` + +This is recorded in the `.gate` column + +```{r, eval=FALSE} + +spatial_data |> select(.cell, .gate) +``` + +We can count how many pixels we selected with simple `tidyverse` grammar + +```{r, eval=FALSE} +spatial_data |> count(.gate) +``` + +We can visualise the gating + + +```{r, eval=FALSE} +spatial_data |> + + # Plot our gate + plotSpots( + annotate = ".gate", + palette = "libd_layer_colors" + ) + + labs(title = ".gate regions") + +``` + +```{r, echo=FALSE, out.width="300px"} +knitr::include_graphics(here("inst/images/tidySPE_gate.png")) +``` + +And filter, for further analyses + +```{r, eval=FALSE} +spatial_data |> + filter(.gate) +``` + +#### Summarisation/aggregation + +The gated cells can then be divided into pseudobulks within a SummarizedExperiment object using tidySpatialExperiment’s aggregate_cells utility function. + +```{r , eval=FALSE} +spe_regions_aggregated <- + spe_regions |> + aggregate_cells(c(sample_id, region)) + +spe_regions_aggregated +``` + + +### 4. tidyfying your workflow + +We will take workflow used in **Session 2**, performed using mostly base R syntax and convert it to tidy R syntax. We will show you how the readability and modularity of your workflow will improve. + +#### Subset to keep only on-tissue spots. + +**Base R approach:** + +```{r, eval=FALSE} +spatial_data <- spatial_data[, colData(spatial_data)$in_tissue == 1] +``` + +**Tidyverse Approach:** + +```{r} +spatial_data <- + spatial_data |> + filter(in_tissue == 1) +``` + +**Specific Differences and Advantages:** + +The `tidyverse` `filter()` function clearly states the intent to filter the dataset, whereas the Base R approach uses subsetting which might not be immediately clear to someone unfamiliar with the syntax. + +The `tidyverse` approach inherently supports chaining further operations without manually checking dimensions, assuming that users trust the operation to behave as expected. + +#### Manipulating feature information + +:::: {.note} +For `SingleCellExperiment` there is no tidy API for manipulating feature wise data yet, on the contrary for `SummarizedExperiment`, because gene-centric the abstraction allow for direct gene information manipulation. Currently, `tidySingleCellExperiment` and `tidySpatialExperiment` do not prioritize the manipulation of features (genes). + +While these functions can employ genes for cell manipulation and visualisation, as demonstrated in `join_features()`, they lack tools for altering feature-related information. Instead, their primary focus is on cell information, which serves as the main observational unit in single-cell data. This contrasts with bulk RNA sequencing data, where features are more central. +:::: + +The tidy API for `SingleCellExperiment` has feature-manipulation API among our plans. See [tidyomics challenges](https://github.com/orgs/tidyomics/projects/1) + +**Base R approach:** + +```{r} +is_gene_mitochondrial <- grepl("(^MT-)|(^mt-)", rowData(spatial_data)$gene_name) +rowData(spatial_data)$gene_name[is_gene_mitochondrial] +``` + +#### Quality Control: + +Apply quality control measures to exclude cells based on mitochondrial content and read/gene count, a common indicator of cell health and viability. + +**Base R approach:** + +```{r, eval=FALSE} +spatial_data <- addPerCellQC(spatial_data, subsets = list(mito = is_gene_mitochondrial)) + +## Select expressed genes threshold +qc_mitochondrial_transcription <- colData(spatial_data)$subsets_mito_percent > 30 +colData(spatial_data)$qc_mitochondrial_transcription <- qc_mitochondrial_transcription + +``` + +**Tidyverse Approach:** + +```{r} + +spatial_data <- + spatial_data |> + + # Add QC + addPerCellQC(subsets = list(mito = is_gene_mitochondrial)) |> + + ## Add threshold in colData + mutate( + qc_mitochondrial_transcription = subsets_mito_percent > 30 + ) + +spatial_data + +``` + +**Specific Differences and Advantages:** + +`tidyverse` pipelines these operations without storing intermediate results, directly updating the dataset. Base R separates these steps, requiring manual tracking of variables and updating the dataset in multiple steps, increasing complexity and potential for errors. + +Direct Data Mutation: Tidyverse directly mutates the dataset within the pipeline, whereas Base R extracts, computes, and then reassigns values, which can be more verbose and less efficient in terms of workflow clarity and execution. + +#### Group-specific analyses + +**Base R approach:** + +```{r, eval=FALSE, fig.width=7, fig.height=8} +# get gene for subset +genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(spatial_data)) + +# Convert to list +spatial_data_list <- lapply(unique(spatial_data$sample_id), function(x) spatial_data[, spatial_data$sample_id == x]) + +# Detect sample-specific hughly-variable genes +marker_genes = + lapply( spatial_data_list, + function(x){ + dec = scran::modelGeneVar(x, subset.row = genes) + scran::getTopHVGs(dec, n = 1000) + } + ) + +head(unique(unlist(marker_genes))) + +``` + +**Tidyverse Approach: group_split** + +```{r, fig.width=7, fig.height=8} +# get gene for subset +genes <- !grepl(pattern = "^Rp[l|s]|Mt", x = rownames(spatial_data)) + +marker_genes = + spatial_data |> + group_split(sample_id) |> + map(~ + .x |> + scran::modelGeneVar(subset.row = genes) |> + scran::getTopHVGs(n = 1000) + ) |> + reduce(union) + +marker_genes |> head() +``` + +**Tidyverse Approach: nest** + +```{r, fig.width=7, fig.height=8} + +spatial_data |> + nest(sample_data = -sample_id) |> + mutate(marker_genes = map(sample_data, ~ + .x |> + scran::modelGeneVar(subset.row = genes) |> + scran::getTopHVGs(n = 1000) + )) + +``` + + + +**Specific Differences and Advantages:** + +`tidyverse` neatly handles grouping and plotting within a single chain, using `nest()` or `group_split()` and `map()` for compartmentalized operations, which organizes the workflow into a coherent sequence. + +tidyverse's `map()` is a powerful functional language tool, which can return arbitrary types, such as `map_int`, `map_char`, `map_lgl`.It is integrated into the data manipulation workflow, making it part of the data pipeline. + +#### Multi-parameter filtering + +**Base R approach:** + +```{r, eval=FALSE} +## # Mitochondrial transcription +qc_mitochondrial_transcription <- colData(spatial_data)$subsets_mito_percent > 30 +colData(spatial_data)$qc_mitochondrial_transcription <- qc_mitochondrial_transcription + +# ## Select library size threshold +qc_total_counts <- colData(spatial_data)$sum < 700 +colData(spatial_data)$qc_total_counts <- qc_total_counts + +# ## Select expressed genes threshold +qc_detected_genes <- colData(spatial_data)$detected < 500 +colData(spatial_data)$qc_detected_genes <- qc_detected_genes + +# ## Find combination to filter +colData(spatial_data)$discard <- qc_total_counts | qc_detected_genes | qc_mitochondrial_transcription + +# # Filter +spatial_data = spatial_data[,!colData(spatial_data)$discard ] +``` + +**Tidyverse Approach:** + +```{r} + +spatial_data_filtered = + spatial_data |> + mutate( + discard = + subsets_mito_percent > 30 | + sum < 700 | + detected < 500 + ) |> + filter(!discard) +``` + +**Specific Differences and Advantages:** + +**Tidyverse:** The code directly applies multiple filtering conditions within a single filter() function, making it highly readable and concise. The conditions are clearly laid out, and the operation directly modifies the spatial_data dataframe. This approach is more intuitive for managing complex filters as it condenses them into a singular functional expression. + +**Base R:** The approach first calculates each condition and stores them within the colData of the dataset. These conditions are then combined to create a logical vector that flags rows to discard. Finally, it subsets the data by removing rows that meet any of the discard conditions. This method is more verbose and requires manually handling intermediate logical vectors, which can introduce errors and complexity in tracking multiple data transformations. + +**Why tidyverse might be better in this context:** + +**Coding efficiency:** `tidyverse` chains operations, reducing the need for intermediate variables and making the code cleaner and less error-prone. + +**Readability:** The filter conditions are all in one place, which simplifies understanding what the code does at a glance, especially for users familiar with the tidyverse syntax. + +**Maintainability:** Fewer and self-explanatory lines of code and no need for intermediate steps make the code easier to maintain and modify, especially when conditions change or additional filters are needed. + + +### 5. Visualisation + +Here, we will show how to use ad-hoc spatial visualisation, as well as `ggplot` to explore spatial data we will show how `tidySpatialExperiment` allowed to alternate between tidyverse visualisation, and any visualisation compatible with `SpatialExperiment`. + +#### Ad-hoc visualisation: Plotting the regions + +Let’s visualise the regions that spatialLIBD labelled across three Visium 10X samples. + +```{r} +spatial_data_filtered |> + plotSpots( + annotate = "spatialLIBD", + palette = "libd_layer_colors" + ) + + theme(legend.position = "none") + + labs(title = "spatialLIBD regions") +``` + +#### Custom visualisation: Plotting the regions + + + +```{r} +spatial_data |> + ggplot(aes(array_row, array_col)) + + geom_point(aes(color = spatialLIBD)) + + facet_wrap(~sample_id) + + theme(legend.position = "none") + + labs(title = "spatialLIBD regions") +``` + +#### Custom visualisation: Plotting RNA output + +Now, let's observe what is the difference in total transcriptional cell output across regions. We can appreciate that different regions of these Visium slide is characterised by significantly different total RNA output. For example, the region one has a low R&D output, on the contrary regions to an L3, characterised by a high RNA output. + +We could conclude that when we use thresholding to filter "low-quality" pixels we have to be careful about possible biological and spatial effects. + +```{r, fig.width=7, fig.height=4} + +spatial_data_filtered |> + ggplot(aes(sum_umi, color = spatialLIBD)) + + geom_density() + + facet_wrap(~sample_id) + + scale_color_manual(values = libd_layer_colors |> str_remove("ayer")) + + scale_x_log10() + + theme_bw() + +``` + +We provide another example of how the use of tidy. Spatial experiment makes custom visualisation, very easy and intuitive, leveraging `ggplot` functionalities. We will observe the relationship between mitochondrial transcription percentage, and total gene counts. We expect this relationship to be inverse as cells with higher mitochondrial transcription percentage tent to have a more limited transcriptional gene pool (e.g. for dieying or damaged cells). + +```{r, fig.width=7, fig.height=8} + +spatial_data_filtered |> + ggplot(aes(subsets_mito_percent, sum_gene)) + + geom_point(aes(color = spatialLIBD), size=0.2) + + stat_ellipse(aes(group = spatialLIBD), alpha = 0.3) + + scale_color_manual(values = libd_layer_colors |> str_remove("ayer")) + + scale_y_log10() + + theme_bw() + +``` + +Interestingly, if we plot the correlation between these two quantities we observe heterogeneity among regions, with L1 showing a very small association. + + +```{r, fig.width=7, fig.height=8} + +spatial_data_filtered |> + ggplot(aes(subsets_mito_percent, sum_gene)) + + geom_point(aes(color = spatialLIBD), size=0.2) + + scale_color_manual(values = libd_layer_colors |> str_remove("ayer")) + + geom_smooth(method="lm") + + facet_wrap(~spatialLIBD) + + scale_y_log10() + + theme_bw() + +``` + +Let's take a step further and group the correlations according to samples, to see whether different samples show different correlations. + +```{r, fig.width=7, fig.height=8} + +spatial_data_filtered |> + ggplot(aes(subsets_mito_percent, sum_gene)) + + geom_point(aes(color = spatialLIBD), size=0.2) + + scale_color_manual(values = libd_layer_colors |> str_remove("ayer")) + + geom_smooth(aes(group = sample_id), method="lm") + + facet_wrap(~spatialLIBD) + + scale_y_log10() + + theme_bw() + +``` + +As you can appreciate, the relationship between the number of genes, probed Purcell and their mitochondrial prescription abundance it's quite consistent. + +:::: {.note} +**Excercise** + +To to practice the use of `tidyomics` on spatial data, we propose a few exercises that connect manipulation, calculations and visualisation. These exercises are just meant to be simple use cases that exploit tidy R streamlined language. + + +We assume that the cells we filtered as non-alive or damaged, characterised by being reached uniquely for mitochondrial, genes, and genes, linked to up ptosis. it is good practice to check these assumption. This exercise aims to estimate what genes are differentially expressed between filtered and unfiltered cells. Then visualise the results + +Use `tidyomic`s/`tidyverse` tools to label dead cells and perform differential expression within each region. Some of the comments you can use are: `mutate`, `nest`, `aggregate_cells`. +:::: + +**Solution** + +```{r} +library(tidySummarizedExperiment) +library(tidybulk) + +differential_analysis = + spatial_data |> + mutate( + dead = + + # Stringent threshold + subsets_mito_percent > 20 | + sum < 700 | + detected < 500 + ) |> + aggregate_cells(c(sample_id, spatialLIBD, dead)) |> + keep_abundant(factor_of_interest = c(dead)) |> + nest(data = - spatialLIBD) |> + + # filter regions having both alive and dead cells + filter( map_int(data, ~ .x |> distinct(sample_id, dead) |> nrow() ) == 6 ) |> + mutate(data = map( + data, + test_differential_abundance, + ~ dead + sample_id, + method = "edgeR_quasi_likelihood", + test_above_log2_fold_change = log(2) + )) + +differential_analysis |> + mutate(data = map(data, pivot_transcript)) |> + unnest(data) |> + filter(FDR<0.05) + +# tidybulk::test_differential_abundance(~ dead + sample_id + (1 | spatialLIBD), method = "glmmseq_lme4") +``` + +**Session Information** + +```{r} +sessionInfo() +``` + +**References** + +```{css echo=FALSE} +.note { + margin: 30px; + padding: 1em; + background: #FFF8F0; + border: 1px solid #EFE8E0; + border-radius: 10px; +} +``` \ No newline at end of file diff --git a/vignettes/Session_3_imaging_assays.Rmd b/vignettes/Session_3_imaging_assays.Rmd new file mode 100644 index 0000000..733a8fa --- /dev/null +++ b/vignettes/Session_3_imaging_assays.Rmd @@ -0,0 +1,427 @@ +--- +title: "Imaging assays (tidy)" +author: + - Stefano Mangiola, South Australian immunoGENomics Cancer Institute^[], Walter and Eliza Hall Institute^[] + - Luciano Martellotto, Adelaide Centre for Epigenetics, South Australian immunoGENomics Cancer Institute^[] +output: rmarkdown::html_vignette +# bibliography: "`r file.path(system.file(package='spatialOmicsWorkshop2024', 'vignettes'), 'tidyomics.bib')`" +vignette: > + %\VignetteIndexEntry{Imaging assays (tidy)} + %\VignetteEncoding{UTF-8} + %\VignetteEngine{knitr::rmarkdown} +editor_options: + markdown: + wrap: 72 +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE, cache = FALSE) +``` + +# Session 3 – Spatial analyses of imaging data + +### 1. Working with imaging-based data in Bioconductor with MoleculeExperiment + +```{r} +# https://bioconductor.org/packages/devel/data/experiment/vignettes/SubcellularSpatialData/inst/doc/SubcellularSpatialData.html +# BiocManager::install("stemangiola/SubcellularSpatialData") + +library(ExperimentHub) +library(scater) +library(scran) +library(scuttle) +library(tidySingleCellExperiment) +library(tidySummarizedExperiment) +library(tidySpatialExperiment) +library(ggplot2) +library(dplyr) +library(purrr) +library(tibble) +library(ggalt) +library(magrittr) +library(stringr) +library(RColorBrewer) +library(ggspavis) +library(here) +``` + +#### SubcellularSpatialData + +This data package contains annotated datasets localized at the sub-cellular level from the STOmics, Xenium, and CosMx platforms, as analyzed in the publication by [Bhuva et al., 2024](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03241-7). It includes raw transcript detections and provides functions to convert these into `SpatialExperiment` objects. + +```{r, eval=FALSE} +library(SubcellularSpatialData) +eh = ExperimentHub(cache = "/vast/scratch/users/mangiola.s") +query(eh, "SubcellularSpatialData") + +# Brain Mouse data +tx = eh[["EH8230"]] +tx |> filter(sample_id=="Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs") |> nrow() +# 62,744,602 +``` + +From this large dataset, we select a small reagion for illustrative purposes + +```{r, eval=FALSE} +tx_small_region = + tx |> + filter(x |> between(3700, 4200), y |> between(5000, 5500)) +``` + +Load the pre-saved data + +```{r} +tx_small_region = spatialOmicsWorkshop2024::tx_small_region +``` + +#### An overview of the data + + +```{r, fig.width=7, fig.height=8} +# tx_small = tx[sample(seq_len(nrow(tx)), size = nrow(tx)/500),] +# tx_small |> saveRDS("xenium_SubcellularSpatialData.rds", compress = "xz") + +tx_small = spatialOmicsWorkshop2024::tx_small + + +tx_small |> + ggplot(aes(x, y, colour = region)) + + geom_point(pch = ".") + + facet_wrap(~sample_id, ncol = 2) + + coord_fixed() + + theme_minimal() + + theme(legend.position = "none") +``` + + +#### A preview to MoleculeExperiment + +The R package MoleculeExperiment includes functions to create and manipulate objects from the newly introduced MoleculeExperiment class, designed for analyzing molecule-based spatial transcriptomics data from platforms such as Xenium by 10X, CosMx SMI by Nanostring, and Merscope by Vizgen, among others. + +Although in this session we will not use `MoleculeExperiment` class, because of the lack of segmentation boundary information (we rather have cell identifiers), we briefly introduce this package because as an important part of Bioconductor. + +We show how we would import our table of probe location into a `MoleculeExperiment`. At the end of the Session, for knowledge, we will navigate the example code given in the [vignette material](https://www.bioconductor.org/packages/release/bioc/vignettes/MoleculeExperiment/inst/doc/MoleculeExperiment.html). + + +```{r} +library(MoleculeExperiment) + +me = + tx_small_region |> + select(sample_id, gene, x, y) |> + dataframeToMEList( + dfType = "molecules", + assayName = "detected", + sampleCol = "sample_id", + factorCol = "gene", + xCol = "x", + yCol = "y" + ) |> + MoleculeExperiment() + +me +``` + +#### A preview of a zoomed in section of the tissue + +Now let's try to visualise just a small section. You can appreciate, coloured by cell, single molecules. You cqn also appreciate the difference in density between regions. An aspect to note, is that not all probes are withiin cells. This depends on the segmentation process. + +```{r, fig.width=7, fig.height=8} +brewer.pal(7, "Set1") + +tx_small_region |> + filter(!is.na(cell)) |> + slice_sample(prop = 0.2) |> + ggplot(aes(x, y, colour = factor(cell))) + + geom_point(shape=".") + + + facet_wrap(~sample_id, ncol = 2) + + scale_color_manual(values = sample(colorRampPalette(brewer.pal(8, "Set2"))(1800))) + + coord_fixed() + + theme_minimal() + + theme(legend.position = "none") +``` + + +Let's have a look to the probes that have not being unassigned to cells. + +```{r, fig.width=7, fig.height=8} + +tx_small_region |> + filter(is.na(cell)) |> + ggplot(aes(x, y, colour = factor(cell))) + + geom_point(shape=".") + + + facet_wrap(~sample_id, ncol = 2) + + scale_color_manual(values = sample(colorRampPalette(brewer.pal(8, "Set2"))(1800))) + + coord_fixed() + + theme_minimal() + + theme(legend.position = "none") +``` + + +We can appreciate how, even subsampling the data 1 in 500, we still have a vast amount of data to visualise. + + +### 2. Aggregation and analysis + +We will convert our cell by gene count to a `SpatialExperiment`. This object stores a cell by gene matrix with relative XY coordinates. + + +```{r} + +tx_spe = + spatialOmicsWorkshop2024::tx_spe |> + + # Scaling and tranformation + logNormCounts() + +``` + +https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html + +```{r, fig.width=7, fig.height=8} +tx_spe |> + + # Gene variance + scran::modelGeneVar(block = tx_spe$sample_id) |> + as.tibble(rownames = "feature") |> + + # Plot + ggplot(aes(mean, total)) + + geom_point() + + geom_smooth(color="red")+ + xlab("Mean log-expression") + + ylab("Variance") + + theme_bw() + +``` + + +```{r} +# Get the top 2000 genes. +top.hvgs = + tx_spe |> + + # grouping + scran::modelGeneVar(block = tx_spe$sample_id) |> + + # Model gene variance and select variable genes per sample + getTopHVGs(n=200) +``` + + +```{r} +tx_spe_sample_1 = + tx_spe |> + filter(sample_id=="1") |> + slice_sample(prop = 0.2) + +# The selected subset of genes can then be passed to the subset.row argument (or equivalent) in downstream steps. This process is demonstrated below for the PCA step: + +tx_spe_sample_1 = + tx_spe_sample_1 |> + fixedPCA( subset.row=top.hvgs ) + +``` + + +We then use the gene expression to cluster sales based on their similarity and represent these clusters in a two dimensional embeddings (UMAP) + +```{r} +tx_spe_sample_1 = + tx_spe_sample_1 |> + mutate(clusters = + tx_spe_sample_1 |> + clusterCells( + use.dimred="PCA", + BLUSPARAM=bluster::NNGraphParam(k=20, cluster.fun="louvain") + ) |> + as.character() +) + + +## Check how many +tx_spe_sample_1 = + tx_spe_sample_1 |> + runUMAP() + +``` + +```{r, fig.width=7, fig.height=8} +tx_spe_sample_1 |> + plotUMAP(colour_by = "clusters") + + scale_color_discrete( + brewer.pal(n = 30, name = "Set1") + ) +``` + +Plot ground truth in tissue map + +```{r, fig.width=7, fig.height=8} + +tx_spe_sample_1 |> + mutate(in_tissue = TRUE) |> + plotSpots(annotate = "region", + palette = colorRampPalette( brewer.pal(9,"Set1") )(150) + ) + + guides(color = "none") + +``` + +### 3. Cell type caracterisation + +```{r} + + +``` + +### 4. Neighborhood analyses + + +https://www.bioconductor.org/packages/release/bioc/vignettes/hoodscanR/inst/doc/Quick_start.html + + +https://divingintogeneticsandgenomics.com/post/neighborhood-cellular-niches-analysis-with-spatial-transcriptome-data-in-seurat-and-bioconductor/ + +:::: {.note} +**Excercise** + +**Spatial-aware clustering:** Apply the spatial aware clustering method BANKSY. Taking as example the code run for session 2. +:::: + +### 5. Neighborhood analysis + +```{r, fig.width=7, fig.height=8, message=FALSE} + +library(hoodscanR) +library(scico) +library(ggspavis) + +``` + +In order to perform neighborhood scanning, we need to firstly identify k (in this example, k = 100) nearest cells for each cells. The searching algorithm is based on Approximate Near Neighbor (ANN) C++ library from the RANN package. + +```{r} +tx_spe_neighbours = + tx_spe_sample_1 |> + readHoodData(anno_col = "clusters") |> + findNearCells(k = 100) + +tx_spe_neighbours +``` + +The output of findNearCells function includes two matrix, an annotation matrix and a distance matrix. + +```{r} +tx_spe_neighbours$cells[1:10, 1:5] + +tx_spe_neighbours$distance[1:10, 1:5] + +``` + +We can then perform neighborhood analysis using the function scanHoods. This function incldue the modified softmax algorithm, aimming to genereate a matrix with the probability of each cell associating with their 100 nearest cells. + +```{r} + # Calculate neighbours +pm <- scanHoods(tx_spe_neighbours$distance) + + # We can then merge the probabilities by the cell types of the 100 nearest cells. We get the probability distribution of each cell all each neighborhood. +hoods <- mergeByGroup(pm, tx_spe_neighbours$cells) + +hoods |> head() +``` + + +We plot randomly plot 10 cells to see the output of neighborhood scanning using plotHoodMat. In this plot, each value represent the probability of the each cell (each row) located in each cell type neighborhood. The rowSums of the probability maxtrix will always be 1. + +```{r, fig.width=7, fig.height=8} +hoods |> plotHoodMat(n = 10, hm_height = 5) +``` + +We can then merge the neighborhood results with the `SpatialExperiment` object using `mergeHoodSpe` so that we can conduct more neighborhood-related analysis. + +```{r} +tx_spe_sample_1 = tx_spe_sample_1 |> mergeHoodSpe(hoods) +``` + +We can see what are the neighborhood distributions look like in each cluster using `plotProbDist.` + +```{r, fig.width=7, fig.height=8} +tx_spe_sample_1 |> + plotProbDist( + pm_cols = colnames(hoods), + by_cluster = TRUE, + plot_all = TRUE, + show_clusters = as.character(seq(10)) + ) +``` + + +### 6. Supplementary: `MoleculeExperiment` package + +https://www.bioconductor.org/packages/release/bioc/vignettes/MoleculeExperiment/inst/doc/MoleculeExperiment.html + +```{r, fig.width=7, fig.height=8} + +library(MoleculeExperiment) + +repoDir = system.file("extdata", package = "MoleculeExperiment") +repoDir = paste0(repoDir, "/xenium_V1_FF_Mouse_Brain") + +me = readXenium(repoDir, keepCols = "essential") +me + + +repoDir = system.file("extdata", package = "MoleculeExperiment") +repoDir = paste0(repoDir, "/xenium_V1_FF_Mouse_Brain") +boundaries(me, "nucleus") = readBoundaries( + dataDir = repoDir, + pattern = "nucleus_boundaries.csv", + segmentIDCol = "cell_id", + xCol = "vertex_x", + yCol = "vertex_y", + keepCols = "essential", + boundariesAssay = "nucleus", + scaleFactorVector = 1 +) + + + +boundaries(me, "cell") +showMolecules(me) + +bds_colours = setNames( + c("#aa0000ff", "#ffaaffff"), + c("Region 1", "Region 2") +) + +ggplot_me() + + # add cell segments and colour by cell id + geom_polygon_me(me, byFill = "segment_id", colour = "black", alpha = 0.1) + + # add molecule points and colour by feature name + geom_point_me(me, byColour = "feature_id", size = 0.1) + + # add nuclei segments and colour the border with red + geom_polygon_me(me, assayName = "nucleus", fill = NA, colour = "red") + + # zoom in to selected patch area + coord_cartesian(xlim = c(4900, 4919.98), ylim = c(6400.02, 6420)) +``` + + +**Session Information** + +```{r} +sessionInfo() +``` + +**References** + +```{css echo=FALSE} +.note { + margin: 30px; + padding: 1em; + background: #FFF8F0; + border: 1px solid #EFE8E0; + border-radius: 10px; +} +``` diff --git a/vignettes/tidyGenomicsTranscriptomics.Rmd b/vignettes/tidyGenomicsTranscriptomics.Rmd deleted file mode 100644 index 7164cf6..0000000 --- a/vignettes/tidyGenomicsTranscriptomics.Rmd +++ /dev/null @@ -1,778 +0,0 @@ ---- -title: "Tidy genomic and transcriptomic single-cell analyses" -author: - - Maria Doyle, Peter MacCallum Cancer Centre^[] - - Stefano Mangiola, Walter and Eliza Hall Institute^[] - - Michael Love, UNC-Chapel Hill^[] -output: rmarkdown::html_vignette -bibliography: "`r file.path(system.file(package='tidySpatialWorkshop2024', 'vignettes'), 'tidyomics.bib')`" -vignette: > - %\VignetteIndexEntry{Tidy genomic and transcriptomic single-cell analyses} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE, cache = FALSE) -``` - -## Instructors - -*Dr. Stefano Mangiola* is currently a Postdoctoral researcher in the -laboratory of Prof. Tony Papenfuss at the Walter and Eliza Hall -Institute in Melbourne, Australia. His background spans from -biotechnology to bioinformatics and biostatistics. His research -focuses on prostate and breast tumour microenvironment, the -development of statistical models for the analysis of RNA sequencing -data, and data analysis and visualisation interfaces. - -*Dr. Michael Love* is an Associate Professor at UNC-Chapel Hill in the -Department of Genetics and the Department of Biostatistics. His -background is in Statistics and Computational Biology. His research -is highly collaborative, working with Geneticists, Molecular -Biologists, Epidemiologists, and Computer Scientists on methods for -transcriptomics, epigenetics, GWAS, and high-throughput functional -screens. - -## Workshop goals and objectives - -### What you will learn - -- Basic `tidy` operations possible with `tidySingleCellExperiment` - and `GRanges` -- How to interface `SingleCellExperiment` with tidy manipulation and - visualisation -- A real-world case study that will showcase the power of `tidy` - single-cell methods compared with base/ad-hoc methods -- Examples of how to integrate genomic and transcriptomic data - (ChIP-seq and RNA-seq) - -### What you will *not* learn - -- The molecular technology of single-cell sequencing -- The fundamentals of single-cell data analysis -- The fundamentals of tidy data analysis -- Detailed data integration methods (multi-view or multi-omics - algorithms) - -## Getting started - -### Local - -We will use the Cloud during the workshop and this method is available -if you want to run the material after the workshop. If you want to -install on your own computer, see instructions -[here](https://tidyomics.github.io/tidySpatialWorkshop2024/index.html#workshop-package-installation). - -Alternatively, you can view the material at the workshop webpage -[here](https://tidyomics.github.io/tidySpatialWorkshop2024/articles/main.html). - -## Introduction slides - -We provide two links to introductory material (to consult outside of -the workshop slot): - -1. [Introduction to tidytranscriptomics](https://docs.google.com/gview?url=https://raw.githubusercontent.com/tidyomics/tidySpatialWorkshop2024/master/inst/tidytranscriptomics_slides.pdf) -2. [Introduction to tidy genomics](https://github.com/tidyomics/tidy-genomics-talk/blob/main/tidy-genomics-talk.pdf) - -# Part 1 Introduction to tidySingleCellExperiment - -```{r message = FALSE} -# Load packages -library(SingleCellExperiment) -library(ggplot2) -library(plotly) -library(dplyr) -library(colorspace) -library(dittoSeq) -``` - -SingleCellExperiment is a popular data object in the Bioconductor -ecosystem for representing single-cell datasets, which works -seamlessly with numerous Bioconductor packages and methods written by -different groups [@Amezquita2020]. It can easily be converted to and -from other formats such as SeuratObject and scverse's AnnData. - -Here we load single-cell data in SingleCellExperiment object -format. This data is peripheral blood mononuclear cells (PBMCs) from -metastatic breast cancer patients. - -```{r} -# load single cell RNA sequencing data -sce_obj <- tidySpatialWorkshop2024::sce_obj - -# take a look -sce_obj -``` - -tidySingleCellExperiment provides a bridge between the -SingleCellExperiment single-cell package and the tidyverse -[@wickham2019welcome]. It creates an invisible layer that enables -viewing the SingleCellExperiment object as a tidyverse tibble, and -provides SingleCellExperiment-compatible *dplyr*, *tidyr*, *ggplot* -and *plotly* functions. - -For related work for bulk RNA-seq, see the -[tidybulk](https://stemangiola.github.io/tidybulk/) package [@Mangiola2021]. - -If we load the *tidySingleCellExperiment* package and then view the -single cell data, it now displays as a tibble. - -```{r message = FALSE} -library(tidySingleCellExperiment) - -sce_obj -``` - -If we want to revert to the standard SingleCellExperiment view we can do that. - -```{r} -options("restore_SingleCellExperiment_show" = TRUE) -sce_obj -``` - -If we want to revert back to tidy SingleCellExperiment view we can. - -```{r} -options("restore_SingleCellExperiment_show" = FALSE) -sce_obj -``` - -It can be interacted with using -[SingleCellExperiment commands](https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) -such as `assays`. - -```{r} -assays(sce_obj) -``` - -We can also interact with our object as we do with any tidyverse tibble. - -## Tidyverse commands - -We can use tidyverse commands, such as `filter`, `select` and `mutate` -to explore the tidySingleCellExperiment object. Some examples are -shown below and more can be seen at the tidySingleCellExperiment -website -[here](https://stemangiola.github.io/tidySingleCellExperiment/articles/introduction.html#tidyverse-commands-1). - -We can use `filter` to choose rows, for example, to see just the rows for the cells in G1 cell-cycle stage. - -```{r} -sce_obj |> filter(Phase == "G1") -``` - -:::: {.note} -Note that *rows* in this context refers to rows of the abstraction, -not *rows* of the SingleCellExperiment which correspond to -genes. *tidySingleCellExperiment* prioritizes cells as the units of -observation in the abstraction, while the full dataset, including -measurements of expression of all genes, is still available "in the -background". -:::: - -We can use `select` to view columns, for example, to see the filename, -total cellular RNA abundance and cell phase. - -If we use `select` we will also get any view-only columns returned, -such as the UMAP columns generated during the preprocessing. - -```{r} -sce_obj |> select(.cell, file, nCount_RNA, Phase) -``` - -We can use `mutate` to create a column. For example, we could create a -new `Phase_l` column that contains a lower-case version of `Phase`. - -```{r message=FALSE} -sce_obj |> - mutate(Phase_l = tolower(Phase)) |> - select(.cell, Phase, Phase_l) -``` - -We can use tidyverse commands to polish an annotation column. We will -extract the sample, and group information from the file name column -into separate columns. - -```{r message=FALSE} -# First take a look at the file column -sce_obj |> select(.cell, file) -``` - -```{r} -# Create column for sample -sce_obj <- sce_obj |> - # Extract sample - extract(file, "sample", "../data/.*/([a-zA-Z0-9_-]+)/outs.+", remove = FALSE) - -# Take a look -sce_obj |> select(.cell, sample, everything()) -``` - -We could use tidyverse `unite` to combine columns, for example to -create a new column for sample id combining the sample and patient id -(BCB) columns. - -```{r message=FALSE} -sce_obj <- sce_obj |> unite("sample_id", sample, BCB, remove = FALSE) - -# Take a look -sce_obj |> select(.cell, sample_id, sample, BCB) -``` - - -# Part 2 Signature visualisation - -## Data pre-processing - -The object `sce_obj` we've been using was created as part of a study -on breast cancer systemic immune response. Peripheral blood -mononuclear cells have been sequenced for RNA at the single-cell -level. The steps used to generate the object are summarised below. - -- `scran`, `scater`, and `DropletsUtils` packages have been used to - eliminate empty droplets and dead cells. Samples were individually - quality checked and cells were filtered for good gene coverage . - -- Variable features were identified using `modelGeneVar`. - -- Read counts were scaled and normalised using logNormCounts from `scuttle`. - -- Data integration was performed using `fastMNN` with default parameters. - -- PCA performed to reduce feature dimensionality. - -- Nearest-neighbor cell networks were calculated using 30 principal components. - -- 2 UMAP dimensions were calculated using 30 principal components. - -- Cells with similar transcriptome profiles were grouped into clusters using Louvain clustering from `scran`. - -## Analyse custom signature - -The researcher analysing this dataset wanted to identify gamma delta T -cells using a gene signature from a published paper -[@Pizzolato2019]. We'll show how that can be done here. - -With tidySingleCellExperiment's `join_features` we can view the counts -for genes in the signature as columns joined to our single cell -tibble. - -```{r} -sce_obj |> - join_features(c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide") -``` - -We can use tidyverse `mutate` to create a column containing the -signature score. To generate the score, we scale the sum of the 4 -genes, CD3D, TRDC, TRGC1, TRGC2, and subtract the scaled sum of the 2 -genes, CD8A and CD8B. `mutate` is powerful in enabling us to perform -complex arithmetic operations easily. - -```{r} -sce_scored <- sce_obj |> - - join_features(c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"), shape = "wide") |> - - mutate( - signature_score = - scales::rescale(CD3D + TRDC + TRGC1 + TRGC2, to = c(0, 1)) - - scales::rescale(CD8A + CD8B, to = c(0, 1)) - ) - -sce_scored |> - - select(.cell, signature_score, everything()) -``` - -The gamma delta T cells could then be visualised by the signature -score using Bioconductor's visualisation functions. - -```{r} -sce_scored |> - - scater::plotUMAP(colour_by = "signature_score") -``` - -The cells could also be visualised using the popular and powerful -`ggplot2` package, enabling the researcher to use ggplot functions -they were familiar with, and to customise the plot with great -flexibility. - -```{r} -sce_scored |> - - # plot cells with high score last so they're not obscured by other cells - arrange(signature_score) |> - - ggplot(aes(UMAP_1, UMAP_2, color = signature_score)) + - geom_point() + - scale_color_distiller(palette = "Spectral") + - tidySpatialWorkshop2024::theme_multipanel -``` - -For exploratory analyses, we can select the gamma delta T cells, the -red cluster on the left with high signature score. We'll filter for -cells with a signature score > 0.7. - -```{r} -sce_obj_gamma_delta <- - - sce_scored |> - - # Proper cluster selection should be used instead (see supplementary material) - filter(signature_score > 0.7) -``` - -For comparison, we show the alternative using base R and -SingleCellExperiment. Note that the code contains more redundancy and -intermediate objects. - -```{r eval=FALSE} -counts_positive <- - assay(sce_obj, "logcounts")[c("CD3D", "TRDC", "TRGC1", "TRGC2"), ] |> - colSums() |> - scales::rescale(to = c(0, 1)) - -counts_negative <- - assay(sce_obj, "logcounts")[c("CD8A", "CD8B"), ] |> - colSums() |> - scales::rescale(to = c(0, 1)) - -sce_obj$signature_score <- counts_positive - counts_negative - -sce_obj_gamma_delta <- sce_obj[, sce_obj$signature_score > 0.7] -``` - -We can then focus on just these gamma delta T cells and chain -Bioconductor and tidyverse commands together to analyse. - -```{r warning=FALSE, message=FALSE} -library(batchelor) -library(scater) - -sce_obj_gamma_delta <- - - sce_obj_gamma_delta |> - - # Integrate - using batchelor. - multiBatchNorm(batch = colData(sce_obj_gamma_delta)$sample) |> - fastMNN(batch = colData(sce_obj_gamma_delta)$sample) |> - - # Join metadata removed by fastMNN - using tidyverse - left_join(as_tibble(sce_obj_gamma_delta)) |> - - # Dimension reduction - using scater - runUMAP(ncomponents = 2, dimred = "corrected") -``` - -Visualise gamma delta T cells. As we have used rough threshold we are -left with only few cells. Proper cluster selection should be used -instead (see supplementary material). - -```{r} -sce_obj_gamma_delta |> plotUMAP() -``` - - -It is also possible to visualise the cells as a 3D plot using plotly. -The example data used here only contains a few genes, for the sake of -time and size in this demonstration, but below is how you could -generate the 3 dimensions needed for 3D plot with a full dataset. - -```{r eval = FALSE} -single_cell_object |> - RunUMAP(dims = 1:30, n.components = 3L, spread = 0.5, min.dist = 0.01, n.neighbors = 10L) -``` - -We'll demonstrate creating a 3D plot using some data that has 3 UMAP -dimensions. This is a fantastic way to visualise both reduced -dimensions and metadata in the same representation. - -```{r umap plot 2, message = FALSE, warning = FALSE} -pbmc <- tidySpatialWorkshop2024::sce_obj_UMAP3 - -pbmc |> - plot_ly( - x = ~`UMAP_1`, - y = ~`UMAP_2`, - z = ~`UMAP_3`, - color = ~cell_type, - colors = dittoSeq::dittoColors() - ) %>% - add_markers(size = I(1)) -``` - -## Exercises - -Using the `sce_obj`: - -1. What proportion of all cells are gamma-delta T cells? Use - signature_score > 0.7 to identify gamma-delta T cells. - -2. There is a cluster of cells characterised by a low RNA output - (nCount_RNA < 100). Identify the cell composition (cell_type) of - that cluster. - -# Part 3 Genomic and transcriptomic data integration - -So far we have examined expression of genes across our cells, but we -haven't considered the genomic location of the genes. In this section -we will operate on genomic location information using the *plyranges* -[@Lee2019] package, and tie this to the single cell transcriptomics -data we have been examining. - -*plyranges* provides tidyverse-style functionality to genomic ranges -(GRanges objects [@Lawrence2013]) analogous to how -*tidySingleCellExperiment* provides functionality for -SingleCellExperiment objects. For an example of a workflow using -*plyranges* as part of a bulk RNA-seq analysis, see @Lee2020. - -:::: {.note} -In some pipelines (e.g. using *tximeta* [@Love2020] to import bulk or -single-cell quantification data) our SingleCellExperiment or -SummarizedExperiment would already have genomic range information -attached to the rows of the object. That is, we would have information -about the starts and ends of the genes, their strand information, even -the lengths of the chromosomes and the genome build, etc. -:::: - -Before we start our integrative analysis, we will first take some -steps to add hg38 range information on our genes, matching by the gene -symbol provided on the rows of `sce_obj`. We add genes from the -*ensembldb* package [@Rainer2019] (to see how to add a particular -version of Ensembl, see the package vignette). - -```{r message=FALSE} -# what our rownames look like: gene symbols -sce_obj |> rownames() |> head() - -# we recommend Ensembl or GENCODE gene annotations -edb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86 # hg38 -g <- ensembldb::genes(edb) -head(genome(g)) -``` - -We can examine the first three genes and their metadata: - -```{r message=FALSE} -library(plyranges) -g |> slice(1:3) -``` - -Our first task is to subset `g`, the human genes, to the ones in our -SingleCellExperiment. While Ensembl IDs are unique, gene symbols are -not, so we will have to also remove any duplicate gene symbols -(recommend working with Ensembl IDs as feature identifiers when -possible). - -We subset the columns of `g` to just the `symbol` column, using the -`select()` function. *plyranges* enables us to use familiar tidy verbs -on GRanges objects, e.g. to `filter` rows or to `select` columns of -metadata attached to the ranges. - -```{r} -g <- g |> - - select(symbol) - -g |> slice(1:3) -``` - -Now we filter the genes of our SingleCellExperiment to only those -present as rownames of `sce_obj`, and remove duplicate IDs. We then -sort by genomic location, and use the gene symbols as names of the -ranges. - -```{r} -gene_names <- sce_obj |> rownames() - -g <- g |> - - filter(symbol %in% gene_names) |> - filter(!duplicated(symbol)) |> - sort() # genomic position sorting - -names(g) <- g$symbol -``` - -Now we can add our gene information to the rows of the -`sce_obj`. - -:::: {.note} -Again, these following steps would not be necessary using a -pipeline that imports quantification data to ranged objects, but it -isn't too hard to add manually. If adding ranges manually, make note -of the provenance of where you downloaded the information (Ensembl, -GENCODE or otherwise), and assign a genome build if you plan on -sharing the final object (e.g. with `genome(x) <- "..."`). -:::: - -```{r} -all(names(g) %in% rownames(sce_obj)) -sce_sub <- sce_obj[ names(g), ] # put SCE in order of `g` -rowRanges(sce_sub) <- g # assign ranges `g` to the SCE -sce_sub |> rowRanges() # peek at the ranges -``` - -Now let's do something interesting with the gene ranges: let's see if -genes near peaks of active chromatin marks (H3K4me3 measured with -ChIP-seq) in another experiment involving PBMC have a difference in -their expression level compared to other genes. - -There are many sources of epigenetic tracks, ENCODE, Roadmap, -etc. Here we will use some ENCODE [@ENCODE2012] data available in -*AnnotationHub* on Bioconductor. - -:::: {.note} -We had to do a bit of book-keeping first. These peaks were in -the hg19 genome build, so we have ran the following commands (here -un-evaluated) to 'lift" the peaks to the hg38 genome build, and -provide proper sequence information. -:::: - -```{r eval=FALSE} -### un-evaluated code chunk ### -# AnnotationHub contains many useful genome tracks -library(AnnotationHub) -ah <- AnnotationHub() -# can be queried with keywords -query(ah, c("Peripheral_Blood","h3k4me3")) -# downloading a particular resource -peaks <- ah[["AH44823"]] -library(rtracklayer) -# lifting hg19 to hg38 -new_peaks <- unlist( - liftOver( - peaks, - chain = import.chain("~/Downloads/hg19ToHg38.over.chain") - ) -) -# Ensembl-style chrom names -seqlevelsStyle(new_peaks) <- "NCBI" -# bring over any missing chroms -seqlevels(new_peaks) <- seqlevels(sce_sub) -# bring over chrom lengths -seqinfo(new_peaks) <- seqinfo(sce_sub) -# subset to a few columns -pbmc_h3k4me3_hg38 <- new_peaks |> - select(signalValue, qValue, peak) -# save for reloading below -save(pbmc_h3k4me3_hg38, file="data/pbmc_h3k4me3_hg38.rda", compress="xz") -``` - -Loading those peaks. - -```{r} -# loading those H3K4me3 peaks for PBMCs -peaks <- tidySpatialWorkshop2024::pbmc_h3k4me3_hg38 -plot(peaks$qValue, type="l", ylab="q-value", main="H3K4me3 peaks") -abline(v=5000, lty=2) -``` - -We will arbitrarily chose to take the top 5,000 peaks by p-value. How -to chose the number of epigenetic peaks and genes to consider in a -multi-omics analysis is a more involved question, and is worth -considering the impact of such a choice for enrichment analysis. - -```{r} -peaks <- peaks |> - - slice(1:5000) |> # ordered already by p-value (most to least sig) - sort() # genomic position sorting -``` - -It is easy to compute the distance of the genes in our -SingleCellExperiment to the nearest H3K4me3 peak, using *plyranges* -convenience functions. - -See the -[plyranges package website](https://sa-lee.github.io/plyranges/reference/index.html) -for a full listing of such functions. - -```{r} -dists <- rowRanges(sce_sub) |> - - anchor_5p() |> # anchor 5 prime end - mutate(width=1) |> # shrink range to 1bp - add_nearest_distance(peaks) - -hist(log10(dists$distance + 1), breaks=40, - main="gene-to-peak distance", xlab="log10 distance") -``` - -Let's put the genes into 4 different bins by their distance to a -H3K4me3 peak: 1) overlapping [distance of 0], 2) 1bp-10kb, -3) 10kb-100kb, or 4) farther than 100kb. - -```{r} -bins <- c(0,1,1e4,1e5,Inf) - -dists <- dists |> - - select(symbol, distance, .drop_ranges = TRUE) |> # remove chr/pos/strand etc. - as_tibble() |> - mutate(distance_bin = cut(distance, bins, include.lowest = TRUE)) |> - rename(feature = symbol) # this will help us add information to the SCE -``` - -We can see how many genes are in which category: - -```{r} -dists |> - - ggplot(aes(distance_bin)) + - geom_bar() -``` - -We will now take the SingleCellExperiment data and compress it down to -a SummarizedExperiment using the `aggregate_cells` function from -*tidySingleCellExperiment*. After this function, every column of the -SummarizedExperiment is a cell type. - -We immediately pipe the SummarizedExperiment into a `left_join` where -we add the distance-to-peak information, and then into a `nest` -command where we create a nested (tidy)SummarizedExperiment object, -where we have grouped by the distance bin that the genes fall into. - - -```{r message=FALSE} -library(tidySummarizedExperiment) -library(purrr) -nested <- sce_sub |> - - aggregate_cells(cell_type) |> - left_join(dists, by="feature") |> - nest(se = -distance_bin) - -nested -``` - -We can now operate on the SummarizedExperiment objects that are within -`nested`. For example, we can extract the column means of the counts -(cell-type-specific means of counts over genes). - -We save this as a new tibble called `smry` as it contains summary -information. - -```{r} -smry <- nested |> - - mutate(summary = map(se, \(se) { - tibble(count_mean = colMeans(assay(se)), - cell_type = se$cell_type) - }) - ) |> - select(-se) |> - unnest(summary) |> - mutate(cell = substr(cell_type, 1, 13)) # abbreviate cell_type for plot -``` - -As a final plot, we can look at the mean count over the genes in a -particular distance bin (x-axis), across the cell types (y-axis). The -different lines show how the genes' proximity to H3K4me3 peaks affects -the average count. - -```{r} -library(forcats) -smry |> - - mutate(cell = fct_reorder(cell, count_mean, .fun=median)) |> - ggplot(aes(count_mean, cell, color=distance_bin, group=distance_bin)) + - geom_point() + - geom_line(orientation="y") + - xlab("mean of count over genes in bin") + - ylab("cell type") -``` - -We finish this section with a re-cap of the steps we took: - -1. calculated distances from genes to H3K4me3 peaks -2. added distance information onto the SingleCellExperiment -3. condensed the dataset via pseudobulking -4. computed the mean count (over genes) within 4 bins of genes -5. visualized these mean counts over cell types - -Let's consider a number of issues with this first-pass analysis and -visualization: - -1. we arbitrarily thresholded the peaks at the beginning of the - analysis to take the top 5,000 -2. we just computed the mean count, not taking into account total - reads per cell type (sequencing depth per cell x number of cells) -3. we are comparing cell-type-specific expression to aggregate - ChIP-seq peaks in PBMC -4. the two experiments, while both PBMC, come from different projects, - different labs, and one is from cancer patients, while the other is - from the ENCODE project -5. we only plot the mean count over genes, perhaps it would be better - to show more information about the distribution - -:::: {.note} -As an alternative to (1), we could have counted the number of peaks -that were near each genes, or even computed on the metadata of the -peaks (signal value, etc.). An example follows of how we could have -counted overlaps instead (how many peaks within 20kb). -:::: - -```{r} -overlaps <- rowRanges(sce_sub) |> - - anchor_5p() |> - mutate(width=1) %>% - mutate(num_overlaps = count_overlaps(., peaks, maxgap=2e4)) - -table(overlaps$num_overlaps) -``` - -:::: {.note} -Another question that often arises is whether the distances or -overlaps we observe are more or less than we would expect if there -were no relationship between the positions of genes and peaks. -To answer questions like these, we have developed the *nullranges* -package, which allows for either bootstrapping of ranges throughout -the genome [@Mu2023], or selection of a covariate-matched background -set [@Davis2023], to compute probabilities under the null hypothesis. -A brief example follows of bootstrapping some of the peaks in a window -of chr1:20Mb-30Mb (see *nullranges* website for more comprehensive -examples). -:::: - -```{r} -library(nullranges) - -# make a small range for demonstration of bootstrapping -seg <- data.frame(seqnames=1, start=20e6, width=10e6) |> - as_granges() - -# generate a genome segmentation from this one range -seg <- oneRegionSegment(seg, seqlength=seqlengths(peaks)[[1]]) -seg - -# bootstrap peaks within this segment (just for demo) -peaks |> - filter_by_overlaps(seg[2]) |> - bootRanges(blockLength=1e6, seg=seg, R=10) - -# these bootstrapped ranges could then be used for computing -# generic test statistics, with the `iter` column used for -# building a null distribution -``` - -**Session Information** - -```{r} -sessionInfo() -``` - -**References** - -```{css echo=FALSE} -.note { - margin: 30px; - padding: 1em; - background: #FFF8F0; - border: 1px solid #EFE8E0; - border-radius: 10px; -} -```