scATAC-seq: Comparing chromatin accessibility in hematopoietic stem cells (HSCs) of young vs aged mice
I wrote this R/Bioconductor analysis pipeline for single-cell ATAC-seq (scATAC-seq) data to examine differential chromatin accessibility in hematopoietic stem cells (HSCs) from 10-week-old 'young' mice vs 20-month-old 'aged' mice. The data come from the NCBI Gene Expression Omnibus (GEO) repository and were produced by the Cell Ranger ATAC pipeline, which carries out the following steps:
- demultiplexing of raw BCL base call files into FASTQ files
- read alignment
- barcode counting
- peak calling with reference “refdata-cellranger-arc-mm10-2020-A-2.0.0”
- outputs: BED file of peaks, CSV file with cell barcode metadata, TSV/BED file of each unique fragment and its associated cell barcode, etc.

The script then performs the following steps:
- Data Loading and Preprocessing: Loads the necessary R packages (Signac, Seurat, tidyverse, etc.) and reads in the BED peak files, CSV cell barcode metadata, and TSV/BED fragment files for the young and aged single-cell ATAC-seq datasets.
- Peak Filtering and Common Set Creation: Builds a common peak set by merging (reducing) the overlapping peaks from the individual datasets, then filters out peaks that fail the width criteria.
- Initial Quality Control: Filters out low-quality cells based on specific cutoffs for various quality control metrics.
- Count Matrix Generation: Generates count matrices for both young and aged datasets based on the common peak set.
- Seurat Object Creation: Constructs Seurat objects for each dataset.
- Integration and Dimensionality Reduction: Integrates datasets, performs TF-IDF normalization followed by SVD, and conducts UMAP-based visualization and clustering.
- Gene Annotation and Analysis: Extracts gene annotations, adds them to the Seurat object, and conducts gene activity analysis.
- Normalization and Visualization of "RNA" Data: Normalizes gene activity "RNA" data, visualizes canonical marker genes, and identifies differential peaks between young and aged datasets.
- Visualization of Peaks: Visualizes coverage plots for selected peaks and their closest genomic features.
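
To make the "TF-IDF normalization followed by SVD" step above more concrete, here is a minimal base-R sketch on a toy peak-by-cell matrix. It is an illustration, not the script's actual code: the counts are made up, the log-scaled TF-IDF variant and the 1e4 scale factor are assumptions, and in the real pipeline this work is presumably delegated to Signac rather than written by hand.

```r
# Toy peak-by-cell count matrix (rows = peaks, columns = cells); values are made up.
counts <- matrix(
  c(2, 0, 1, 0,
    0, 3, 0, 1,
    1, 1, 0, 0,
    0, 0, 2, 2,
    4, 0, 0, 1,
    0, 1, 1, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("peak", 1:6), paste0("cell", 1:4))
)

# Term frequency: each cell's counts divided by that cell's total counts.
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: number of cells divided by each peak's total counts.
idf <- ncol(counts) / rowSums(counts)

# Log-scaled TF-IDF. idf has one entry per row, so R recycles it down each column.
# The 1e4 scale factor is an assumed, conventional choice.
tfidf <- log1p(tf * idf * 1e4)

# Singular value decomposition of the normalized matrix.
lsi <- svd(tfidf)

# Low-dimensional cell embedding: right singular vectors scaled by singular values.
n_dims <- 3  # hypothetical number of components to keep
embedding <- lsi$v[, 1:n_dims] %*% diag(lsi$d[1:n_dims])
rownames(embedding) <- colnames(counts)
colnames(embedding) <- paste0("LSI_", 1:n_dims)
```

Each row of embedding is one cell, and these reduced dimensions are the kind of input that UMAP and clustering operate on. With real data the first component often tracks sequencing depth rather than biology, so it is worth checking before using it downstream.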

Inputs:
- Individual peak files for young and aged datasets (youngPeaks.txt, agedPeaks.txt)
- Metadata files for young and aged datasets (GSM5723631_Young_HSC_singlecell.csv, GSM5723632_Aged_HSC_singlecell.csv)
- Fragment files for young and aged datasets (GSM5723631_Young_HSC_fragments.tsv.gz, GSM5723632_Aged_HSC_fragments.tsv.gz)
- Annotation file (EnsDb.Mmusculus.v79)
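
As a rough illustration of how the two peak files feed into the common peak set, the sketch below merges two toy peak tables with GenomicRanges and applies a width filter. The coordinates are invented stand-ins for youngPeaks.txt and agedPeaks.txt (whose exact column layout is not shown here), and the 20 bp and 10,000 bp cutoffs are assumptions, not necessarily the values the script uses.

```r
library(GenomicRanges)

# Toy stand-ins for the two peak tables (chromosome, start, end); coordinates are made up.
young_df <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                       start = c(100, 5000, 300),
                       end   = c(600, 5400, 310))
aged_df  <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                       start = c(400, 9000, 1000),
                       end   = c(900, 30000, 1500))

# Convert each table to a GRanges object.
young_gr <- makeGRangesFromDataFrame(young_df)
aged_gr  <- makeGRangesFromDataFrame(aged_df)

# Pool both peak sets and merge overlapping intervals into single peaks.
common_peaks <- reduce(c(young_gr, aged_gr))

# Drop peaks that are implausibly narrow or wide (assumed cutoffs).
min_width <- 20
max_width <- 10000
peak_widths <- width(common_peaks)
common_peaks <- common_peaks[peak_widths > min_width & peak_widths < max_width]
```

Quantifying both datasets against one shared peak set like this is what makes their count matrices directly comparable, since every row then refers to the same genomic interval in young and aged cells.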

Required R packages:
- Signac
- Seurat
- tidyverse
- patchwork
- GenomicRanges
- future
- EnsDb.Mmusculus.v79 (annotation)

Outputs:
- Seurat objects (combined, young, aged) containing integrated and processed single-cell ATAC-seq data
- Visualizations: plots for quality control, gene activity, differential peaks, UMAP embeddings, peak coverage, and more