This repository contains scripts used for generation of the meta-analyzed reference and other downstream analyses currently included in the manuscript. The repository can be broadly divided into 3 main sections: 1) QC and processing of each sequencing library 2) Integration benchmark, building of reference and annotations 3) Downstream analyses
All analyses were carried using R (v.4.0.3) and python (v.3.8)
The following packages are required to run the scripts:
Single cell data processing and integration:
- Seurat (v.4.1.0)
- celda (v.1.6.1)
- scDblFinder (v.1.4.0)
- tidyverse (v.1.3.1)
- data.table (v.1.14.2)
- reticulate (v.1.18)
- RColorBrewer (v.1.1.2)
- ggsci (v.2.9)
- scanorama (v.1.7.1)
- lisi (v.1.0)
- cluster (v.2.1.0)
- SeuratDisk (v.0.0.0.9019)
- celltypist (v.1.0.0)
LDSC workflow:
- numpy (v.1.19.2)
- scipy (v.1.5.2)
- pandas (v.1.4.2)
- cellex (v.1.2.2)
- matplotlib (v.3.5.2)
- snakemake (v.6.0.5)
Downstream analyses:
- monocle3 (v.1.0.0)
- CellChat (v.1.5.0)
- UCell (v.1.3.1)
- dorothea (v.1.8.0)
- biomaRt (v.2.46.0)
- viper (v.1.24.0)
- gprofiler2 (v.0.2.1)
- ggridges (v.0.5.3)
- stats (v.4.0.3)
Scripts used for QC and normalization of libraries can be found in the following dirs:
raw_counts_processing
As detailed in the manuscript, the main criteria for processing of each of the libraries includes:
- removal of doublets
- removal of ambient RNA
- Number of unique UMIs and unique genes expressed. % of reads mapped to the mitochondrial genome, Hb genes.
- dimensionality reduction
Scripts used for the benchmark and generation of the integrated reference were included in the following dirs:
integration_benchmarking
Main criteria/metrics calculated in the benchmark include:
- Running time
- Local Inverse Simpson Index (LISI):
- iLISI (measures degree of batch mixing)
- cLISI (measures separation of unique cell types)
- Silhouette scores calculated across 0.8-1.8 resolutions
Scripts used for building the reference were included in the following dirs:
ref_integration
automated_annotations
Level 1 annotations were done through Transfer Learning using the Tabula Sapiens cardiovascular subset as reference. T. Sapiens data was redownloaded and reprocessed using SCTransform normalization to match processing and normalization parameters of the built reference.
Level 2 annotations were done through a combination of automated annotation and manual curation. For the automated portion we used celltypist with 4 different models as follows:
#!/bin/bash
# Annotate cell types based on Immune_All_AddPIP model (no majority voting)
celltypist --indata /project/cphg-millerlab/Jose/human_scRNA_meta_analysis/rds_objects/integration_rds_objects/rPCA/alsaigh_pan_wirka_hu_int/CELLECT_inputs/rpca_int_sct_v3_norm_counts.csv \
--model Immune_All_AddPIP.pkl \
--outdir /project/cphg-millerlab/Jose/human_scRNA_meta_analysis/rds_objects/integration_rds_objects/rPCA/alsaigh_pan_wirka_hu_int/celltypist_outputs/Immune_All_AddPIP \
--prefix celltypist_All_Immune_AddPIP_ \
--transpose-input \
--plot-results
Donwstream analyses including annotation and characterization of SMC phenotypes are included in the following dirs:
SMC_analyses
S-LDSC
Characterization of SMC modulated phenotypes includes:
- Enrichment of murine SMC gene signatures
- Pseudotime analyses
- Transcription Factor (TF activity inference) using human DoRothEA regulons
For LD score regression applied to specifically expressed genes (LDSC-SEG) analyses, we first generated a gene expression specificity matrix using SCTransformed-normalized gene expression for level 1 and 2 annotations using CELLEX. GWAS summary statistics were munged as follows:
#!/bin/bash
# Run the mtag_munge script to make sure that GWAS summary stats are correctly formatted for LDSC
python /project/cphg-millerlab/Jose/CELLECT/ldsc/mtag_munge.py \
--sumstats /project/cphg-millerlab/Jose/GWAS_summary_stats/GWAS_pulse_pressure_MVP/pp_MVP_summary_for_ldsc.txt.gz \
--merge-alleles /project/cphg-millerlab/Jose/CELLECT/data/ldsc/w_hm3.snplist \
--out /project/cphg-millerlab/Jose/human_scRNA_meta_analysis/CELLECT_analyses/munged_GWAS_summary_stats/pulse_pressure_MVP_GWAS/munged_pulse_pressure_MVP \
--n-value 300000 \
--keep-pval \
--p PVAL \
--chunksize 500000
Prior integration with GWAS summary statistics, we defined a cellect_config_multi_GWAS.yml
file with attributes pointing to munged GWAS summary statistics and the gene expression specificity matrix. We then ran the LDSC snakemake workflow as follows:
#!/bin/bash
# Load modules. Looks like python3.8 is needed for the script to finish running but need to do a few more tests.
module load anaconda/2020.11-py3.8
module load bzip2/1.0.6
module load snakemake/6.0.5
# Run ldsc analysis using CELLECT wrapper.
# Before running this command, it might be necessary to gunzip the gene expression specificity file. Somehow it seems to help make things run more smoothly.
snakemake --use-conda -j -s /project/cphg-millerlab/Jose/CELLECT/cellect-ldsc.snakefile --configfile cellect_config_multi_GWAS.yml