BREMSC is an R package (with core functions jointDIMMSC and BREMSC) for joint clustering droplet-based scRNA-seq and CITE-seq data. jointDIMMSC is developed as a direct extension of DIMMSC, which assumes full indenpendency between single cell RNA and surface protein data. To take the correlation between two data sources into consideration, we further develop BREMSC, which uses random effects to incorporate the two data sources. This package can directly work on raw count data from droplet-based scRNA-seq and CITE-seq experiments without any data transformation, and it can provide clustering uncertainty for each cell.
Version: 0.2.0 (Date: 2020-03-02)
Install BREMSC from Github
install.packages("devtools")
library(devtools)
install_github("tarot0410/BREMSC")
Or terminal command (first download BREMSC source file from Wei Chen's Lab website)
R CMD INSTALL BREMSC_0.1.0.tar
jointDIMMSC is developed as an extension of DIMMSC, which assumes full indenpendency between single cell RNA and surface protein data. We construct the joint likelihood of the two data sources as their product, and use EM algorithm for parameter inference. In practice, the computational speed for jointDIMMSC is much faster than BREMSC, but the model assumption is more stringent.
jointDIMMSC(dataProtein, dataRNA, K, useGene = 100, maxiter = 100, tol = 1e-04, lik.tol = 0.01)
- dataProtein : a D*C matrix with D proteins and C cells
- dataRNA : a G*C matrix with G genes and C cells
- K : number of desired clusters
- useGene : number of genes to keep
- maxiter : maximum number of iterations
- tol : a convergence tolerance for the difference of vector pie between iterations
- lik.tol : a convergence tolerance for the difference of log-likelihoods between iterations
jointDIMMSC returns a list object containing:
- clusterID : estimated label for each cell
- posteriorProb : a C*K matrix with probability that each cell belongs to each cluster
- logLik : estimated log likelihood
- alphaMtxProtein : a K*D matrix of alpha estimates for protein source
- alphaMtxRNA : a K*G matrix of alpha estimates for RNA source
# First load BREMSC R package
library(BREMSC)
# Next load the example simulated data (dataADT: protein data; dataRNA: RNA data)
data("dataADT")
data("dataRNA")
# Test run of jointDIMMSC
testRun <- jointDIMMSC(dataADT, dataRNA, K=4)
Similar to jointDIMMSC, BREMSC uses separate Dirichlet mixture priors to characterize variations across cell types for each data source, but it further uses random effects to incorporate the two data sources. A Bayesian framework with Gibbs-sampling is used for parameter estimation. The computational speed for BREMSC is much slower than jointDIMMSC. In practice, nMCMC>500 is suggested in real application, and running with more than 3 chains (set as a parameter) are strongly recommended for better stability. Also, a prescreening of RNA features is necessary to help reduce time and noise for BREM-SC. It is recommended to use less than 1000 RNA features.
BREMSC(dataProtein, dataRNA, K, nChains = 3, nMCMC = 1000, sd_alpha = c(0.5, 1.5), sd_b = c(0.2, 1), sigmaB = 0.8)
- dataProtein : a D*C matrix with D proteins and C cells
- dataRNA : a G*C matrix with G genes and C cells, recommended to do prescreening of RNA features (<=1000) before running BREM-SC
- K : number of desired clusters
- nChains : number of chains to run in parallel (need to have >= nChains number of threads available), 3-5 is recommended, default 3
- nMCMC : number of MCMCs, default 1000
- sd_alpha_range : range of random walk depth for alpha, each chain uses a different depth, default c(0.5, 1.5)
- sd_b_range : range of random walk depth for b, each chain uses a different depth, default c(0.2, 1)
- sigmaB : initial estimate of SD of random effects, default 0.8
BREMSC returns a list object containing:
- clusterID : estimated label for each cell
- posteriorProb : a C*K matrix with probability that each cell belongs to each cluster
- sdRF_final : estimated SD of random effects from last MCMC
- logLik_final : estimated log likelihood from last MCMC
- vecB : estimated random effects for each cell
- alphaMtxProtein : a K*D matrix of alpha estimates for protein source
- alphaMtxRNA : a K*G matrix of alpha estimates for RNA source
- vecLogLik : a vector of log likelihood for each MCMC (used for convergence check!)
- vecSDRF : a vector of estimated SD of random effects from each MCMC
# First load BREMSC R package
library(BREMSC)
# Next load the example simulated data (dataADT: protein data; dataRNA: RNA data)
data("dataADT")
data("dataRNA")
# Test run of BREMSC (using small number of MCMC here to save time)
testRun <- BREMSC(dataADT, dataRNA, K=4, nChains=2, nMCMC=100)
# Check convergence of log likelihoos
plot(testRun$vecLogLik, type = "l", xlab = "MCMC Iterations", ylab = "Log likelihood") # consider to increase the number of MCMCs if the log likelihood doesn't look like converged
- Xinjun Wang, Zhe Sun, Yanfu Zhang, Zhongli Xu, Hongyi Xin, Heng Huang, Richard H Duerr, Kong Chen, Ying Ding, Wei Chen. BREM-SC: A Bayesian Random Effects Mixture Model for Joint Clustering Single Cell Multi-omics Data. Nucleic Acids Research 2020.
Xinjun Wang ([email protected]), Wei Chen.