lixt314/SingleRMEA

SingleRMEA: Multiobjective Cell Type Discovery from Single-Cell RNA-seq data

Single-cell RNA-seq technologies provide new opportunities for biology: single-cell RNA-seq has become an accepted experimental method, and throughput improvements have enabled applications in cell type discovery. However, high-throughput applications of single-cell RNA-seq to solid tissues rely on formal cell type definitions. Unfortunately, it is unclear how to formulate such definitions so that crucial information on the cells' location can be analyzed further, because most data contain high levels of technical noise. To address this challenge, we present a computational method (SingleRMEA) that builds robust cell type classifiers, combining clustering and classification, for a large-scale PBMC dataset and human tissue sources composed of complex mixtures of cell types and subtypes. For clustering, clustering by fast search and find of density peaks (CDP) is employed to partition the cells into a few distinct subpopulations. For classification, we develop a new ensemble construction method to predict the cell type from single-cell RNA-seq data; it applies multiobjective optimization to the stacking ensemble construction process to generate domain-specific configurations under two objective functions. To validate SingleRMEA, we compare its performance on two PBMC datasets, PBMC-4k merged and PBMC-12k merged. The experimental results demonstrate that SingleRMEA obtains superior performance over current state-of-the-art methods. They also demonstrate that SingleRMEA enables the construction of cell type classifiers that can be applied directly to other new single-cell RNA-seq data. SingleRMEA is written in MATLAB and is available at https://github.com/lixt314/SingleRMEA.

Data

We collected the immune cell single-cell RNA-seq profiles of the human datasets from Schelker et al. (2017). In particular, two PBMC-merged data sources are considered, obtained by integrating PBMC data with melanoma patient samples and ovarian cancer ascites samples. The PBMC data include PBMC-4k and PBMC-12k. The first dataset (PBMC-4k merged) includes the 4k PBMCs (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k) from a healthy donor, which contain 4000 single cells sequenced on an Illumina HiSeq 4000 with approximately 87,000 reads per cell (Zheng et al., 2017), together with 4645 tumor-derived single cells from 19 melanoma patient samples and 3114 single cells from four ovarian cancer ascites samples. The second dataset (PBMC-12k merged) contains the 4k PBMCs and the 8k PBMCs (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k) from a healthy donor, the melanoma patient samples, and the four ovarian cancer ascites samples. The 8k PBMC dataset has 8,381 detected cells, sequenced on an Illumina HiSeq 4000 with approximately 92,000 reads per cell (Zheng et al., 2017). Each reported performance measure is the average of the values obtained by 10-fold cross-validation.

Load data

SingleRMEA accepts as input a matrix of raw gene counts with genes as rows and cells as columns. The snippets in this README hold the data in MATLAB tables whose row names are the gene names (for example, data1.Properties.RowNames).
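As a minimal illustration of this layout, the sketch below builds a toy input, assuming the counts are held in a MATLAB table with gene names as row names (which is what the later use of `Properties.RowNames` suggests). The gene names, cell names, and counts are placeholders, not values from the PBMC data.

```matlab
% Toy raw-count matrix: 4 genes (rows) x 3 cells (columns).
% All names and values are illustrative placeholders.
counts = [ 5 0 2;
           0 3 1;
          10 8 7;
           1 0 0];
geneNames = {'CD3D','CD19','ACTB','CD14'};
cellNames = {'cell_1','cell_2','cell_3'};

% Wrap the matrix in a table so that genes can be addressed through
% Properties.RowNames and the numeric block through data{:,:}.
data = array2table(counts, 'RowNames', geneNames, 'VariableNames', cellNames);

disp(size(data));                  % genes x cells
disp(data.Properties.RowNames');   % gene names
```

With this layout, `data{:,:}` returns the numeric genes-by-cells matrix and `data{'ACTB','cell_2'}` returns a single count.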

Normalization by housekeeping genes

Importantly, merging the different samples introduces batch effects, which can decrease statistical power. To address this problem and to minimize platform-dependent errors, we normalize the data to the expression of housekeeping genes. In this study, we employed 3804 housekeeping genes (HK_genes.mat) to normalize the single-cell RNA-seq data.

% restrict to common genes
[~, ia, ib] = intersect(data1.Properties.RowNames,data2.Properties.RowNames);
data1 = data1(ia,:);
data2 = data2(ib,:);
clear ia ib;

% load house-keeping genes
load('HK_genes.mat');

% find common genes
[~, ~, ia] = intersect(hk_genes,data1.Properties.RowNames);

% convert to TPM scale
data1 = logTrafo(data1,-1);
data2 = logTrafo(data2,-1);

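% normalize to house-keeping gene expression
% hk_expr: mean house-keeping expression of every cell in both datasets (1 x nCells)
hk_expr = mean([data1{ia,:} data2{ia,:}],1);
% logical masks selecting the entries of hk_expr that belong to each dataset
id1 = [true(1,size(data1,2)) false(1,size(data2,2))];
id2 = [false(1,size(data1,2)) true(1,size(data2,2))];

% rescale each cell so that its house-keeping mean matches the overall mean
data1{:,:} = bsxfun(@times,data1{:,:},mean(hk_expr)./hk_expr(id1));
data2{:,:} = bsxfun(@times,data2{:,:},mean(hk_expr)./hk_expr(id2));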

% convert back to log scale
data1 = logTrafo(data1,1);
data2 = logTrafo(data2,1);
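The scaling step can be seen in isolation on a toy matrix. The sketch below is a simplified illustration under stated assumptions: it works on a plain linear-scale matrix rather than on tables, it skips the `logTrafo` conversions (that helper is not reproduced here), and the names `X`, `hkRows`, and `Xnorm` as well as all values are made up.

```matlab
% Toy linear-scale expression: 4 genes x 3 cells (made-up values).
X = [10 20 40;
      5 10 20;
      2  8  4;
      1  3  9];
hkRows = [1 2];                      % assumed rows of the house-keeping genes

% Mean house-keeping expression of every cell (1 x nCells).
hkExpr = mean(X(hkRows, :), 1);

% Per-cell scale factor: overall house-keeping mean / this cell's mean.
scale = mean(hkExpr) ./ hkExpr;

% Multiply every column (cell) by its scale factor.
Xnorm = bsxfun(@times, X, scale);

% After scaling, all cells share the same mean house-keeping expression.
disp(mean(Xnorm(hkRows, :), 1));
```

Scaling each cell by one factor leaves the ratios between genes within that cell unchanged; only the cell-to-cell offsets are removed.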

Dimensionality Reduction (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets.

We obtained the t-SNE code from https://lvdmaaten.github.io/tsne/.

Clustering

In the first phase, t-SNE is applied to identify similar cells. After that, since the cell labels and the number of clusters are available, different clustering algorithms, including density-based spatial clustering of applications with noise (DBSCAN) and CDP, are employed for clustering based on the t-SNE mapping of the PBMC-4k merged data.

run cluster_dp.m
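For readers unfamiliar with density-peak clustering, the following is a minimal sketch of the idea on a toy 2-D embedding. It is not the contents of cluster_dp.m: the data, the cutoff `dc`, the Gaussian density kernel (chosen here to avoid ties), and the fixed number of centres `K` are all assumptions made for illustration.

```matlab
% Toy 2-D embedding: two well-separated groups of 5 points (made-up data).
Y = [0 0; 0.1 0; 0 0.1; 0.1 0.1; 0.05 0.05;
     5 5; 5.1 5; 5 5.1; 5.1 5.1; 5.05 5.05];
n = size(Y, 1);

% Pairwise Euclidean distances.
dx = bsxfun(@minus, Y(:,1), Y(:,1)');
dy = bsxfun(@minus, Y(:,2), Y(:,2)');
D = sqrt(dx.^2 + dy.^2);

% Local density rho with a Gaussian kernel of width dc (self term removed).
dc = 0.2;                                 % assumed cutoff distance
rho = sum(exp(-(D ./ dc).^2), 2) - 1;

% delta: distance to the nearest point of higher density.
[~, order] = sort(rho, 'descend');
delta = zeros(n, 1);
nearestHigher = zeros(n, 1);
delta(order(1)) = max(D(order(1), :));    % convention for the densest point
for k = 2:n
    i = order(k);
    [delta(i), j] = min(D(i, order(1:k-1)));
    nearestHigher(i) = order(j);
end

% Centres: the K points with the largest rho * delta.
K = 2;                                    % assumed number of clusters
[~, byGamma] = sort(rho .* delta, 'descend');
centres = byGamma(1:K);

% Assign the remaining points, in order of decreasing density, to the
% cluster of their nearest higher-density neighbour.
labels = zeros(n, 1);
labels(centres) = 1:K;
for k = 1:n
    i = order(k);
    if labels(i) == 0
        labels(i) = labels(nearestHigher(i));
    end
end
disp(labels');
```

On real t-SNE output, the cutoff and the choice of centres are usually made by inspecting the decision graph of rho against delta rather than by fixing `K` in advance.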

Marker genes

We select the marker genes from the CellMarker database (http://biocc.hrbmu.edu.cn/CellMarker/ or http://bio-bigdata.hrbmu.edu.cn/CellMarker/), which aims to provide a comprehensive and accurate resource of cell markers for various cell types in human tissues.

[celltype_matrix, celltype_expression, cellnames, celltype] = celltype_markers(data,tsneX,id_celltype,thresh,id)

genes = {'Unknown',{},{},{};...
'T cells',{'CD3D','CD3E','CD3G','CD27','CD28'},{},{};...
'CD4+ T cells',{'CD4'},{},{};...
'CD8+ T cells',{'CD8B'},{},{};...
'regulatory T cells',{'FOXP3','IL2RA','CD4','CTLA4'},{},{};...
'B cells',{'CD19','MS4A1','CD79A','CD79B','BLNK'},{},{};...
'Macrophages/Monocytes',{'CD14','CD68','CD163','CSF1R','FCGR3A'},{},{};...
'Dendritic cells',{'IL3RA','CLEC4C','NRP1'},{},{};...
'Natural killer cells',{'FCGR3A','FCGR3B','NCAM1','KLRB1','KLRB1','KLRC1','KLRD1','KLRF1','KLRK1'},{},{};...
'Endothelial cells',{'VWF','CDH5','SELE'},{},{};...
'Cancer associated fibroblasts',{'FAP','THY1','COL1A1','COL3A1'},{},{};...
'Ovarian carcinoma cells',{'WFDC2','EPCAM','MCAM'},{},{};...
'Melanoma cells',{'PMEL','MLANA','TYR','MITF'},{},{}};
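To show how a marker list of this kind can be turned into cell type calls, here is a small hypothetical sketch. It is not the implementation of `celltype_markers`: the scoring rule (mean marker expression per cell), the meaning given to `thresh`, and the toy expression values are assumptions made for illustration only.

```matlab
% Toy log-expression matrix: 4 genes x 3 cells (made-up values).
geneNames = {'CD3D','CD3E','CD19','MS4A1'};
expr = [3 0 0;
        2 0 1;
        0 4 0;
        0 3 0];

% Two marker sets, using a subset of the genes from the list above.
markers = {'T cells', {'CD3D','CD3E'};
           'B cells', {'CD19','MS4A1'}};

% Score every cell for every type by its mean marker expression.
nTypes = size(markers, 1);
score = zeros(nTypes, size(expr, 2));
for t = 1:nTypes
    rows = ismember(geneNames, markers{t, 2});
    score(t, :) = mean(expr(rows, :), 1);
end

% Call the best-scoring type per cell; weak calls become 'Unknown'.
thresh = 1;                               % assumed score cut-off
[best, idx] = max(score, [], 1);
assigned = markers(idx, 1)';
assigned(best < thresh) = {'Unknown'};
disp(assigned);
```

Here the third cell expresses only one T cell marker weakly, so its best score falls below the cut-off and it is left as 'Unknown'.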

Classify

Most previous methods employ only a single supervised classification method to identify the cell types. However, it is unlikely that any one supervised classification method is the best across all single-cell RNA-seq datasets. Each method has its own strengths and weaknesses, and different methods are expected to perform differently on different single-cell RNA-seq datasets. Therefore, we develop a new ensemble construction method to infer novel cell types from single-cell RNA-seq data. Multiobjective optimization is applied to the stacking ensemble construction process to generate platform-specific configurations. Furthermore, we demonstrate the capabilities of our framework on the PBMC data.
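The core notion behind multiobjective selection of ensemble configurations is Pareto dominance. The sketch below filters a set of candidate configurations down to the non-dominated ones. It is a generic illustration, not the optimizer used by SingleRMEA: the two objectives (classification error and number of base classifiers, both minimized) and the values in `F` are assumptions, since the objective functions are not specified here.

```matlab
% Each row is one candidate ensemble configuration scored on two objectives,
% both to be minimized: [error, number of base classifiers] (assumed).
F = [0.10 5;
     0.12 3;
     0.10 4;
     0.20 2;
     0.15 4];

n = size(F, 1);
isDominated = false(n, 1);
for i = 1:n
    for j = 1:n
        % j dominates i if it is no worse on both objectives
        % and strictly better on at least one.
        if j ~= i && all(F(j, :) <= F(i, :)) && any(F(j, :) < F(i, :))
            isDominated(i) = true;
            break;
        end
    end
end

% Indices of the non-dominated (Pareto-optimal) configurations.
paretoIdx = find(~isDominated);
disp(F(paretoIdx, :));
```

The remaining rows are trade-offs: none of them can be improved on one objective without getting worse on the other.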

[predicted_celltype] = Ensemble_Classify(data,training_celltype)

Compare the predicted cell types with the true labels.
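One simple way to make this comparison is to compute the overall accuracy and a confusion matrix. The sketch below assumes that both label sets are cell arrays of strings of equal length; the labels themselves are made up, and the variable names are illustrative rather than taken from the repository.

```matlab
% Made-up true and predicted labels for six cells.
trueLabels = {'T cells','T cells','B cells','B cells','NK','NK'};
predLabels = {'T cells','B cells','B cells','B cells','NK','T cells'};

% Overall accuracy: fraction of cells whose predicted label matches.
accuracy = mean(strcmp(trueLabels, predLabels));

% Map labels to class indices (classes are sorted alphabetically).
classes = unique([trueLabels predLabels]);
[~, trueIdx] = ismember(trueLabels, classes);
[~, predIdx] = ismember(predLabels, classes);

% Confusion matrix: rows are true classes, columns are predicted classes.
nC = numel(classes);
confusion = accumarray([trueIdx(:) predIdx(:)], 1, [nC nC]);

fprintf('accuracy = %.3f\n', accuracy);
disp(classes);
disp(confusion);
```

Under 10-fold cross-validation, the same quantities would be computed once per fold and then averaged.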
