NGS has made gene analysis routine in clinics, yet many variants in genes of unknown function are labeled as variants of uncertain significance (VUSs) and do not contribute to a diagnosis of rare diseases, until further validated by experimental verification. Gene prioritization can help narrow down the list of candidate genes under consideration. .
Here we introduce single-cell tissue-specific gene prioritization using machine learning (STIGMA) to prioritize disease gene for congenital malformations. STIGMA predicts the disease-causing probability of genes based on their expression profiles across cell types, while considering the temporal dynamics during the embryogenesis of a healthy (wild-type) organism.
- Seurat
- Monocle3
- dplyr
- splines
- biomaRt
- AnnotationHub
- GenomicRanges
- BSgenome
- tidyr
- GenomicFeatures
- TxDb.Hsapiens.UCSC.hg38.knownGene
- sklearn
- imblern
- pandas
- numpy
- math
- sys
- itertools
- matplotlib
Inorder to run STIGMA we need to prepare the input matrix which consists of single cell features and gene intrinsic propersties, which can be done by following STEP1 and STEP2. STEP3 Formats the input.
- Load the Seurat object after clustering or subclustering.
- Utilize the 'AverageExpression' function to acquire cluster-specific gene expression.
- Employ the 'HVFInfo' function to compute variance in expression within each cluster.
- Determine the percentage of cells expressing the gene in each sub-cluster (PrctCellExpringGene).
- Calculate the fold-change in expression between each sub-cluster and the remaining cells (FoldChange).
- Proceed with the trajectory analysis pipeline, following Monocle's methodology for each partition/cluster.
- Bin the generated pseudo time and incorporate it as metadata into the Seurat object.
pseudo <- data.frame(pseudotime(monocleobject, reduction_method = "UMAP"))
colnames(pseudo)[1] <- 'pseudotime'
[email protected]$pseudotime<-pseudo$pseudotime[match(colnames(monocleobject), rownames(seuratobject))]
[email protected]$pseudotime.bin <- as.numeric(findInterval(seuratobject$pseudotime, quantile(seuratobject$pseudotime, seq(0,1, 1/10),na.rm=T)))
- Then, compute the average expression of all genes for each pseudo time bin column in the metadata.
avg_expr <- data.frame(AverageExpression(object=seuratobject, assays='RNA', slot='counts', group.by='pseudotime.bin'))
write.table(avg_expr,'Input_bsplines.tsv', sep='\t')
- Fit a spline by parsing Input_bsplines.tsv
Rscript featurePreprocess/Bsplines.R Input_bsplines.tsv.
- Extract gene intrinsic properties, such as promoter GC content, by executing
Rscript featurePreprocess/PromoterGC.R.
- Obtain gene constraints, including metrics like pLI, pNull, pRec, syn_Z, mis_Z, and lof_Z, for protein-coding genes from gnomAD. In the paper we have used an older version of gnomad(v.2.1.1).
- Retrieve gene GC content from Biomart.
- Once the features are obtained, they are stored as a tsv with column as features and rows as genes.
- Annotate the genes with positive(Known disease gene) and negative classes(House keeping genes).
- STIGMA gene prediction model can be optimized by parsing input.tsv
python3 model/RandomForest_optimization.py input.tsv
2. STIGMA gene prediction model to predict test genes with [input.tsv](https://github.com/SpielmannLab/STIGMA/blob/main/sample_dataset/input.tsv) by running <br />
python3 model/rf_model.py --inputfeature_matrix=input.tsv --candidategenes=candidate_genes.tsv --n_estimators=<Output from optimization> --max_depth=<Output from optimization> --min_samples_split=<Output from optimization> --min_samples_leaf=<Output from optimization> --max_features=<Output from optimization> --bootstrap=<Output from optimization> --n_neighbors=<Output from optimization>
## STEP5: STIGMA validation <br />
1. To optain the phenotypes associated with the congenital disease based on Monarch Initiative <br />
python3 validation/monarch_analysis.py
[1] Absolute paths are used at certain instances, which will need to be adapted, as needed. <br />
[2] Anaconda was used to set up the necessary environments <br />
## Citations
1. Balachandran, S., Prada-Medina, C.A., Mensah, M.A., Kakar, N., Nagel, I., Pozojevic, J., Audain, E., Hitz, M.-P., Kircher, M., Sreenivasan, V.K.A., et al. (2024). STIGMA: Single-cell tissue-specific gene prioritization using machine learning. Am J Hum Genet, S0002-9297(23)00443-3. https://doi.org/10.1016/j.ajhg.2023.12.011.