Gregory Way, 2018
Gene expression data compression reveals coordinated gene expression modules that describe important biology.
In the following analysis, we apply five different compression algorithms to three different gene expression datasets. We sequentially compress input data across different bottleneck dimensions (k). We save the population of all models, for each algorithm, across k for downstream analyses.
We compress gene expression data with the following algorithms:
Algorithm | Implementation |
---|---|
Principal Components Analysis (PCA) | sklearn |
Independent Components Analysis (ICA) | sklearn |
Non-Negative Matrix Factorization (NMF) | sklearn |
Analysis of Denoising Autoencoders for Gene Expression (ADAGE) | tybalt.models.Adage |
Variational Autoencoder (VAE; Tybalt) | tybalt.models.Tybalt |
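As a minimal sketch of the sweep across k, the following fits the three sklearn algorithms from the table at several bottleneck dimensions. The function name `compress_across_k`, the synthetic input matrix, the k values, and the hyperparameters (seed, iteration limits, NMF initialization) are illustrative assumptions, not the settings used in this module. ADAGE and Tybalt are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF


def compress_across_k(x, k_list, seed=0):
    """Fit PCA, ICA, and NMF at each bottleneck dimension k.

    Returns {(algorithm, k): (weights, latent)} where weights has shape
    (k, n_genes) and latent has shape (n_samples, k).
    """
    results = {}
    for k in k_list:
        # Hypothetical settings chosen only so the sketch runs quickly
        models = {
            "pca": PCA(n_components=k, random_state=seed),
            "ica": FastICA(n_components=k, random_state=seed, max_iter=1000),
            "nmf": NMF(n_components=k, init="nndsvda", random_state=seed,
                       max_iter=500),
        }
        for name, model in models.items():
            latent = model.fit_transform(x)
            results[(name, k)] = (model.components_, latent)
    return results


# Synthetic samples-by-genes matrix scaled to [0, 1); NMF needs non-negative input
rng = np.random.default_rng(0)
x = rng.random((60, 20))
results = compress_across_k(x, k_list=[2, 3, 4])
```

Each entry of `results` holds one model's weight matrix and latent space matrix, keyed by algorithm and k.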
We will evaluate the solutions across the ensemble population over all k dimensions. For each population, we will also track performance on the training and testing sets independently. We record the following:
- Reconstruction Cost - Measures the binary cross entropy between the input data and its reconstruction
- Training History - For neural network models (ADAGE, Tybalt), save the training progress of each model
  - For Tybalt, the KL Divergence and Reconstruction Loss are saved separately
- Correlation of input sample to reconstructed sample - Measures how well individual samples pass through the bottleneck
  - Calculate Pearson and Spearman correlations
  - May reveal biases in sample reconstruction efficiency across algorithms
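The reconstruction cost and sample correlation metrics above can be sketched as follows. This is an illustration under assumptions, not the module's implementation: the helper names `reconstruction_cost` and `sample_correlations` are hypothetical, the data are synthetic, and the input is assumed to be scaled to [0, 1] so that binary cross entropy is defined.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def reconstruction_cost(x, x_hat, eps=1e-7):
    """Mean binary cross entropy between input and reconstruction."""
    # Clip to avoid log(0); assumes both matrices lie in [0, 1]
    x_hat = np.clip(x_hat, eps, 1 - eps)
    bce = -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    return float(np.mean(bce))


def sample_correlations(x, x_hat):
    """Per-sample Pearson and Spearman correlation of input vs. reconstruction."""
    pearson = np.array([pearsonr(a, b)[0] for a, b in zip(x, x_hat)])
    spearman = np.array([spearmanr(a, b)[0] for a, b in zip(x, x_hat)])
    return pearson, spearman


# Synthetic input and a noisy stand-in for a model's reconstruction
rng = np.random.default_rng(0)
x = rng.random((50, 20))
x_hat = np.clip(x + rng.normal(scale=0.1, size=x.shape), 0, 1)

cost = reconstruction_cost(x, x_hat)
pearson, spearman = sample_correlations(x, x_hat)
```

Running the same functions on the training and testing partitions separately gives the independent tracking described above.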
The populations of weight and latent space matrices are also saved for other downstream analyses.
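One way to persist a single model's matrices is sketched below. The function name `save_model_matrices`, the file naming scheme, the feature labels, and the gzipped TSV format are all assumptions made for illustration and may not match the files this module writes.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def save_model_matrices(weights, latent, gene_ids, sample_ids, out_dir, tag):
    """Write a weight matrix (genes x features) and a latent space matrix
    (samples x features) as gzipped TSV files; returns both paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical feature labels, e.g. "pca_k3_0", "pca_k3_1", ...
    features = [f"{tag}_{i}" for i in range(weights.shape[0])]
    weight_df = pd.DataFrame(weights.T, index=gene_ids, columns=features)
    latent_df = pd.DataFrame(latent, index=sample_ids, columns=features)

    weight_file = out_dir / f"{tag}_weights.tsv.gz"
    latent_file = out_dir / f"{tag}_latent.tsv.gz"
    weight_df.to_csv(weight_file, sep="\t")
    latent_df.to_csv(latent_file, sep="\t")
    return weight_file, latent_file


# Synthetic matrices for a k = 3 model with 5 genes and 4 samples
rng = np.random.default_rng(0)
weights = rng.random((3, 5))
latent = rng.random((4, 3))
weight_file, latent_file = save_model_matrices(
    weights, latent,
    gene_ids=[f"gene_{i}" for i in range(5)],
    sample_ids=[f"sample_{i}" for i in range(4)],
    out_dir=tempfile.mkdtemp(), tag="pca_k3",
)
```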
This module takes a long time to run. For convenience, we include the option to download archived and versioned results from Zenodo.
To acquire these results, perform the following:

    conda activate biobombe
    cd 2.sequential-compression
    python download-biobombe-archive.py
To rerun the analysis from scratch, perform the following:

    conda activate biobombe
    # Navigate into this module folder
    cd 2.sequential-compression
    ./analysis.sh