A machine learning workflow that predicts gene regulon membership from promoter sequence features, focusing on top-down regulons derived from an Independent Component Analysis (ICA) of the PRECISE E. coli RNA-seq database.
To learn about ICA, how ICA components are computed, and what they can tell you, please visit https://imodulondb.org/about.html
- Generate SigmaFactor PSSMs
- Feature Matrix Generation (This generates a ~200MB file necessary for machine learning)
- Feature Engineering
- Machine learning: model training and hyperparameter optimization
- ArcA direct repeat motifs to improve model performance
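As a rough illustration of the first step, a position-specific scoring matrix (PSSM) can be built from aligned binding sites and slid along a promoter to find the best-scoring window. The sketch below is not the code used in this repository: the function names (`build_pssm`, `score_window`, `best_match`), the toy -10 box sites, the uniform background, and the pseudocount of 1 are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Toy aligned sites resembling a sigma70 -10 box (made up for illustration).
TOY_SITES = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]


def build_pssm(sites, background=None, pseudocount=1.0):
    """Return one dict per position mapping base -> log2 odds score."""
    if not sites:
        raise ValueError("need at least one aligned site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # Assumption: uniform background unless the caller supplies one.
    background = background or {base: 0.25 for base in ALPHABET}
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(site[pos].upper() for site in sites)
        column = {}
        for base in ALPHABET:
            # The pseudocount keeps unseen bases from scoring -infinity.
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background[base])
        pssm.append(column)
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores for a window as wide as the PSSM."""
    return sum(column[base] for column, base in zip(pssm, window.upper()))


def best_match(pssm, sequence):
    """Return (start, score) of the highest-scoring window, or None."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = score_window(pssm, sequence[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best


pssm = build_pssm(TOY_SITES)
start, score = best_match(pssm, "GGCGTATAATGCG")
print(f"best window starts at {start} with score {score:.2f}")
```

The best score and its position are the kind of per-promoter values that could feed into a feature matrix. This sketch only handles A, C, G and T; ambiguous bases such as N would need separate handling.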
The workflow depends on:
- bitome: https://github.com/SBRG/bitome
- pymodulon: https://github.com/SBRG/pymodulon
- DNAshapeR: https://github.com/TsuPeiChiu/DNAshapeR
- scikit-learn: https://scikit-learn.org/stable/
- seaborn (statistical data visualization): https://seaborn.pydata.org/index.html
Recommended package versions are:
Python==3.8
seaborn==0.12.2
numpy==1.24.3
matplotlib==3.7.1
pandas==1.5.3
biopython==1.78
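One way to compare an environment against the versions above is a small check with the standard-library `importlib.metadata` module (available from Python 3.8). This is an optional sketch, not part of the workflow; the helper name `find_mismatches` is hypothetical, and the pins are copied from the list above.

```python
from importlib import metadata

# Recommended pins copied from the list above (Python itself is not checked here).
RECOMMENDED = {
    "seaborn": "0.12.2",
    "numpy": "1.24.3",
    "matplotlib": "3.7.1",
    "pandas": "1.5.3",
    "biopython": "1.78",
}


def find_mismatches(recommended):
    """Return {package: (wanted, installed)} for packages that differ.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(RECOMMENDED).items():
        print(f"{name}: recommended {wanted}, found {installed or 'not installed'}")
```

A different installed version does not necessarily break the workflow; the list only records the recommended versions.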
Qiu, S., Lamoureux, C., Akbari, A., Palsson, B. O., & Zielinski, D. C. (2022). Quantitative sequence basis for the E. coli transcriptional regulatory network. https://doi.org/10.1101/2022.02.20.481200