Hello! This is the solution by me and Argusmocny, members of the Laboratory of Structural Bioinformatics, to the Stanford Ribonanza RNA Folding competition. It is based on transformer models with an overlapping window and placed in the top 9% of the competition submissions (bronze medal awarded).
We used the PyTorch library to create our solution. The code in the repo is divided into several parts:
- `network` - contains the models in the `models` directory, with the `transformer_ss.py` file being the main model file. The `dataloader` directory contains the dataloader used for the study, and `transwindpred_overlap.py` is the predictor used for the final submission. There are also some other auxiliary files used for training and submission.
- `scripts` - contains many auxiliary scripts used for training, submission, and data preprocessing.
- `tests` - contains some tests for the `network` module, used mainly at the beginning for network construction and pipeline testing.
We included two additional model-related directories:
- `hparams` - contains the hyperparameters used for training the model at various stages, in the form of `.yml` files.
- `data` - contains the awarded model.
There are also some other files and directories that are not included in the repository, such as the training data directory, which contains the data used for training and submission (about 80 GB in total).
Model | Public score (best) | Private score (best) | Date |
---|---|---|---|
LSTM | 0.2722 | 0.2680 | Early October |
Basic transformer | 0.1738 | 0.20476 | Late October |
Transformer with secondary structure (SS) features added | 0.1601 | 0.18718 | Early November |
Transformer with SS, 4-fold CV and data error bias (DEB) | 0.1651 | 0.1854 | Mid November |
Transformer with SS, DEB, sliding window and 4-fold CV | 0.16231 | 0.1626 | Late November |
Transformer with SS, DEB, overlapping sliding window and 4-fold CV (**BEST ONE**) | 0.16319 | 0.16266 | Early December |
Transformer with SS, DEB, overlapping sliding window, 4-fold CV and transfer learning (**FINAL**) | 0.16865 | 0.16535 | Moments before deadline |
- LSTM: We started with a simple LSTM model to build an automated pipeline for training and prediction, with the help of the Weights & Biases API.
- Basic transformer: We then moved to a basic transformer model and solved an invalid masking issue that had a deleterious impact on accuracy.
- Transformer with secondary structure (SS) features added: Afterwards we added secondary structure features calculated from the auxiliary data provided along with the main training dataset.
- Transformer with SS, 4-fold CV and data error bias (DEB): The next step was the application of a data error bias, as each experimentally measured value came with an error estimate. We also added 4-fold cross-validation based on sequence clustering with the `cd-hit` algorithm (70% identity per cluster). Moreover, we excluded the part of the data whose SN_filter boolean was set to False.
- Transformer with SS, DEB and sliding window: Then we added a sliding window with a constant size of 100 bp.
- Transformer with SS, DEB, overlapping sliding window and 4-fold CV: Finally, we upgraded the sliding window to an overlapping one, which significantly enhanced the generalisation capabilities and hugely increased the training data size (a minimal sketch of the windowing idea is given after this list).
- Transformer with SS, DEB, overlapping sliding window, 4-fold CV and transfer learning: We also tried transfer learning and played with hyperparameters, but it did not improve the model.
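The sketch below illustrates the two preprocessing ideas mentioned above: splitting long sequences into overlapping fixed-size windows, and turning per-position experimental errors into loss weights (data error bias). It is only an illustration, not the code from the `dataloader` directory: the function names, the stride of 50, and the inverse-error weighting are assumptions made for the example; the window size of 100 comes from the description above.

```python
import numpy as np

def overlapping_windows(seq_len, window=100, stride=50):
    """Return (start, end) index pairs covering a sequence with fixed-size,
    overlapping windows; the last window is shifted back so it stays in range."""
    if seq_len <= window:
        return [(0, seq_len)]
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)
    return [(s, s + window) for s in starts]

def error_weights(reactivity_error, eps=1e-3):
    """One plausible data-error-bias weighting (an assumption, not the exact
    formula used in training): positions measured with a larger experimental
    error contribute less to the loss."""
    err = np.asarray(reactivity_error, dtype=np.float32)
    return 1.0 / (err + eps)

# A 177 nt sequence becomes three overlapping 100 nt training examples,
# so most positions are seen in two different local contexts.
print(overlapping_windows(177))  # [(0, 100), (50, 150), (77, 177)]
```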
The `SecStructModel`, found at `network/models/transformer_ss.py`, is a Transformer-based neural network used for reactivity prediction. The model is designed to take in a sequence of RNA bases and output an experiment-wise reactivity prediction for each base.
- Sinusoidal positional embedding: The model uses sinusoidal positional embeddings to capture the sequential nature of the RNA data.
- Embeddings: The model uses separate embeddings for the sequence and the secondary structure data, which are then concatenated.
- Transformer: The model uses a Transformer encoder to process the input data, providing the ability to handle long-range dependencies (see the sketch after this list).
- Output: The model outputs a single value for each position in the sequence, representing a reactivity value corresponding to the DMS/2A3 experimental value at that position.
- Sliding window: The model uses a sliding window approach to process the input data, which allows it to capture local dependencies.
- Data error bias: The model uses a data error bias to account for the error in the experimental measurements.
- 4-fold cross-validation: The model uses 4-fold cross-validation to improve generalization and reduce overfitting. Predictions are averaged over the 4 folds to produce the final output (in total 8 predictions are made, 4 for each experiment).
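The actual `SecStructModel` lives in `network/models/transformer_ss.py`; the snippet below is only a minimal, self-contained sketch of the components listed above, under assumptions of ours. The class name `ReactivityTransformer`, the vocabulary sizes, and all hyperparameters are illustrative, not the trained configuration.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sin/cos positional encoding added to the token embeddings."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class ReactivityTransformer(nn.Module):
    """Simplified stand-in for SecStructModel: separate sequence and secondary
    structure embeddings are concatenated, passed through a Transformer encoder,
    and projected to one reactivity value per experiment (DMS, 2A3) per base."""
    def __init__(self, n_bases=5, n_ss=8, d_model=128, n_heads=8, n_layers=6):
        super().__init__()
        self.seq_emb = nn.Embedding(n_bases, d_model // 2)   # A/C/G/U + padding (assumed vocab)
        self.ss_emb = nn.Embedding(n_ss, d_model // 2)        # secondary structure symbols (assumed vocab)
        self.pos = SinusoidalPositionalEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)        # one value per experiment

    def forward(self, seq_ids, ss_ids, pad_mask=None):
        x = torch.cat([self.seq_emb(seq_ids), self.ss_emb(ss_ids)], dim=-1)
        x = self.pos(x)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.head(x)                      # (batch, seq_len, 2)

# Fold-averaged inference: each of the 4 CV models predicts both experiments,
# and the 8 per-experiment predictions are averaged 4-by-4 into the final output:
#   preds = torch.stack([m(seq_ids, ss_ids) for m in fold_models]).mean(dim=0)
```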
We planned to use GNNs (Graph Neural Networks) to solve the problem, but we did not have enough time to implement them. We believe that a GNN could be a good fit, as it could capture the long-range dependencies in the data and provide better predictions. Nevertheless, the architecture of the GNN model was partially prepared, although not used here - we are already developing it for another scientific project. You can find the GNN pre-model in the `network/models` directory.
*(Figure: example pipeline)*