Hello! This is the solution by me and Argusmocny, members of the Laboratory of Structural Bioinformatics, to the Stanford Ribonanza RNA Folding competition. It is based on transformer models with an overlapping window and placed in the top 9% of the competition submissions (bronze medal awarded).
We used the PyTorch library to create our solution. The code in the repo is divided into several parts:
- `network` - contains the models in the `models` directory, with the `transformer_ss.py` file being the main model file. The `dataloader` directory contains the dataloader used for the study, and `transwindpred_overlap.py` is the predictor used for the final submission. There are also some other auxiliary files used for training and submission.
- `scripts` - contains many auxiliary scripts used for training, submission, and data preprocessing.
- `tests` - contains some tests for the `network` module, used mainly at the beginning for network construction and pipeline testing.
We included two additional model-related directories:
- `hparams` - contains the hyperparameters used for training the model at various stages, in the form of `.yml` files.
- `data` - contains the awarded model.
There are also some other files and directories that are not included in the repository, such as the training data directory, which contains the data used for training and submission (about 80 GB in total).
Model | Public score (best) | Private score (best) | Date |
---|---|---|---|
LSTM | 0.2722 | 0.2680 | Early October |
Basic transformer | 0.1738 | 0.20476 | Late October |
Transformer with secondary structure (SS) features added | 0.1601 | 0.18718 | Early November |
Transformer with SS, 4-fold CV and data error bias (DEB) | 0.1651 | 0.1854 | Mid November |
Transformer with SS, DEB, sliding window and 4-fold CV | 0.16231 | 0.1626 | Late November |
Transformer with SS, DEB, overlapping sliding window and 4-fold CV (**BEST ONE**) | 0.16319 | 0.16266 | Early December |
Transformer with SS, DEB, overlapping sliding window, 4-fold CV and transfer learning (**FINAL**) | 0.16865 | 0.16535 | Moments before deadline |
- LSTM: We started with a simple LSTM model to build an automated pipeline for training and prediction, with the help of the Weights & Biases API.
- Basic transformer: We then moved to a basic transformer model and solved an invalid masking issue that had a deleterious impact on accuracy.
- Transformer with secondary structure (SS) features added: Afterwards we added secondary structure features calculated from the auxiliary data provided along with the main training dataset.
- Transformer with SS, 4-fold CV and data error bias (DEB): The next step was the application of a data error bias, as each experimentally measured value came with an error estimate. We also added 4-fold cross-validation based on sequence clustering with the `cd-hit` algorithm (70% identity per cluster). Moreover, we excluded the part of the data whose SN_filter boolean was set to False.
- Transformer with SS, DEB and sliding window: Then we added a sliding window with a constant size of 100 bp.
- Transformer with SS, DEB, overlapping sliding window and 4-fold CV: Finally, we upgraded the sliding window to an overlapping one, which significantly enhanced the generalisation capabilities and hugely increased the training data size (a minimal sketch of the windowing idea is given after this list).
- Transformer with SS, DEB, overlapping sliding window, 4-fold CV and transfer learning: We also tried transfer learning and played with hyperparameters, but it did not improve the model.
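The sketch below illustrates the two preprocessing ideas mentioned above: splitting long sequences into overlapping fixed-size windows, and turning per-position experimental errors into loss weights (data error bias). It is only an illustration, not the code from the `dataloader` directory: the function names, the stride of 50, and the inverse-error weighting are assumptions made for the example; the window size of 100 comes from the description above.

```python
import numpy as np

def overlapping_windows(seq_len, window=100, stride=50):
    """Return (start, end) index pairs covering a sequence with fixed-size,
    overlapping windows; the last window is shifted back so it stays in range."""
    if seq_len <= window:
        return [(0, seq_len)]
    starts = list(range(0, seq_len - window + 1, stride))
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)
    return [(s, s + window) for s in starts]

def error_weights(reactivity_error, eps=1e-3):
    """One plausible data-error-bias weighting (an assumption, not the exact
    formula used in training): positions measured with a larger experimental
    error contribute less to the loss."""
    err = np.asarray(reactivity_error, dtype=np.float32)
    return 1.0 / (err + eps)

# A 177 nt sequence becomes three overlapping 100 nt training examples,
# so most positions are seen in two different local contexts.
print(overlapping_windows(177))  # [(0, 100), (50, 150), (77, 177)]
```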
The `SecStructModel`, found at `network/models/transformer_ss.py`, is a Transformer-based neural network used for reactivity prediction. The model is designed to take in a sequence of RNA bases and output an experiment-wise reactivity prediction for each base.
- Sinusoidal positional embedding: The model uses sinusoidal positional embeddings to capture the sequential nature of the RNA data.
- Embeddings: The model uses separate embeddings for the sequence and the secondary structure data, which are then concatenated.
- Transformer: The model uses a Transformer encoder to process the input data, providing the ability to handle long-range dependencies (see the sketch after this list).
- Output: The model outputs a single value for each position in the sequence, representing a reactivity value corresponding to the DMS/2A3 experimental value at that position.
- Sliding window: The model uses a sliding window approach to process the input data, which allows it to capture local dependencies.
- Data error bias: The model uses a data error bias to account for the error in the experimental measurements.
- 4-fold cross-validation: The model uses 4-fold cross-validation to improve generalization and reduce overfitting. Predictions are averaged over the 4 folds to produce the final output (in total 8 predictions are made, 4 for each experiment).
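The actual `SecStructModel` lives in `network/models/transformer_ss.py`; the snippet below is only a minimal, self-contained sketch of the components listed above, under assumptions of ours. The class name `ReactivityTransformer`, the vocabulary sizes, and all hyperparameters are illustrative, not the trained configuration.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sin/cos positional encoding added to the token embeddings."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class ReactivityTransformer(nn.Module):
    """Simplified stand-in for SecStructModel: separate sequence and secondary
    structure embeddings are concatenated, passed through a Transformer encoder,
    and projected to one reactivity value per experiment (DMS, 2A3) per base."""
    def __init__(self, n_bases=5, n_ss=8, d_model=128, n_heads=8, n_layers=6):
        super().__init__()
        self.seq_emb = nn.Embedding(n_bases, d_model // 2)   # A/C/G/U + padding (assumed vocab)
        self.ss_emb = nn.Embedding(n_ss, d_model // 2)        # secondary structure symbols (assumed vocab)
        self.pos = SinusoidalPositionalEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)        # one value per experiment

    def forward(self, seq_ids, ss_ids, pad_mask=None):
        x = torch.cat([self.seq_emb(seq_ids), self.ss_emb(ss_ids)], dim=-1)
        x = self.pos(x)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.head(x)                      # (batch, seq_len, 2)

# Fold-averaged inference: each of the 4 CV models predicts both experiments,
# and the 8 per-experiment predictions are averaged 4-by-4 into the final output:
#   preds = torch.stack([m(seq_ids, ss_ids) for m in fold_models]).mean(dim=0)
```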
We planned to use GNNs (Graph Neural Networks) to solve the problem, but we did not have enough time to implement them. We believe that a GNN could be a good fit, as it could capture the long-range dependencies in the data and provide better predictions. Nevertheless, the architecture of the GNN model was partially prepared, although not used here - we are already developing it for another scientific project. You can find the GNN pre-model in the `network/models` directory.
*(Figure: example pipeline)*