Stanford Ribonanza RNA Folding competition - transformer model with overlapping window solution

Hello! This is the solution by me and Argusmocny, members of the Laboratory of Structural Bioinformatics, to the Stanford Ribonanza RNA Folding competition. It is based on transformer models with an overlapping window applied and placed in the top 9% of competition submissions (bronze medal awarded).

We used the PyTorch library to build our solution. The code in the repository is divided into several parts:

  • network - contains the models in the models directory, with transformer_ss.py being the main model file. The dataloader directory contains the dataloader used for the study, and transwindpred_overlap.py is the predictor used for the final submission. There are also some other auxiliary files used for training and submission.
  • scripts - contains many auxiliary scripts used for training, submission, and data preprocessing.
  • tests - contains tests for the network module, used mainly at the beginning for network construction and pipeline testing.

We included two additional model-related directories:

  • hparams - contains the hyperparameters used for training the model at various stages, in the form of .yml files.
  • data - contains the model that was awarded the bronze medal.

Some other files and directories are not included in the repository, such as the training directory, which contains the data used for training and submission (about 80 GB in total).

Model history and results

| Model | Public score (best) | Private score (best) | Date |
| --- | --- | --- | --- |
| LSTM | 0.2722 | 0.2680 | Early October |
| Basic transformer | 0.1738 | 0.20476 | Late October |
| Transformer with secondary structure (SS) features added | 0.1601 | 0.18718 | Early November |
| Transformer with SS, 4-fold CV and data error bias (DEB) | 0.1651 | 0.1854 | Mid November |
| Transformer with SS, DEB, sliding window and 4-fold CV | 0.16231 | 0.1626 | Late November |
| Transformer with SS, DEB, overlapping sliding window and 4-fold CV (BEST ONE) | 0.16319 | 0.16266 | Early December |
| Transformer with SS, DEB, overlapping sliding window, 4-fold CV and transfer learning (FINAL) | 0.16865 | 0.16535 | Moments before deadline |

Model development

  1. LSTM: We started with a simple LSTM model to build an automated pipeline for training and prediction, with experiment tracking via the Weights & Biases API.
  2. Basic transformer: We then moved to a basic transformer model and fixed an invalid masking issue that had a deleterious impact on accuracy.
  3. Transformer with secondary structure (SS) features added: Afterwards we added secondary structure features calculated from the auxiliary data provided alongside the main training dataset.
  4. Transformer with SS, 4-fold CV and data error bias (DEB): The next step applied a data error bias, since each experimentally measured value comes with an associated error. We also added 4-fold cross-validation based on sequence clustering with the cd-hit algorithm (70% identity per cluster), and we excluded the part of the data whose SN_filter boolean was set to False.
  5. Transformer with SS, DEB and sliding window: We then added a sliding window with a constant size of 100 bp.
  6. Transformer with SS, DEB, overlapping sliding window and 4-fold CV: Finally, we made the sliding windows overlap, which significantly improved generalisation and greatly increased the amount of training data (a minimal sketch of the windowing follows this list).
  7. Transformer with SS, DEB, overlapping sliding window, 4-fold CV and transfer learning: We also tried transfer learning and tuned hyperparameters, but it did not improve the model.
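
The overlapping-window idea can be illustrated with a short sketch. The window size, overlap, and function name below are illustrative assumptions, not the exact code from the scripts directory.

```python
def overlapping_windows(sequence: str, window: int = 100, overlap: int = 50):
    """Yield (start, chunk) pairs covering the sequence with overlapping windows."""
    stride = window - overlap
    for start in range(0, max(len(sequence) - overlap, 1), stride):
        yield start, sequence[start:start + window]

# Example: a 208-nt sequence yields windows starting at 0, 50, 100, 150,
# so every position away from the ends is seen by two windows.
for start, chunk in overlapping_windows("ACGU" * 52, window=100, overlap=50):
    print(start, len(chunk))
```

At prediction time, the per-position outputs of windows covering the same position can be combined, for example by averaging (an assumption here; the repository's transwindpred_overlap.py handles the actual merging).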

Final model description

The SecStructModel, found at network/models/transformer_ss.py, is a Transformer-based neural network used for reactivity prediction. The model takes in a sequence of RNA bases and outputs an experiment-wise reactivity prediction for each base.

Key Features

  1. Sinusoidal Positional Embedding: The model uses sinusoidal positional embeddings to capture the sequential nature of the RNA data (a minimal architecture sketch follows this list).

  2. Embeddings: The model uses separate embeddings for sequence and secondary structure data, which are then concatenated.

  3. Transformer: The model uses a Transformer encoder for processing the input data. The Transformer provides the ability to handle long-range dependencies in the data.

  4. Output: The model outputs a single value for each position in the sequence, representing the predicted reactivity corresponding to the DMS/2A3 experimental value at that position.

  5. Sliding window: The model uses a sliding window approach to process the input data, which allows the model to capture local dependencies in the data.

  6. Data error bias: The model uses a data error bias to account for the error in the experimental data.

  7. 4-fold cross-validation: The model uses 4-fold cross-validation to improve generalization and reduce overfitting. Predictions are averaged over the 4 folds to produce the final output. (In total 8 predictions are made, 4 for each experiment).
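
To make these features concrete, here is a minimal PyTorch sketch of such an architecture. All layer names, sizes, vocabulary dimensions, and the error-weighted loss are illustrative assumptions; the actual implementation is in network/models/transformer_ss.py.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sinusoidal positional encoding (feature 1)."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class ReactivityTransformer(nn.Module):
    """Sketch of a SecStructModel-like network: separate sequence and
    secondary-structure embeddings (feature 2), a Transformer encoder
    (feature 3), and a per-position reactivity head (feature 4)."""
    def __init__(self, n_bases=5, n_ss_tokens=8, d_model=128, n_heads=8, n_layers=4):
        super().__init__()
        self.seq_emb = nn.Embedding(n_bases, d_model // 2)
        self.ss_emb = nn.Embedding(n_ss_tokens, d_model // 2)
        self.pos = SinusoidalPositionalEmbedding(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # one reactivity value per position

    def forward(self, seq_ids, ss_ids, pad_mask=None):
        x = torch.cat([self.seq_emb(seq_ids), self.ss_emb(ss_ids)], dim=-1)
        x = self.encoder(self.pos(x), src_key_padding_mask=pad_mask)
        return self.head(x).squeeze(-1)  # (batch, seq_len)

def error_weighted_mae(pred, target, target_error):
    """Illustrative data-error-bias loss (feature 6): positions measured with
    a larger experimental error contribute less to the loss."""
    weights = 1.0 / (1.0 + target_error)
    return (weights * (pred - target).abs()).mean()
```

A 4-fold ensemble (feature 7) would then load one trained checkpoint per fold, run this forward pass for each, and average the per-position predictions for each experiment.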

Side note

We planned to use GNNs (Graph Neural Networks) for this problem, but we did not have enough time to implement them. We believe a GNN could be a good fit, as it could capture the long-range dependencies in the data and yield better predictions. Nevertheless, the GNN architecture was partially prepared, though not applied here; we are already developing it for another scientific project. You can find the GNN pre-model in the network/models directory.

TODO

[ ]: example pipeline
