This repo contains the plots and datasets I created and used when analyzing the output of the Facebook ESM 1v models on the T4-Lysozyme protein. The T4_analysis.ipynb notebook contains most of my work generating predictions with the 5 ESM 1v models. The notebook compares the models' predictions with experimental data by calculating the Spearman correlation between the log(softmax) of each model's output and the recorded ΔΔG (the change in the Gibbs free energy change) for every mutation. A strong correlation between these two variables would suggest that the model can accurately predict beneficial protein mutations without costly and time-consuming deep mutational scanning (DMS) studies. The results from the ESM protein language models were also compared to an approach based on protein structure, to better determine whether giving protein models structural information improves their performance.
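The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the model output is an array of per-position logits over the 20 standard amino acids, and the names `AMINO_ACIDS`, `score_mutations`, `logits`, `mutations`, and `ddg` are hypothetical. The numbers are random placeholders, not results.

```python
import numpy as np
from scipy.special import log_softmax
from scipy.stats import spearmanr

# Assumed column order of the logits (hypothetical, for illustration only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score_mutations(logits, mutations):
    """Return the log-softmax score of the mutant residue for each mutation.

    logits:    array of shape (sequence_length, 20), one row per position.
    mutations: list of (zero_based_position, mutant_amino_acid) tuples.
    """
    log_probs = log_softmax(logits, axis=-1)  # normalize each position
    return np.array(
        [log_probs[pos, AMINO_ACIDS.index(aa)] for pos, aa in mutations]
    )


# Placeholder inputs: random logits and random "experimental" values.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 20))
mutations = [(0, "A"), (1, "G"), (2, "L"), (3, "K"), (4, "W")]
ddg = rng.normal(size=len(mutations))

scores = score_mutations(logits, mutations)

# Spearman correlation compares the rank order of the two variables.
rho, p_value = spearmanr(scores, ddg)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

Because Spearman correlation depends only on rank order, it does not require the model scores and the experimental values to be on the same scale.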
T4_analysis.py contains the same code as the notebook, but in the form of a Python script. I had hoped to turn the notebook into a script that could accept arguments and run from the command line, but this is currently incomplete. prot_analysis.yml lists the packages required to run T4_analysis.ipynb. The plots folder contains all plots output by the notebook, while the dms folder contains the raw output from each model saved as a CSV file. The data_and_dms folder contains the experimental data that the ESM models were compared against, saved as a CSV file, as well as several other CSV files that contain both the data and the model output at various levels of detail. The T4_analysis notebook describes the level of detail in each of these CSV files whenever one of them is read or saved.
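As a rough idea of what the unfinished command-line interface could look like, here is a minimal argparse sketch. None of this exists in T4_analysis.py: the flag names (`--experimental-csv`, `--dms-dir`, `--plots-dir`) and the example file name are hypothetical, and only the default folder names are taken from the layout described above.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI for the analysis script."""
    parser = argparse.ArgumentParser(
        description="Compare ESM 1v predictions with experimental data."
    )
    # Hypothetical flag: CSV file holding the experimental measurements.
    parser.add_argument("--experimental-csv", required=True)
    # Hypothetical flags: folders for raw model output and for plots.
    parser.add_argument("--dms-dir", default="dms")
    parser.add_argument("--plots-dir", default="plots")
    return parser


# Parse an explicit argument list so the example runs without a shell.
# "data_and_dms/example.csv" is a placeholder, not a real file in the repo.
args = build_parser().parse_args(
    ["--experimental-csv", "data_and_dms/example.csv"]
)
print(args.experimental_csv, args.dms_dir, args.plots_dir)
```

In a real script, `parse_args()` would be called without arguments so that the values come from the command line.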
Lastly, the Sars_Cov_2 folder contains a similar analysis performed on SARS-CoV-2 antibodies. The plots and data in this directory are similar in kind to those in the directories above, but they are much larger and, because of the size of the antibodies, have not been reviewed as thoroughly.