Fold U: A Protein Structure Prediction Program

Our program is the second (downstream) step of a protein structure prediction project. This step consists of threading a query sequence onto a set of given templates.

This project is part of the Meet-U 2018-2019 competition. Meet-U is a collaborative pedagogical and research initiative between several universities in the Paris area. The course is intended for second-year Master's students in Bioinformatics. For more details, please refer to http://www.meet-u.org/.

Installation

Clone the repository

git clone https://github.com/meetU-MasterStudents/Fold_U.git
cd Fold_U

Requirements

A Linux distribution.
Install the few required Python packages:

pip3 install -r requirements.txt
# This command will install the following modules:
# docopt==0.6.2
# numpy==1.15.2
# biopython==1.72
# pandas==0.23.4
# schema==0.6.8
# tqdm==4.28.1
# matplotlib==2.2.2
# m2r # for Sphinx

The R software environment. The full procedure is described here. R is used for the Machine Learning step. The required packages boot, dplyr and readr are installed automatically if they are missing, so no action is needed on your part.

sudo apt-get install r-base

Download the uniref90 database here, put it in data/databases/uniref90 and format it with the following command:

makeblastdb -in databases/uniref90.fasta -dbtype prot

Install blast-legacy with conda:

conda install -c bioconda blast-legacy

MODELLER is also required and can be installed easily with conda. You need to register here to get a license key, then follow the instructions shown during installation to insert the key into the program.

conda install -c salilab modeller

If necessary, change the paths in the header of the following scripts: bin/psipred.4.02/runpsipred and bin/psipred.4.02/runpsipred_single

Generate the PSSM files for each template

Run the following script before running fold_u:

./bin/salut_1.0/salut1.sh data/templates/ data/metafold.list

Run the program

fold_u takes as input the protein sequence of the query under study (FASTA format) and a UniRef database.

It returns the mandatory outputs:

ranking.txt file with 3 columns Rank | Template | Score
The top N pdb structures (top 10 by default)

And additional outputs:

scores.csv: this CSV file contains the complete results with all score values.
The MODELLER alignments.
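
As a rough illustration of how the mandatory ranking output could be consumed, the sketch below reads a ranking file and returns the best-ranked templates. The exact layout of ranking.txt is not specified here, so the code assumes one header line followed by three columns separated by whitespace and/or `|` characters. The function name `read_ranking` and the sample template names are hypothetical.

```python
import io


def read_ranking(handle, top_n=10):
    """Return the first top_n (rank, template, score) rows of a ranking file.

    Assumption: one header line, then three columns per row, separated by
    whitespace and/or '|' characters. The real ranking.txt layout may differ.
    """
    rows = []
    for line_number, line in enumerate(handle):
        fields = line.replace("|", " ").split()
        if line_number == 0 or len(fields) != 3:
            continue  # skip the header and any empty or malformed line
        rank, template, score = fields
        rows.append((int(rank), template, float(score)))
    rows.sort(key=lambda row: row[0])  # order by rank, best first
    return rows[:top_n]


# Hypothetical file content, for illustration only.
sample = io.StringIO(
    "Rank | Template | Score\n"
    "1 | templateA | 12.5\n"
    "2 | templateB | 9.0\n"
    "3 | templateC | 4.25\n"
)
top_two = read_ranking(sample, top_n=2)
print(top_two)
```

With a real run, the same function could be called on `open("results/Agglutinin/ranking.txt")`, provided the file matches the assumed layout.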

Toy example

./fold_u data/queries/Agglutinin/Agglutinin.fasta data/databases/uniref90/uniref90 -o results/Agglutinin

Get help

./fold_u -h

Usage:
    ./fold_u QUERY_FASTA UNIREF_DB [--nb_pdb NUM] [--nb_psiblast NUM] [--cpu NUM] [--dope FILE]
                                   [--metafold FILE] [--benchmark FILE] [--output PATH]

Arguments:
    QUERY_FASTA                           Path to the QUERY fasta sequence to predicted.
    UNIREF_DB                             Path to Uniref database.

Options:
    -h, --help                            Show this
    -p NUM, --nb_pdb NUM                  Number of pdb to create
                                          [default: 10]
    -t NUM, --nb_psiblast                 Round number for PSI-BLAST
                                          [default: 3]
    -c NUM, --cpu NUM                     Number of cpus to use for parallelisation. By default
                                          using all available (0).
                                          [default: 0]
    -d FILE, --dope FILE                  Path to the dope.par file
                                          [default: data/dope.par]
    -m FILE, --metafold FILE              Path to the metafold.list file
                                          [default: data/metafold.list]
    -b FILE, --benchmark FILE             Path to the benchmark.list file
                                          [default: data/benchmark.list]
    -o PATH, --output PATH                Path to the directory containing
                                          the result files (scores and pdb)
                                          [default: ./results]
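
When fold_u has to be launched from another Python script, the options listed above can be assembled programmatically. The following is a minimal sketch that only uses flags shown in the help text; the helper name `build_fold_u_command` is hypothetical and is not part of the repository.

```python
import shlex


def build_fold_u_command(query_fasta, uniref_db, nb_pdb=10, nb_psiblast=3,
                         cpu=0, output="./results"):
    """Assemble the fold_u argument list from options shown in the help text."""
    return [
        "./fold_u", query_fasta, uniref_db,
        "--nb_pdb", str(nb_pdb),
        "--nb_psiblast", str(nb_psiblast),
        "--cpu", str(cpu),
        "--output", output,
    ]


command = build_fold_u_command(
    "data/queries/Agglutinin/Agglutinin.fasta",
    "data/databases/uniref90/uniref90",
    nb_pdb=5,
    output="results/Agglutinin",
)
print(shlex.join(command))
# To actually launch the run from the repository root:
# import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues with paths that contain spaces.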

Run all the queries + Benchmarking

This program is also benchmarked using ROC-style plots and Top N information to evaluate the power and relevance of the different scores. The scores are generated for all queries. Each plot represents the cumulative sum of benchmarks encountered along the ranking (from rank 1 to rank 405) for each calculated score. A top N results table is also generated, showing the number of "Family", "Superfamily" and "Fold" benchmarks found in the top N ranks.

We wrote a script that runs the fold_u program for each query whose results have not yet been generated. It produces a results/plots folder containing the generated plots and prints the top N tables in the terminal.
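
The two quantities described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of scripts/benchmarking.py: the function names, the template identifiers and the benchmark annotation are all hypothetical.

```python
from itertools import accumulate


def cumulative_hits(ranked_templates, benchmarks):
    """Cumulative number of benchmark templates met from rank 1 downwards."""
    return list(accumulate(1 if t in benchmarks else 0 for t in ranked_templates))


def top_n_counts(ranked_templates, benchmark_levels, n):
    """Count, per benchmark level, how many benchmarks fall in the top n ranks."""
    counts = {"Family": 0, "Superfamily": 0, "Fold": 0}
    for template in ranked_templates[:n]:
        level = benchmark_levels.get(template)
        if level is not None:
            counts[level] += 1
    return counts


# Hypothetical ranking (best first) and benchmark annotation for one query.
ranking = ["t4", "t1", "t7", "t2", "t9", "t3"]
levels = {"t1": "Fold", "t2": "Superfamily", "t3": "Family"}

curve = cumulative_hits(ranking, levels)   # y-values of the ROC-style plot
top4 = top_n_counts(ranking, levels, n=4)  # one row of the top N table
print(curve)
print(top4)
```

Plotting `curve` against the rank gives the ROC-style curve for one score; summing `top_n_counts` over all queries gives one row of the top N table.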

./scripts/benchmarking.py data/databases/uniref90/uniref90

Get help

./scripts/benchmarking.py -h

    Usage:
        ./script/benchmarking.py UNIREF_DB [--selected_score SCORE] [--cpu NUM] [--output PATH]

    Arguments:
        UNIREF_DB                             Path to Uniref database.        

    Options:
        -h, --help                            Show this
        -s SCORE, --selected_score SCORE      Score for which you wish to see the statistics:
                                              "alignment", "threading", "modeller",
                                              "secondary_structure", "solvent_access"
                                              or "sum_scores",
                                              or all of them at once: "all" [default: all]
        -c NUM, --cpu NUM                     Number of cpus to use for parallelisation. By default
                                              using all available (0).
                                              [default: 0]
        -o PATH, --output PATH                Path to the directory containing
                                              the result files (scores and plot)
                                              [default: ./results/plots]

Machine Learning

The R script scripts/machine_learning.R uses logistic regression to find the best weights to apply to each type of score in order to optimize the benchmarking. In other words, it learns how informative each score is with respect to the benchmarks (Fold, Superfamily and Family) in order to get the most information from each.
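
To make the idea of learning one weight per score concrete, here is a minimal logistic regression fitted by batch gradient descent in plain Python. It is not a port of the project's R script: the data, the two centred feature columns and the function names are hypothetical, and the R implementation may use a different fitting procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit one weight per score plus a bias by gradient descent on the log-loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / len(features)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x):
    """Probability that a template is a benchmark, given its scores."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Hypothetical centred scores per template: [score_1, score_2].
# Label 1 marks a benchmark template, label 0 any other template.
scores = [[0.8, -0.6], [0.6, 0.0], [0.4, 0.6],
          [-0.4, -0.6], [-0.6, 0.0], [-0.8, 0.6]]
is_benchmark = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(scores, is_benchmark)
print(weights, bias)
```

In this toy data only the first score separates benchmarks from other templates, so it receives the larger weight. The fitted weights could then be used to build a weighted combined score for re-ranking.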

Benchmark results

Top N tables

Results for the benchmark done with the merged program (upstream + downstream). Templates are ranked with the combined (sum) score.

Table summarizing the top N results.

         Family    Superfamily    Fold        Total

top 5    0/1       0/6            1/13        1/19
         0.0  %    0.0  %         7.7  %      5.3  %
----------------------------------------------------
top 10   0/1       2/6            1/13        3/19
         0.0  %    33.3 %         7.7  %      15.8 %
----------------------------------------------------
top 15   0/1       2/6            1/13        3/19
         0.0  %    33.3 %         7.7  %      15.8 %
----------------------------------------------------
top 20   0/1       2/6            1/13        3/19
         0.0  %    33.3 %         7.7  %      15.8 %
----------------------------------------------------
top 25   0/1       2/6            1/13        3/19
         0.0  %    33.3 %         7.7  %      15.8 %
----------------------------------------------------
top 50   0/1       3/6            3/13        6/19
         0.0  %    50.0 %         23.1 %      31.6 %
----------------------------------------------------
top 75   0/1       4/6            5/13        9/19
         0.0  %    66.7 %         38.5 %      47.4 %
----------------------------------------------------
top 100  0/1       5/6            6/13        11/19
         0.0  %    83.3 %         46.2 %      57.9 %
----------------------------------------------------
top 150  0/1       5/6            7/13        12/19
         0.0  %    83.3 %         53.8 %      63.2 %
----------------------------------------------------
top 200  0/1       5/6            7/13        12/19
         0.0  %    83.3 %         53.8 %      63.2 %
----------------------------------------------------
top 250  0/1       6/6            8/13        14/19
         0.0  %    100.0%         61.5 %      73.7 %
----------------------------------------------------
top 300  0/1       6/6            9/13        15/19
         0.0  %    100.0%         69.2 %      78.9 %
----------------------------------------------------
top 350  1/1       6/6            12/13       19/19
         100.0%    100.0%         92.3 %      100.0%
----------------------------------------------------

Templates ranked with the weighted combined score. The program is less specific but more sensitive: with machine learning, the first benchmarks appear later in the ranking, but the complete set of benchmarks is recovered sooner (within the top 250). This matters because it enables the program to work for more kinds of proteins.

Table summarizing the top N results.

         Family    Superfamily    Fold        Total

top 5    0/1       0/6            0/13        0/20
         0.0  %    0.0  %         0.0  %      0.0  %
----------------------------------------------------
top 10   0/1       0/6            0/13        0/20
         0.0  %    0.0  %         0.0  %      0.0  %
----------------------------------------------------
top 15   0/1       0/6            1/13        1/20
         0.0  %    0.0  %         7.7  %      5.0  %
----------------------------------------------------
top 20   0/1       1/6            3/13        4/20
         0.0  %    16.7 %         23.1 %      20.0 %
----------------------------------------------------
top 25   0/1       1/6            3/13        4/20
         0.0  %    16.7 %         23.1 %      20.0 %
----------------------------------------------------
top 50   0/1       3/6            3/13        6/20
         0.0  %    50.0 %         23.1 %      30.0 %
----------------------------------------------------
top 75   0/1       4/6            4/13        8/20
         0.0  %    66.7 %         30.8 %      40.0 %
----------------------------------------------------
top 100  0/1       4/6            6/13        10/20
         0.0  %    66.7 %         46.2 %      50.0 %
----------------------------------------------------
top 150  0/1       5/6            9/13        14/20
         0.0  %    83.3 %         69.2 %      70.0 %
----------------------------------------------------
top 200  0/1       6/6            12/13       18/20
         0.0  %    100.0%         92.3 %      90.0 %
----------------------------------------------------
top 250  1/1       6/6            13/13       20/20
         100.0%    100.0%         100.0%      100.0%
----------------------------------------------------
top 300  1/1       6/6            13/13       20/20
         100.0%    100.0%         100.0%      100.0%
----------------------------------------------------
top 350  1/1       6/6            13/13       20/20
         100.0%    100.0%         100.0%      100.0%
----------------------------------------------------

Generated plot

Documentation

The documentation of our program is generated with Sphinx and built on Read the Docs.

Authors

We are master students in bioinformatics at Paris Diderot University.

Acknowledgment

Thanks to Maïté Cretin for the nice logo.

License

This project is licensed under the MIT License.
