Simple snakemake pipeline for scaling AlphaFold2
- Version: 0.1.0
- Authors:
- Nick Youngblut [email protected]
- Maintainers:
- Nick Youngblut [email protected]
This snakemake pipeline handles the software install and cluster job submission/tracking.
Note: the pipeline was designed and tested for an SGE cluster. You may need to adapt the pipeline somewhat to work on other clusters or cloud computing services.
For failed cluster jobs, job resources are automatically escalated in an attempt to complete the job successfully, on the assumption that the job died due to a lack of cluster resources (e.g., insufficient memory).
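Snakemake supports this escalation pattern natively via the `attempt` value passed to callable `resources`. A minimal sketch of the idea (the base memory value and the doubling policy here are illustrative, not this pipeline's actual settings):

```python
# Illustrative sketch: escalate a memory request on each snakemake retry.
# In a Snakefile, a callable given to `resources:` receives `wildcards` and
# `attempt` (1 on the first try, incremented on each restart).
def escalating_mem_gb(base_gb=8):
    """Return a resources callable that doubles the memory request per retry."""
    def mem_gb(wildcards, attempt):
        return base_gb * 2 ** (attempt - 1)
    return mem_gb

mem = escalating_mem_gb(base_gb=8)
# attempt 1 -> 8 GB, attempt 2 -> 16 GB, attempt 3 -> 32 GB
```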
AlphaFold is run in 2 parts:
- Generation of the MSAs
  - Only CPUs are required for the database searches
  - All subprocesses use the same number of CPUs, unlike in the original alphafold code
- Prediction of protein structures
  - GPU usage is recommended (and used by default)
To do this, the pipeline utilizes a modified version of alphafold. Only the user interface has been edited, not how alphafold actually functions.
The setup is based upon alphafold_non_docker.
NOTE: You may need to change the locations of all required databases if you do not have access to the database paths listed in the config.yaml.
Clone the pipeline
git clone --recurse-submodules <alphafold_sm>
If you forgot to use --recurse-submodules:
cd ./alphafold_sm/bin/
git submodule add https://github.com/leylabmpi/ll_pipeline_utils.git
git submodule add https://github.com/nick-youngblut/alphafold.git
git submodule update --remote --init --recursive
Download chemical properties to the common folder
wget -q -P bin/scripts/alphafold/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
You need a conda environment with snakemake installed.
Be sure to activate your snakemake conda environment!
You may need to download the required alphafold databases if you do not have access to the database files listed in the config.yaml.
The pipeline processes each user-provided fasta separately, in parallel.
If running model_preset: monomer, then each fasta should contain 1 sequence.
If running model_preset: multimer, then each fasta can contain >=1 sequence.
You can use ./utils/seq_split.py for splitting a multi-fasta into per-sequence fasta files for input to this pipeline.
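The splitting itself amounts to writing each record of the multi-fasta to its own file. A minimal pure-Python sketch of that behavior (an illustration of the idea, not the actual seq_split.py implementation):

```python
import os

def split_fasta(multi_fasta_path, out_dir):
    """Write each sequence in a multi-fasta to its own per-sequence fasta file."""
    os.makedirs(out_dir, exist_ok=True)
    out_files = []
    name, seq_lines = None, []

    def flush():
        # write out the record accumulated so far, if any
        if name is None:
            return
        # use the first whitespace-delimited token of the header as the file name
        out_path = os.path.join(out_dir, name.split()[0] + '.fasta')
        with open(out_path, 'w') as out:
            out.write('>{}\n{}\n'.format(name, ''.join(seq_lines)))
        out_files.append(out_path)

    with open(multi_fasta_path) as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('>'):
                flush()
                name, seq_lines = line[1:], []
            else:
                seq_lines.append(line)
    flush()
    return out_files
```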
The config.yaml file sets the parameters for the pipeline.
use_gpu:
- Only used if cluster=True, which is set automatically when using ./snakemake_sge.sh to run the pipeline on the MPI Bio. cluster.
- If cluster=False (e.g., a run on a local server), then only CPUs will be used.
Other params:
- See the alphafold documentation
databases:
- base_path:
  - All databases are assumed to be within this path; in other words, the base_path is prepended to all database paths.
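The effect of base_path can be illustrated with a small sketch (the database names and paths below are examples, not the pipeline's exact config keys):

```python
import os

def resolve_db_paths(base_path, db_paths):
    """Prepend the shared base_path to each relative database path."""
    return {name: os.path.join(base_path, rel) for name, rel in db_paths.items()}

# Hypothetical example values, for illustration only
dbs = resolve_db_paths(
    '/data/alphafold_dbs',
    {'uniref90': 'uniref90/uniref90.fasta',
     'mgnify': 'mgnify/mgy_clusters.fa'},
)
# dbs['uniref90'] -> '/data/alphafold_dbs/uniref90/uniref90.fasta'
```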
pipeline:
- export_conda:
  - Export all conda envs at the end of a successful run
- If you delete the ./snakemake/conda/ directory, then BE SURE TO also delete the pip_update.done and patch.done files in the output directory; otherwise you will have to apply the pip update & patch manually to the alphafold conda environment that snakemake automatically generates.
For general info on alphafold output, see the alphafold docs.
mTM-align is used for 2 sets of comparisons:
- Intra
  - The ranked_[0-9].pdb structures are compared per-sample
- Inter
  - The ranked_0.pdb structures are compared between samples
- Structure-based calculations
- structural comparison
- visualization
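The two comparison sets amount to simple selections over alphafold's per-sample output files. A sketch of that selection logic (the sample names and file listings are illustrative):

```python
import fnmatch

# Hypothetical per-sample output listings, mimicking alphafold's layout
outputs = {
    'sampleA': ['ranked_0.pdb', 'ranked_1.pdb', 'ranked_2.pdb', 'features.pkl'],
    'sampleB': ['ranked_0.pdb', 'ranked_1.pdb', 'timings.json'],
}

# Intra: within each sample, all ranked models are compared to each other
intra = {s: sorted(f for f in fs if fnmatch.fnmatch(f, 'ranked_[0-9].pdb'))
         for s, fs in outputs.items()}

# Inter: only the top-ranked model of each sample is compared across samples
inter = sorted('{}/ranked_0.pdb'.format(s)
               for s, fs in outputs.items() if 'ranked_0.pdb' in fs)
```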