Binder design pipelines. Mostly still work in process..
Idea is to get get initial binder sequence, that binds to the target protein.
If you dont care about where binders bind or want to explore prefered binding spots, start with 01a_binder_random_dock.ipynb
.
If you want to target specific residues, start with start with 01b_binder_dock.ipynb
for docking scaffolds (usually 3HB) or 01c_binder_dock.ipynb
for docking random scaffolds.
In the 01a_binder_random_location_dock.ipynb
notebook, main thing you need to do is to change parameters, especailly path to target_pdb
, target_chain
and setup number of iterations
.
Final command will look somewhat like this:
sbatch --output=/dev/null --array=0-{scaf_num-1}%{array_lim} helper_scripts/predict_one_binder_random_loc.sh {target_pdb} {target_chain} {scaf_dir} {out_dir} {iterations} {designs_per_iteration} {proteinmpnn_per_input} {recycles} {sampling_temp}
target_pdb
- path to target pdbtarget_chain
- name of target chain (can be multiple chains)scaf_dir
- path to directory with binder scaffold pdbsout_dir
- output directoryiterations
- number of iterations to perform when optimizing one binder (usually 10-15 is enough)designs_per_iteration
- total number of (best) designs to go into next iteration as inputs (default: 10)proteinmpnn_per_input
- total number of sequences to sample per input per iteration (default: 5)recycles
- number of AF2 recycles (default: 12)sampling_temp
- ProteinMPNN sampling temperature (default: 0.15)
Note: Continue with
015b_binder_random_dock_analysis.ipynb
notebook, to quickly get best pdbs for visual inspection
In the 01b_binder_scaffold_dock.ipynb
notebook, you need to setup target_pdb
path and target proteinhotspots
, specify scaffold_dir
(with/without backbone threading before ProteinMPNN) and a number of dockings (num_of_diffusions
) to perform for each scaffold.
Final command:
sbatch --output=/dev/null --array=1-{array_number}%{array_limit} helper_scripts/binder_dock.sh {num_of_diffusions} {total_mpnn} {mpnn_per_design} {num_recycles} {sampling_temp} {scaffold_dir} {prefix} {target_pdb} {hotspots}
num_of_diffusions
- number of dockings to performtotal_mpnn
- total number of sequences for per designmpnn_per_design
- total number of sequences to sample per design (filter fromtotal_mpnn
)num_recycles
- number of AF2 recycles (default: 12)sampling_temp
- ProteinMPNN sampling temperature (default: 0.15)scaffold_dir
- path to directory with binder scaffolds ()target_pdb
- path to target pdbhotspots
- hotspot residues to target with binder (usually around 5)
Note: If the Scaffold directory contains multiple subdirectories, the script will run each directory separately.
In the 01c_random_binder_scaffold_dock.ipynb
notebook, you need to setup target_pdb
path and target proteinhotspots
, number of dockings (num_of_diffusions
) to perform for each scaffold, along with other MPNN/AF2 parameters.
Make sure to setup config file for binder prediction, specifying desired size.
Final command:
sbatch --output=/dev/null --array=0-{array_number-1}%{array_limit} helper_scripts/binder_dock_random_scaff.sh {num_of_diffusions} {total_mpnn} {mpnn_per_design} {num_recycles} {sampling_temp} {prefix} {target_pdb} {hotspots} {contigs} {thread_binders}
num_of_diffusions
- number of dockings to performtotal_mpnn
- total number of sequences for per designmpnn_per_design
- total number of sequences to sample per design (filter fromtotal_mpnn
)num_recycles
- number of AF2 recycles (default: 12)sampling_temp
- ProteinMPNN sampling temperature (default: 0.15)prefix
- input nametarget_pdb
- path to target pdbhotspots
- hotspot residues to target with binder (usually around 5)contigs
- RFdiffusion contigs for binder generationthread_binders
- select if you want to thread binder scaffold with sequence and FastRelax before designing
The idea of 02_binder_optimization.ipynb
notebook to get primary binders and optimize them with iterations of AF2 and ProteinMPNN, sorting by i_pae or some custom scoring function. Filter the binders and for each binder run an optimization script. Make sure to specify number of iterations
and a way to sort
.
Final command:
sbatch --output=/dev/null --array=0-{array_number-1}%{array_limit} helper_scripts/binder_opt.sh {binders_input} {output_dir} {iterations} {starting_mpnns} {designs_per_iteration} {mpnn_per_design} {sort} {af2_recycles} {mpnn_sampling_temp}
binders_input
- path to directory with input binder pdbsoutput_dir
- output directoryiterations
- number of iterations per binder (10-15 is usually enough)starting_mpnns
- number of sampled sequences from input binderdesigns_per_iteration
- total number of (best) designs to go into next iteration as inputs (default: 10)mpnn_per_design
- total number of sequences to sample per input per iteration (default: 5)sort
- sort metrics between iterations (default: "i_pae")af2_recycles
- number of AF2 recycles (default: 12)mpnn_sampling_temp
- ProteinMPNN sampling temperature (default: 0.15)
The idea of 03_binder_analysis.ipynb
notebook is to filter optimized sequences and calculate additional metrics, that would help use discriminate good binders from the bad ones. The notebook divides datasets into smaller chunks of 1000 and subsequently executes them individually on a cluster.
Final command:
sbatch --output=/dev/null --array=0-{array_number-1}%{array_limit} helper_scripts/binder_analysis.sh {input_file} {target_chain} {binder_chain} {metric_path}
input_file
- list of proteins to calculate metrics for (1000 per array job)target_chain
- target chainbinder_chain
- binder chainmetric_path
- file for writing results (prepared with notebook)xml_file
- rosetta xml file
The idea of 04_binder_filter.ipynb
notebook is to filter binders by different metrics, predict them again with AF2 and RF2 (and ESMfold), cluster similar sequences together and group final results.
Workflow:
- filter sequences based on the metrics from
03_binder_analysis.notebook
(i_pae, ddg_dsasa_100, charge, hyd_contacts, shape_comp, ...) - validate sequences again with different prediciton methods (AlphaFold2, RoseTTAfold2, ESMfold)
sbatch --output=/dev/null --array=0-{array_number-1}%{array_limit} helper_scripts/predict_binders.sh {msa_folder} {output_path} {binder_analysis} {prediction_tools}
msa_folder
- folder with input msa files
output_path
- output folder
binder_analysis
- flags for binder analysis (define if you want binder analysis and the position of binder - binder or binder-second)
prediction_tools
- prediction tools to use (colabfold, rosettafold2 or/and ...)
- second round of filtering, based on validation results (af2_plddt, af2_pae, af2_rmsd, ...)
- clustering similar sequences together and prepare final binder sequences to order