Tutorial (Exome scale)
This tutorial shows you how to set up protein structures and run HotMAPS on mutations that were previously mapped to protein structures. You do not need MySQL for this tutorial. To run your own mutations through HotMAPS, as covered in a [subsequent tutorial](Advance tutorial), you will need to load the MuPIT MySQL database (see [here](MySQL database)).
First, download the Protein Data Bank (PDB) structures from ftp://ftp.wwpdb.org/pub/pdb/ and the theoretical protein structure models from https://salilab.org/modbase-download/projects/genomes/H_sapiens/2013/. You will need both the Ensembl and RefSeq theoretical protein structure models (H_sapiens_2013.tar.xz and ModBase_H_sapiens_2013_refseq.tar.xz, respectively). We advise you to look at the instructions for PDB structures, available here. One command to download the structures is below:
$ rsync -rlpt -v -z --delete --port=33444 rsync.rcsb.org::ftp_data/ ./my_pdb_data_dir
This will create a new directory, ./my_pdb_data_dir, containing all the needed PDB structures. If the directory already exists, the --delete flag removes any files in it that are not present on the server. Be aware that the download is large.
Next, extract the theoretical models from their compressed format and update config.txt to point to the directories where you saved the structure files. This involves changing the base directories modbase_dir and pdb_dir, and the matching sub-directory paths for refseq_homology, ensembl_homology, biological_assembly, and non_biological_assembly, to your custom location for the protein structures.
Additionally, download the mutations file, the protein structure annotation file, and the annotations for the CRAVAT reference transcript, available here. Place all three files in a sub-directory called "data". Assuming you are already in the HotMAPS directory:
$ mkdir -p data
$ cd data
$ wget "https://www.dropbox.com/scl/fi/jk1repun20wachbps2zii/mutations.txt.gz?rlkey=udp4k9f9siiuykqauj1m3b9xr&st=sof5rmoz&dl=1" -O mutations.txt.gz
$ gunzip mutations.txt.gz
$ wget "https://www.dropbox.com/scl/fi/0lklxk9h8fkzzwrz9ge1x/pdb_info.txt.gz?rlkey=uadoiylkv1pcuaed267q751eu&st=n1ds6yhq&dl=1" -O pdb_info.txt.gz
$ gunzip pdb_info.txt.gz
$ wget "https://www.dropbox.com/scl/fi/r2k3q0p2e4hmu2t6vqtzk/mupit_annotations.tar.gz?rlkey=6hj49sp4wlw97o1qxgtf2te9h&st=shsuxqvr&dl=1" -O mupit_annotations.tar.gz
$ tar xvzf mupit_annotations.tar.gz
$ cd ..
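A failed or redirected download can leave a missing or empty file behind, so it may be worth checking the inputs before moving on. The following is a minimal sketch only: check_data_files is a hypothetical helper, not part of HotMAPS, and it assumes the two gunzipped files are named mutations.txt and pdb_info.txt as produced by the commands above. It does not check the extracted annotation files, since their names are not given here.

```shell
#!/usr/bin/env bash
# Sketch: confirm the downloaded inputs exist and are non-empty.
# check_data_files is a hypothetical helper, not part of HotMAPS.
check_data_files() {
    local data_dir="${1:-data}" missing=0 f
    for f in mutations.txt pdb_info.txt; do
        # -s is true only when the file exists and has a size greater than zero
        if [ ! -s "$data_dir/$f" ]; then
            echo "missing or empty: $data_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the HotMAPS directory:
#   check_data_files data && echo "inputs look OK"
```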
Assuming you have changed the config.txt file to point to where you downloaded the protein structure files, one additional step is needed to annotate those protein structures:
$ make annotateStructures
To run the code in parallel using Sun Grid Engine (SGE), execute the following make command:
$ make OUTPUT_DIR=myoutput_dir runParallelHotspot
To run the code normally (no parallelization), execute:
$ make OUTPUT_DIR=myoutput_dir runNormalHotspot
Here, myoutput_dir is the output directory (default: output/all_pdb_run).
Note that if you ran the normal version instead of the parallel one, you can skip this next step because the merged file has already been produced. To merge the output from the parallel runs, use the following make command:
$ make OUTPUT_DIR=myoutput_dir mergeHotspotFiles
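The choice between the parallel and normal targets, and whether the merge step is needed, can be sketched as a small helper. This is an illustration under assumptions: choose_hotspot_target is a hypothetical function, and treating the presence of qsub on PATH as a sign that SGE is available is only a heuristic, since other schedulers also provide a qsub command. The block prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick one of the tutorial's make targets depending on
# whether a qsub command is on PATH (a rough proxy for SGE being available).
choose_hotspot_target() {
    if command -v qsub >/dev/null 2>&1; then
        echo runParallelHotspot
    else
        echo runNormalHotspot
    fi
}

target="$(choose_hotspot_target)"
echo "make OUTPUT_DIR=myoutput_dir $target"

# The merge step applies only to the parallel run.
if [ "$target" = runParallelHotspot ]; then
    echo "make OUTPUT_DIR=myoutput_dir mergeHotspotFiles"
fi
```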
Next, the p-values need to be adjusted for multiple hypothesis testing. This step needs the CRAVAT reference transcript files noted in the Initial Setup section, which were saved in the "data" sub-directory (parameter MUPIT_ANNOTATION_DIR in the make command).
$ make multipleTestCorrect OUTPUT_DIR=myoutput_dir MUPIT_ANNOTATION_DIR=annotation_dir Q_VALUE=myqvalue
Here, myqvalue is the q-value threshold for the False Discovery Rate (FDR) correction (.01 by default). The next step is to group significant residues into regions. If you are interested in regions on the actual PDB protein structure, use the following command:
$ make findHotregionStruct OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
As before, myoutput_dir is the output directory and myqvalue is the q-value (default: .01). Similarly, the regions can be constructed for each gene using the reference transcript selected by CRAVAT for each mutation:
$ make findHotregionGene OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
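For a non-parallel run, the steps after structure annotation can be chained in one place so that the same output directory, annotation directory, and q-value are passed to every target. The sketch below is an assumption-laden convenience wrapper, not part of HotMAPS: the make targets and variable names are copied from the commands above, while run_step, hotmaps_pipeline, and the DRY_RUN switch are hypothetical. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch: chain the tutorial's make targets for a serial (non-SGE) run.
# run_step, hotmaps_pipeline and DRY_RUN are hypothetical names.

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Arguments: output directory, annotation directory, q-value (default .01).
# Each step runs only if the previous one succeeded.
hotmaps_pipeline() {
    local out_dir="$1" annot_dir="$2" q_value="${3:-.01}"
    run_step make OUTPUT_DIR="$out_dir" runNormalHotspot &&
    run_step make multipleTestCorrect OUTPUT_DIR="$out_dir" MUPIT_ANNOTATION_DIR="$annot_dir" Q_VALUE="$q_value" &&
    run_step make findHotregionStruct OUTPUT_DIR="$out_dir" Q_VALUE="$q_value" MUPIT_ANNOTATION_DIR="$annot_dir" &&
    run_step make findHotregionGene OUTPUT_DIR="$out_dir" Q_VALUE="$q_value" MUPIT_ANNOTATION_DIR="$annot_dir"
}

# Preview the commands without running them.
hotmaps_pipeline myoutput_dir annotation_dir
```

Running with DRY_RUN=0 from the HotMAPS directory would execute the four targets in order and stop at the first failure.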