All test samples and genome data are shared on the Stanford Sherlock cluster, so you don't have to download any data to test our pipeline on it.
- SSH to Sherlock's login node.
  ```bash
  $ ssh login.sherlock.stanford.edu
  ```
- Git clone this pipeline and move into it.
  ```bash
  $ git clone https://github.com/ENCODE-DCC/chip-seq-pipeline2
  $ cd chip-seq-pipeline2
  ```
- Download Cromwell.
  ```bash
  $ wget https://github.com/broadinstitute/cromwell/releases/download/34/cromwell-34.jar
  $ chmod +rx cromwell-34.jar
  ```
- Set your partition in `workflow_opts/sherlock.json`. THE PIPELINE WILL NOT WORK WITHOUT A PAID SLURM PARTITION DUE TO LIMITED RESOURCE SETTINGS FOR FREE USERS. Ignore the other runtime attributes (they are for Singularity).
  ```json
  {
    "default_runtime_attributes" : {
      "slurm_partition": "YOUR_SLURM_PARTITION"
    }
  }
  ```
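  If you are not sure which partitions your account can submit to, `sinfo` (a standard SLURM command, not part of the pipeline) gives a quick overview before you edit the JSON:
  ```bash
  # List SLURM partitions visible to you, with availability and time limit.
  # Pick a partition you have a paid allocation on and put its name in
  # workflow_opts/sherlock.json above.
  $ sinfo --format="%P %a %l"
  ```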
Our pipeline supports both Conda and Singularity.
- Install Conda dependencies.
  ```bash
  $ bash conda/uninstall_dependencies.sh  # to remove any existing pipeline env
  $ bash conda/install_dependencies.sh
  ```
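  As an optional sanity check (assuming a standard Conda setup), you can confirm that the installer created the pipeline environment before moving on:
  ```bash
  # The environment name below is the one activated in the next step.
  $ conda env list | grep encode-chip-seq-pipeline
  ```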
- Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR936XTK.
  ```bash
  $ source activate encode-chip-seq-pipeline  # IMPORTANT!
  $ INPUT=examples/sherlock/ENCSR936XTK_subsampled_sherlock.json
  $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar run chip.wdl -i ${INPUT} -o workflow_opts/sherlock.json
  ```
- It will take about an hour. You will be able to find all outputs under `cromwell-executions/chip/[RANDOM_HASH_STRING]/`. See the output directory structure for details.
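  If you are not sure which `[RANDOM_HASH_STRING]` directory belongs to your run, a simple `find` over `cromwell-executions` can locate the HTML QC report (the exact file name may differ between pipeline versions):
  ```bash
  # Locate QC report(s) produced by finished runs; adjust the pattern if needed.
  $ find cromwell-executions/chip -name "qc.html"
  ```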
- See the full specification for the input JSON file.
- Add the following line to your BASH startup script (`~/.bashrc` or `~/.bash_profile`).
  ```bash
  module load system singularity
  ```
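  After opening a new shell (or sourcing the startup script), it is worth confirming that the module actually put `singularity` on your PATH; this is only a sanity check:
  ```bash
  $ source ~/.bashrc        # or ~/.bash_profile, or simply log in again
  $ singularity --version   # should print the version provided by the module
  ```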
- Pull a Singularity container for the pipeline. This will pull the pipeline's Docker container first and then build a Singularity one under `~/.singularity`. Stanford Sherlock does not allow building a container on login nodes, so wait until you get a command prompt after `sdev`.
  ```bash
  $ sdev  # Sherlock does not allow building a container on a login node
  $ SINGULARITY_PULLFOLDER=~/.singularity singularity pull docker://quay.io/encode-dcc/chip-seq-pipeline:v1.1
  $ exit  # exit from the interactive node
  ```
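  The pull should leave a `.simg` image under `~/.singularity`; the file name below assumes Singularity's default naming for images pulled from a Docker tag:
  ```bash
  # Adjust the name if your Singularity version names the image differently.
  $ ls -lh ~/.singularity/chip-seq-pipeline-v1.1.simg
  ```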
- Run a pipeline for a SUBSAMPLED (1/400) paired-end sample of ENCSR936XTK.
  ```bash
  $ source activate encode-chip-seq-pipeline  # IMPORTANT!
  $ INPUT=examples/sherlock/ENCSR936XTK_subsampled_sherlock.json
  $ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar run chip.wdl -i ${INPUT} -o workflow_opts/sherlock.json
  ```
- It will take about an hour. You will be able to find all outputs under `cromwell-executions/chip/[RANDOM_HASH_STRING]/`. See the output directory structure for details.
- See the full specification for the input JSON file.
- IF YOU WANT TO RUN PIPELINES WITH YOUR OWN INPUT DATA/GENOME DATABASE, PLEASE ADD THEIR DIRECTORIES TO `workflow_opts/sherlock.json`. For example, if you have input FASTQs on `/your/input/fastqs/` and a genome database installed on `/your/genome/database/`, then add `/your/` to `--bind` in `singularity_command_options`. You can also define multiple directories there, separated by commas.
  ```json
  {
    "default_runtime_attributes" : {
      "singularity_container" : "~/.singularity/chip-seq-pipeline-v1.1.simg",
      "singularity_command_options" : "--bind /scratch,/oak/stanford,/your/,YOUR_OWN_DATA_DIR1,YOUR_OWN_DATA_DIR2,..."
    }
  }
  ```
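  To check that the container can actually see your data through those bind paths, you can run a one-off `ls` inside it (the paths below are the example paths from above, not real ones; run this on an interactive node if login nodes disallow it):
  ```bash
  # /your/input/fastqs/ is the example directory from above; replace it with your real path.
  $ singularity exec --bind /your/ ~/.singularity/chip-seq-pipeline-v1.1.simg ls /your/input/fastqs/
  ```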
- If you want to run multiple (>10) pipelines, then run a Cromwell server on an interactive node. We recommend using `screen` or `tmux` to keep your session alive; note that all running pipelines will be killed when the walltime expires. Run a Cromwell server with the following commands.
  ```bash
  $ srun -n 2 --mem 5G -t 3-0 --qos normal -p [YOUR_SLURM_PARTITION] --pty /bin/bash -i -l  # 2 CPUs, 5 GB RAM and a 3-day walltime
  $ hostname -f  # to get [CROMWELL_SVR_IP]
  ```
  For Conda users:
  ```bash
  $ source activate encode-chip-seq-pipeline
  $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm cromwell-34.jar server
  ```
  For Singularity users:
  ```bash
  $ _JAVA_OPTIONS="-Xmx5G" java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity cromwell-34.jar server
  ```
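  Once the server logs that it is listening on port 8000, you can check it from a login node with Cromwell's standard workflow query endpoint (an empty result list is expected for a fresh server):
  ```bash
  $ curl -X GET --header "Accept: application/json" "[CROMWELL_SVR_IP]:8000/api/workflows/v1/query"
  ```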
- You can modify `backend.providers.slurm.concurrent-job-limit` or `backend.providers.slurm_singularity.concurrent-job-limit` in `backends/backend.conf` to increase the maximum number of concurrent jobs. This limit is not per sample; it applies to all sub-tasks of all submitted samples.
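  To see where these limits currently sit in your copy of the config (the exact nesting may vary between pipeline versions), a quick grep is enough:
  ```bash
  # Shows every concurrent-job-limit line with its line number in backends/backend.conf.
  $ grep -n "concurrent-job-limit" backends/backend.conf
  ```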
On a login node, submit jobs to the cromwell server. You will get
[WORKFLOW_ID]
as a return value. Keep these workflow IDs for monitoring pipelines and finding outputs for a specific sample later.$ INPUT=YOUR_INPUT.json $ curl -X POST --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \ -F [email protected] \ -F workflowInputs=@${INPUT} \ -F workflowOptions=@workflow_opts/sherlock.json
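  The response body is a small JSON object containing the workflow id. If `jq` happens to be available on the login node (an assumption, not a pipeline requirement), you can capture the id directly:
  ```bash
  # Same submission as above; -s silences progress output so only the JSON reaches jq.
  $ WORKFLOW_ID=$(curl -s -X POST --header "Accept: application/json" "[CROMWELL_SVR_IP]:8000/api/workflows/v1" \
      -F workflowSource=@chip.wdl \
      -F workflowInputs=@${INPUT} \
      -F workflowOptions=@workflow_opts/sherlock.json | jq -r '.id')
  $ echo ${WORKFLOW_ID}
  ```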
- To monitor pipelines, see the Cromwell server REST API description for more details. `squeue` will not give you enough information to monitor jobs per sample.
  ```bash
  $ curl -X GET --header "Accept: application/json" -v "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/status"
  ```
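  Two other standard Cromwell REST endpoints are useful for per-sample tracking: `metadata` (detailed per-task state and call directories) and `outputs` (final output paths once the workflow succeeds):
  ```bash
  $ curl -X GET --header "Accept: application/json" "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/metadata"
  $ curl -X GET --header "Accept: application/json" "[CROMWELL_SVR_IP]:8000/api/workflows/v1/[WORKFLOW_ID]/outputs"
  ```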