All nf-core pipelines have been successfully configured for use on the AWS Batch at the Chan Zuckerberg Biohub here.
To use, run the pipeline with -profile czbiohub_aws
. This will download and launch the czbiohub_aws.config
which has been pre-configured with a setup suitable for the AWS Batch. Using this profile, a docker image containing all of the required software will be downloaded, and converted to a Singularity image before execution of the pipeline.
Ask Olga ([email protected]) if you have any questions!
The pipeline will monitor and submit jobs to AWS Batch on your behalf. To ensure that the pipeline is successful, it will need to be run from a computer that has constant internet connection. Unfortunately for us, Biohub has spotty WiFi and even for short pipelines, it is highly recommended to run them from AWS.
tmux is a "Terminal Multiplexer" that allows for commands to continue running even when you have closed your laptop. Start a new tmux session with tmux new
and we'll name this session nextflow
.
tmux new -n nextflow
Now you can run pipelines with abandon!
To make sharing your pipelines and commands easy between your teammates, it's best to share code in a GitHub repository. One way is to store the commands in a Makefile (example) which can contain multiple nextflow run
commands so that you don't need to remember the S3 bucket or output directory for every single one. Makefiles are broadly used in the software community for running many complex commands. Makefiles can have a lot of dependencies and be confusing, so we're only going to write simple Makefiles.
rnaseq:
nextflow run -profile czbiohub_aws nf-core/rnaseq \
--reads 's3://czb-maca/Plate_seq/24_month/180626_A00111_0166_BH5LNVDSXX/fastqs/*{R1,R2}*.fastq.gz' \
--genome GRCm38 \
--outdir s3://olgabot-maca/nextflow-test/
Human_Mouse_Zebrafish:
nextflow run czbiohub/nf-kmer-similarity -latest -profile aws \
--samples s3://kmer-hashing/hematopoeisis/smartseq2/human_mouse_zebrafish/samples.csv
Merkin2012_AWS:
nextflow run czbiohub/nf-kmer-similarity -latest --sra "SRP016501" \
-r olgabot/support-csv-directory-or-sra \-profile aws
In this example, one would run the rnaseq
rule and the nextflow command beneath it with:
make rnaseq
If one wanted to run a different command, e.g. human_mouse_zebrafish
, they would specify that command instead. For example:
make human_mouse_zebrafish
Makefiles are a very useful way of storing longer commands with short mnemonic words.
Once you create a new repository (best to initialize with a .gitignore
, license - MIT and README
), clone that repository to your EC2 instance. For example, if the repository is called kh-workflows
, this is what the command would look like:
git clone https://github.com/czbiohub/kh-workflows
Now both create and edit a Makefile
:
cd
nano Makefile
Write your rule with a colon after it, and on the next line must be a tab, not spaces. Once you're done, exit the program (the ^
command shown in nano means "Control"), write the file, add it to git, commit it, and push it up to GitHub.
git add Makefile
git commit -m "Added makefile"
git push origin master
Remember to specify -profile czbiohub_aws
to grab the CZ Biohub-specific AWS configurations, and an --outdir
with an AWS S3 bucket so you don't run out of space on your small AMI
nextflow run -profile czbiohub_aws nf-core/rnaseq \
--reads 's3://czb-maca/Plate_seq/24_month/180626_A00111_0166_BH5LNVDSXX/fastqs/*{R1,R2}*.fastq.gz' \
--genome GRCm38 \
--outdir s3://olgabot-maca/nextflow-test/
If you close your laptop, get onto the train, or lose WiFi connection, you may lose connection to AWS and may need to restart the jobs. To reattach, use the command tmux attach
and you should see your Nextflow output! To get the named session, use:
tmux attach -n nextflow
To restart the jobs from where you left off, add the -resume
flag to your nextflow
command:
nextflow run -profile czbiohub_aws nf-core/rnaseq \
--reads 's3://czb-maca/Plate_seq/24_month/180626_A00111_0166_BH5LNVDSXX/fastqs/*{R1,R2}*.fastq.gz' \
--genome GRCm38 \
--outdir s3://olgabot-maca/nextflow-test/ \
-resume
It's important that this command be re-run from the same directory as there is a "hidden" .nextflow
folder that contains all the metadata and information about previous runs.
A local copy of the iGenomes resource has been made available on s3://czbiohub-reference/igenomes
(in us-west-2
region) so you should be able to run the pipeline against any reference available in the igenomes.config
specific to the nf-core pipeline.
You can do this by simply using the --genome <GENOME_ID>
parameter.
For Human and Mouse, we use GENCODE gene annotations. This doesn't change how you would specify the genome name, only that the pipelines run with the czbiohub_aws
profile would be with GENCODE rather than iGenomes.
NB: You will need an account to use the HPC cluster on PROFILE CLUSTER in order to run the pipeline. If in doubt contact IT. NB: Nextflow will need to submit the jobs via the job scheduler to the HPC cluster and as such the commands above will have to be executed on one of the login nodes. If in doubt contact IT.
If you would like to run with the High Priority queue, specify the highpriority
config profile after czbiohub_aws
. When applied after the main czbiohub_aws
config, it overwrites the process queue
identifier.
To use it, submit your run with with -profile czbiohub_aws,highpriority
.
Note that the order of config profiles here is important. For example, -profile highpriority,czbiohub_aws
will not work.