An introduction to tools for reproducible analysis pipelines including Git, Singularity, and Nextflow.
- a terminal/SSH client - we recommend Git Bash for Windows; Terminal for macOS (Terminal ships with macOS, so no installation is necessary!)
- a GitHub account, with an SSH key added from the laptop you're bringing to the workshop
We will be giving everyone a Google Compute Engine instance to work with during the workshop. IPs will be announced at the beginning of the workshop, and any SSH keys associated with your GitHub account will allow you to log in!
We log in using secure shell (SSH), a protocol for connecting to remote servers:
ssh yourGitHubUsername@yourcloudIP
We've set up an external mount for everyone to pull data from (mounted at /data). The first thing you'll need to do is get a copy of this data to work with. To keep things tidy, we'll first make a new folder to put it in. Note: the ~ is a shortcut for your home directory on Unix-like systems.
- First, ensure you're in your home directory:
cd ~
- Then, make a new directory called workshop to put the data in:
mkdir workshop
- Lastly, copy the data from the mount:
cp -pr /data/* ~/workshop
- Fork this repository so you have your own copy. You can do this by clicking the 'Fork' button in the top right of the repository's page on GitHub.
- Ensure you're in your home directory:
cd ~
- Clone your new repository on the Google cloud instance:
git clone https://github.com/YOURUSERNAME/rmghc-workshop-19.git
Note: We're using HTTPS cloning so we don't have to generate and add SSH keys from the cloud instance.
- When you type `ls`, you should see a folder named for the repository.
We've set the default editor on the compute instances to nano, so if you type something like `git commit` without specifying a commit message on the command line, nano will open. We find that nano has a smaller learning curve. We aren't doing much editing anyhow, but you can use whatever method you like for editing files. If you're comfortable with vim, it's available as well. If you know how to rsync/scp files to and from your laptop and you'd rather sync changes that way and edit locally, that's fine!
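If you'd rather skip the editor entirely, you can supply the message inline with -m (the message text below is just a placeholder):
git commit -m "Describe your change here"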
First, let's take a look at what the basic options of Singularity are for:
$ singularity --help
Here you will see several of the commands we are going to use today: "run", "exec", "build", and "inspect".
Build a local image directly from a remote repository.
$ cd ~
$ singularity build VerySerious.img shub://GodloveD/lolcow
We now have a local Singularity image called VerySerious.img which can be executed. Use the "run" command to execute the container's pre-built runscript.
$ singularity run VerySerious.img
In this case, creating the image wasn't strictly necessary; we can execute this same command by pulling directly from Singularity Hub.
$ singularity run shub://GodloveD/lolcow
We previously built an image from Singularity Hub; we can also do the same from Docker Hub.
$ singularity build graphtool.img docker://tiagopeixoto/graph-tool:latest
To help illustrate how you can execute a command that doesn't exist on your local machine, first attempt to check the samtools version locally:
$ samtools --version
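If samtools isn't installed on the instance itself, this should fail with something like `bash: samtools: command not found`.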
Now use the exec command to run the same command inside the container:
$ singularity exec /data/singularity/RinnLab_RNASeq.6.13.img samtools --version
Depending on how a local image was created, the inspect command will show different output.
$ singularity inspect VerySerious.img
$ singularity inspect -d VerySerious.img
You will see much more information for images built locally from Singularity definition files.
$ singularity inspect /data/singularity/RinnLab_RNASeq.6.13.img
$ singularity inspect -d /data/singularity/RinnLab_RNASeq.6.13.img
The script test.nf has a few input parameters and nothing else. Let's run it using:
nextflow run test.nf
Now, let's modify the script to include another parameter called email and add a line to log the email. Then run it again.
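A minimal sketch of those two additions (the default value and the log wording here are placeholders):
params.email = "you@example.com"
log.info "email: ${params.email}"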
We can view the output of the `reads` channel with the command `.println()`.
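For example (assuming the script defines a channel named `reads`):
reads.println()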
Since a channel can only be consumed once, let's remove the `println` statement in order to add a process.
Now let's create an empty process that will view the beginning of the read files.
process view_reads {
input:
output:
script:
"""
"""
}
The input needs to be the `reads` channel and the output will be a text file, which we can specify with `*.txt`.
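In Nextflow's DSL1 syntax, those declarations might look like this (a sketch; it assumes `reads` emits (sample_id, read_files) pairs, as produced by fromFilePairs):
input:
set sample_id, file(read_files) from reads

output:
file "*.txt"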
Now let's add in the actual command that we want to run, so that the script looks like this:
script:
"""
zcat ${read_files[0]} | head > ${sample_id}_reads.txt
"""
Lastly, let's publish the results of this process to a directory that is more interpretable with `publishDir "results"`, which we will place above the `input:` line.
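Putting the pieces together, the whole process might now look like this (a sketch under the same assumptions as above):
process view_reads {
    publishDir "results"

    input:
    set sample_id, file(read_files) from reads

    output:
    file "*.txt"

    script:
    """
    zcat ${read_files[0]} | head > ${sample_id}_reads.txt
    """
}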
Now let's run Nextflow again and view the output in the `results` directory.
nextflow run main.nf \
-resume \
-with-report /srv/http/nextflow_report.html \
-with-dag /srv/http/flowchart.html
The cloud instances supplied for the workshop have a web server that's already up and running, so we don't have to waste time copying files between the cloud instance and our laptops to view them! Simply copy any pertinent HTML files you want to view to /srv/http and then type http://yourip into your web browser on your laptop!
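For example (assuming your published reports ended up under ./results and your user can write to the web root):
cp results/*.html /srv/http/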
The input channels to the process are `annotation_for_de`, `salmon_for_de.collect()`, and `sample_info_for_de`. The output that we care about is an HTML file, which can be specified to Nextflow with `file "*.html"`.
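A sketch of those declarations (`annotation` matches the script below; `salmon_dirs` and `sample_info` are placeholder names):
input:
file annotation from annotation_for_de
file salmon_dirs from salmon_for_de.collect()
file sample_info from sample_info_for_de

output:
file "*.html"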
script:
"""
cp ${baseDir}/bin/*.R* .
Rscript -e 'rmarkdown::render("differential_expression.Rmd", params = list(annotation_file = "${annotation}"))'
"""
Just like our Nextflow reports, we can copy our RMarkdown output files (which are HTML) to the web root on the cloud instance, then type http://yourip into your web browser on your laptop.