This directory contains scripts and files to run the bioinformatic analysis of Zika genomes from the USVI. Broadly, there are three distinct pipelines:
- Pipeline for genomes sequenced on the Illumina MiSeq.
- Pipeline for genomes sequenced on the MinION, which runs on the Rhino cluster at Fred Hutch.
- Pipeline for genomes sequenced on the MinION, run via a Docker image (currently deprecated).
The vast majority of this README describes the bioinformatic pipeline to analyze MinION data on the Rhino cluster. Information on the MiSeq pipeline can be found in the `miseq_pipeline` directory. For an explanation of the Docker pipeline, see `docker_README.md`.
This pipeline uses the Conda package manager and Snakemake so that the entire pipeline can be run quickly on a laptop computer. One run of the pipeline corresponds to one sequencing library. Note that changes may be necessary if running on an operating system other than Ubuntu 16.04 LTS.
- Download Albacore. Download the `.deb` for Albacore from the ONT Community. If a version other than 2.0.2 is downloaded, modify the `dpkg` command in `install.sh` appropriately. Note that you need to have login credentials with Nanopore to access the Albacore software.
- Download and install the pipeline from the GitHub repo:

  ```
  git clone https://github.com/blab/zika-seq.git
  cd zika-seq
  git checkout create_wrap
  ```

  If you are running Ubuntu 16.04:

  ```
  ./install_Ubuntu16_04.sh
  ```

  If you are running Mac OSX:

  ```
  ./install_MacOSX.sh
  ```

  If `./<install_script>` fails, you may need to run `chmod 700 <install_script>` before rerunning.
Note: If you are running the install on Mac OSX, there's a possibility that nanopolish will not install properly. If so, you can do a manual install of nanopolish with:

```
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish
make CXX=g++-7 CC=gcc-7
```

Then add nanopolish to your `$PATH` with `export PATH=$PATH:~/your/path/to/nanopolish/`. We've found that the install tends to fail if you `make` without referencing these exact versions of g++ and gcc. If this still doesn't work, it might be an issue with the Xcode install of CommandLineTools (we've found that this happens on OSX High Sierra). In this case, try running `xcode-select --install` to properly install CommandLineTools, then re-cloning and making the nanopolish directory.
- Make any directories that you have specified in the config file that do not already exist. We usually put these directories in the repo itself. If it is your very first time running the repo you'll need to make `build` and `data` directories. For subsequent runs, you'll need to make library-specific directories that contain `basecalled_reads`, `process`, and `demux` directories.
- Open `cfg.py` and change config information as appropriate (a sketch of an example config follows this list):
  - `raw_reads`: directory containing un-basecalled `.fast5` numbered directories
  - `dimension`: sequencing dimension (1d or 2d)
  - `demux_dir`: path to the directory where demultiplexing will take place
  - `build_dir`: path to the output location (`zika-seq/build`)
  - `samples`: list of all samples that are included in the library that will be processed
  - `albacore_config`: name of the config file to be used during basecalling by Albacore
  - `prefix`: prefix to prepend onto output consensus genome filenames
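For orientation, here is a minimal sketch of what the config values might look like for a hypothetical library. Every value below is illustrative (and the exact syntax expected by `cfg.py` may differ), so substitute your own paths and sample names.

```python
# Hypothetical example config -- all values below are illustrative.
raw_reads = "data/library5/raw_reads"      # un-basecalled .fast5 numbered directories
dimension = "1d"                           # sequencing dimension: "1d" or "2d"
demux_dir = "data/library5/demux"          # where demultiplexing will take place
build_dir = "build"                        # output location (zika-seq/build)
samples = ["VI1", "VI2", "VI3"]            # samples included in this library
albacore_config = "r94_450bps_linear.cfg"  # Albacore basecalling config (1D reads)
prefix = "ZIKA_USVI"                       # prepended to consensus genome filenames
```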
Reminder: Make sure that the directories listed in `demux_dir` and `build_dir` exist prior to running `snakemake`.
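If you'd rather not create those directories by hand, a short sketch like the following will do it; the paths are the illustrative values from the example config above, not defaults.

```python
from pathlib import Path

# Create the demux and build directories from cfg.py if they are missing.
for d in ("data/library5/demux", "build"):  # substitute your own cfg.py values
    Path(d).mkdir(parents=True, exist_ok=True)
```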
- Run the pipeline:

  ```
  source activate zika-seq
  snakemake --use-conda
  ```
If this is your first time using `conda` and `snakemake`, you'll need to add the path to your version of Miniconda to your bash profile. To check whether a path already exists, run `echo $PATH`. If there isn't a path to conda in the bash profile already, make one using `export PATH=$PATH:your/path/to/miniconda3/bin`.
Remember to move all final files that you want to keep to a separate directory, as everything in the `build` directory will be overwritten the next time you run the pipeline (with a different config file).
If a single sample failed, rather than re-running the pipeline on all the samples, change the config file so that the pipeline runs only on the failed sample.
When Albacore basecalls the raw `fast5` reads, it makes a `workspace` directory, which contains a `pass` and a `fail` directory. Both the `pass` and `fail` directories contain a single `fastq`. The `fastq` in the `pass` directory contains the high-quality reads, and the pipeline uses only this `fastq`.
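As a quick sanity check, you can locate the file the pipeline will consume with a sketch like this (the basecalled-reads path is illustrative, not a default):

```python
from pathlib import Path

# The pipeline only consumes the high-quality reads in workspace/pass/.
workspace = Path("data/library5/basecalled_reads/workspace")  # illustrative path
pass_fastq = next(workspace.joinpath("pass").glob("*.fastq"))
print(f"high-quality reads: {pass_fastq}")
```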
The basecalled `fastq` file serves as input to Porechop, the program that performs barcode demultiplexing. Porechop writes a single `fastq` file for each barcode to the `demux` directory you created.
Next we run `pipeline.py`, a large script that calls other custom Python scripts and shell scripts to run every other step of the pipeline. Here is what occurs in `pipeline.py`:
1. Map sample IDs and run information contained in the `samples.tsv` and `runs.tsv` files. This allows us to know which sample IDs are linked to which barcodes, primers, etc.
2. Convert demuxed `fastq` files to `fasta` files. (If you look in the `demux` directory after a pipeline run you'll see both file types present for each barcode.)
3. Using the linked sample and run information, combine the two barcode `fastq` or `fasta` files that represent pool 1 and pool 2 of the same sample. The pool-combined files are written as `<sample ID>.fasta` and `<sample ID>.fastq` to the `build` directory you made previously.
4. Map the reads in `<sample ID>.fasta` to a reference sequence using `bwa mem`, and pipe to `samtools` to get a sorted BAM file. Output files are labeled `<sample ID>.sorted.bam`.
5. Trim primer sequences out of the BAM file using the custom script `align_trim.py`. Output files are labeled `<sample ID>.trimmed.sorted.bam`.
6. Use `nanopolish` to call SNPs more accurately. This is a two-step process. The first step uses `nanopolish index` to create a map of how basecalled reads in `<sample ID>.fastq` are linked to the raw reads (signal-level data). This step takes a while, since for every sample the program needs to iterate through all the raw reads. Next, `nanopolish variants` walks through the indexed `fastq` and a reference file, determines which SNPs are real given the signal-level data, and writes true SNPs to a VCF file.
7. Run `margin_cons.py` to walk through the reference sequence, the trimmed BAM file, and the VCF file. This script looks at read coverage at each site, masking the sequence with 'N' if the read depth is below a hardcoded threshold (we use a threshold of 20 reads). If a site has sufficient coverage to call the base, either the reference base or the variant base (as recorded in the VCF) is written to the consensus sequence. Consensus sequences are written to `<sample ID>_complete.fasta`. The proportion of the genome that has over 20x coverage and over 40x coverage is logged to `<sample_ID>-log.txt`.
8. Consensus sequences are bundled together into `good`, `partial`, or `poor` files depending on the percent of the genome that was sequenced. Good-quality genomes are >80% complete, partial genomes are between 50% and 80% complete, and poor genomes are <50% complete. (A sketch of the logic in steps 7 and 8 follows this list.)
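To make steps 7 and 8 concrete, here is a minimal sketch of the masking and bundling logic described above. This is not the actual `margin_cons.py`; it assumes the per-site read depths and the VCF variants have already been parsed into Python structures, and it only illustrates the decisions being made.

```python
DEPTH_THRESHOLD = 20  # sites below this read depth are masked with 'N'

def consensus_base(pos, ref_base, depth, variants):
    """Return the consensus base at a 0-based position.

    depth    -- list of per-site read depths from the trimmed, sorted BAM
    variants -- dict mapping position -> alternate base, parsed from the VCF
    """
    if depth[pos] < DEPTH_THRESHOLD:
        return "N"                      # insufficient coverage: mask the site
    return variants.get(pos, ref_base)  # variant base if called, else reference

def build_consensus(reference, depth, variants):
    return "".join(
        consensus_base(i, base, depth, variants)
        for i, base in enumerate(reference)
    )

def quality_bin(consensus):
    """Bundling rule from step 8: the unmasked fraction decides the bin."""
    complete = 1 - consensus.count("N") / len(consensus)
    if complete > 0.8:
        return "good"
    elif complete >= 0.5:
        return "partial"
    return "poor"
```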
Different modules are required for different parts of the pipeline; they can be loaded using `module load <module name>` from Rhino. Descriptions of when each module should be loaded are in the step-by-step instructions below.
- Basecalling:
  - `Python/3.5.2-foss-2016b-fh2`: Albacore 1.2.1
- Barcode demultiplexing:
  - `Python/2.7.13-foss-2016b-fh2`: Poretools
  - `Python/3.5.2-foss-2016b-fh1`: Porechop
- Consensus genome generation (all necessary for the `pipeline.py` step):
  - `Python/3.6.1-foss-2016b-fh1`
  - `R/3.4.0-foss-2016b-fh1`
  - `BWA/0.7.15-foss-2016b`
  - `SAMtools/1.3.1-foss-2016b`
  - `nanopolish/0.7.1-foss-2016b`: breaks one step of the pipeline; waiting for an update from SciComp.
- `<RAW_READS_DIRECTORY>` needs to be an absolute path, for instance `/fh/fast/bedford_t/zika-seq/data/<RUN>/raw_reads/`.
- `<BASECALLED_READS_DIRECTORY>` is assumed by `pipeline.py` to be `/fh/fast/bedford_t/zika-seq/data/<RUN>/alba121/`. If this is not the case, errors will arise on step 4.2.
- `<CONFIG_FILE>` is `r94_250bps_nsk007_2d.cfg` for 2D reads and `r94_450bps_linear.cfg` for 1D reads.
- `<SAMPLES_TO_RUN>` are, for example: `VI1 VI2 VI3`.
- Load Albacore with `module load Python/3.5.2-foss-2016b-fh2`.
- Sbatch out the Albacore job as follows:

  ```
  sbatch --time=48:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="read_fast5_basecaller.py -i <RAW_READS_DIRECTORY> -t 8 --config <CONFIG_FILE> -r -s <BASECALLED_READS_DIRECTORY> -o fast5"
  ```
Albacore writes basecalled `fast5` files into subdirectories within the `workspace` directory. Each directory is given a number (starting at `0/`). Albacore writes 4000 files into a directory and then makes a new directory, in numerical order; in the end, you'll have a lot of subdirectories within `workspace`. Importantly, Porechop, which we use for demultiplexing the barcoded reads, requires that all the files be in a single directory. Therefore, the first step once you have basecalled reads is to use `prep_lib.py` to move all the files out of the subdirectories into a single directory (sketched below). This script also makes a directory called `demux/`, which is where Porechop will write the demultiplexed fasta files.
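As a rough illustration (not the actual `prep_lib.py`, which takes a `--library` argument and resolves paths itself), the consolidation amounts to something like:

```python
from pathlib import Path
import shutil

# Flatten Albacore's numbered output (workspace/0/, workspace/1/, ...) into
# a single directory, since Porechop needs all reads in one place.
workspace = Path("data/library5/basecalled_reads/workspace")  # illustrative path
for subdir in sorted(p for p in workspace.iterdir() if p.is_dir()):
    for f in list(subdir.iterdir()):
        shutil.move(str(f), str(workspace / f.name))
    subdir.rmdir()

# prep_lib.py also creates the demux/ directory for Porechop's output.
(workspace / "demux").mkdir(exist_ok=True)
```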
- Submit the library prep job as:

  ```
  sbatch --time=48:00:00 --mem=10000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="python pipeline/basecall/prep_lib.py --library <LIBRARY_NUMBER>"
  ```
Note: Your library might be huge, in which case there will be too many files in `workspace` for Porechop to handle. If this is the case, you will need to split the files back up into smaller subdirectories and submit a Porechop job for each subdirectory. If your library has more than 4 million reads, you can use `split_1d_library.py` to sort reads into directories that contain 500,000 reads each (see the sketch below).
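Again as a hedged sketch rather than the real `split_1d_library.py`, the re-chunking looks roughly like this:

```python
from pathlib import Path
import shutil

CHUNK_SIZE = 500_000  # reads per subdirectory, per the note above

workspace = Path("data/library5/basecalled_reads/workspace")  # illustrative path
reads = sorted(p for p in workspace.iterdir() if p.is_file())
for i, read in enumerate(reads):
    chunk_dir = workspace / str(i // CHUNK_SIZE)  # 0/, 1/, 2/, ...
    chunk_dir.mkdir(exist_ok=True)
    shutil.move(str(read), str(chunk_dir / read.name))
```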
- Load nanopolish with `ml nanopolish/0.7.1-foss-2016b` (uses Python 2).
- Change the working directory to `<BASECALLED_READS_DIRECTORY>/workspace/demux/`.
- Submit the read extraction job as:

  ```
  sbatch --time=96:00:00 --mem=32000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="$EBROOTNANOPOLISH/nanopolish extract -b albacore -t template -o nanopolish_full.fasta ../"
  ```

- Load Porechop with `module load Python/3.5.2-foss-2016b-fh1` (uses Python 3).
- Submit the demultiplexing job as:

  ```
  sbatch --time=24:00:00 --mem=20000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="porechop -i <FILENAME.fasta> -b . --barcode_threshold 75 --threads 16 --check_reads 100000"
  ```

- Once the job completes, run `gunzip NB*` from within the directory to unzip the files in preparation for consensus genome generation.
Porechop will write out a fasta file for each barcode in the `demux/` directory (for example, `demux/NB01.fasta`). If you had a large library, split basecalled reads into subdirectories of 500,000 files each, and submitted a Porechop job for each subdirectory, then you'll have multiple fastas with the demultiplexed reads. In such cases, use `cat_demux_fastas.py` to consolidate all of the reads into a single fasta for each barcode (sketched below).
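A hedged sketch of that per-barcode consolidation (the real `cat_demux_fastas.py` may differ in arguments and naming; this assumes each Porechop job wrote into its own subdirectory of `demux/`):

```python
from pathlib import Path
from collections import defaultdict
import shutil

demux = Path("data/library5/basecalled_reads/workspace/demux")  # illustrative path

# Group the per-subdirectory Porechop outputs by barcode name (e.g. NB01),
# then concatenate each group into a single fasta directly under demux/.
by_barcode = defaultdict(list)
for fasta in demux.glob("*/NB*.fasta"):
    by_barcode[fasta.stem].append(fasta)

for barcode, parts in by_barcode.items():
    with open(demux / f"{barcode}.fasta", "w") as out:
        for part in sorted(parts):
            with open(part) as src:
                shutil.copyfileobj(src, out)
```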
- Change the working directory to `/fh/fast/bedford_t/zika-seq/`.
- Load all of the following modules (`ml` is a default Rhino alias for `module load`):

  ```
  ml R/3.4.0-foss-2016b-fh1
  ml Python/3.6.1-foss-2016b-fh1
  ml BWA/0.7.15-foss-2016b
  ml SAMtools/1.3.1-foss-2016b
  ml nanopolish/0.7.1-foss-2016b
  ```

- Submit the job as:

  ```
  sbatch --time=48:00:00 --mem=30000 --mail-type=END,FAIL --mail-user=<EMAIL_ADDRESS> --wrap="python pipeline/scripts/pipeline.py --samples <SAMPLES_TO_RUN> --dimension <DIMENSION>"
  ```
The `pipeline.py` script does all the heavy lifting in terms of alignment to a reference, calling variants, and writing consensus genomes. Details are in a separate README. Note that output is written to the `build/` directory.
- `pipeline/demux/`: Contains scripts and files relating to the demultiplexing of basecalled reads.
- `pipeline/basecall/`: Contains scripts used to prepare libraries for basecalling. This directory is a work in progress and will be made more user friendly soon.
- `pipeline/scripts/`: Contains the majority of the scripts necessary to make consensus genomes from basecalled, demultiplexed reads, as well as for the generation of summary statistics and figures.