March 2024 - only use the Guppy basecaller if you really need to. Guppy has been replaced by Oxford Nanopore's Dorado basecaller. Dorado is recommended since it produces higher-accuracy basecalls.
The motivation behind this (now-outdated) project was to speed up the Guppy basecaller when running on CPU only - we didn't have proper GPUs in the Slurm HPC cluster back then. We then got nice A100 GPUs and could use the GPU script `runbatch_gpu_guppy.sh`. These days we only use the Dorado basecaller from ONT, not Guppy. If you are using CPU-based Guppy on SLURM for some reason, do not run it on any big data (i.e. MinION-scale data is probably the limit, unless you are really, really patient).
These scripts move FAST5 files into subdirectories, then run CPU Guppy on each subdirectory independently on a SLURM cluster.
Guppy is really slow on CPU but incredibly quick on GPU (100-1000X faster!). After torturing our 1000+ core cluster for multiple days with CPU basecalling of MinION runs, we finally moved to a performant Nvidia GPU for basecalling, using the script `runbatch_gpu_guppy.sh`. When using the GPU version, keep all your FAST5 files in a single directory, i.e. you do not need to run `bash batch_split_to_subdirs.sh` to split the FAST5 files into subdirectories.
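As a minimal sketch of the GPU route (assumptions: `runbatch_gpu_guppy.sh` is submitted with `sbatch` from the directory holding the FAST5 files and carries its own `#SBATCH` directives - check the script header, as it may instead be meant to be run directly with `bash`):

```bash
# Confirm the GPU and CUDA driver are visible on the node.
nvidia-smi

# Submit the GPU basecalling job from the FAST5 directory.
# (Assumption: the script is a self-contained SLURM batch script.)
sbatch runbatch_gpu_guppy.sh
```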
Warning: only tested on Ubuntu 16.04 and 20.04 to date (not Windows). Guppy works fine on Ubuntu 20.04; use a Singularity container or the native version from ONT.
- either a working Guppy CPU `guppy_basecaller` (the command `guppy_basecaller` on its own should produce usage output),
- or a Singularity Guppy CPU container with `guppy_basecaller` installed in it,
- or an Nvidia GPU with CUDA installed (quick checks for all three are sketched below).
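A few quick sanity checks for these prerequisites (the container path `guppy_cpu.sif` is a placeholder for whatever your image is called):

```bash
# Native CPU Guppy on the PATH: should print a version string.
guppy_basecaller --version

# Singularity route (replace guppy_cpu.sif with your container image).
singularity exec guppy_cpu.sif guppy_basecaller --version

# GPU route: the CUDA driver should list your card.
nvidia-smi
```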
- Clone the repository.
- Edit the `run_guppy_SLURM.sh` script with your local SLURM settings (see the header sketch after this list):
  - a) queue / partition
  - b) cpus - go easy on your HPC: Guppy uses every core it is given very efficiently, so request a couple of spare cores (e.g. request 10 HPC cores if you're going to run Guppy on 8 cores).
- Copy the bash scripts into the directory containing your reads.
- Make sure the Guppy software from ONT is installed and in your PATH; `guppy_basecaller` on its own should print the program usage.
- Split the FAST5 files into batches for the cluster with `bash batch_split_to_subdirs.sh` (a sketch of what this does follows the list). You may/will need to tune the batch size to your run and server hardware for optimal runtimes.
- Run Guppy, submitting an 8-core job (the default) for each subdirectory concurrently: `bash runbatch_singularity_guppy.sh` (preferred) or `bash runbatch_guppy.sh` (see the submission sketch below).
- Finally, gather the FASTQ data with find/exec/cat. Warning: this combines barcodes and all pass (but not fail) reads into one file, so it may only be appropriate if you run one sample per flowcell (a per-barcode variant is sketched below). Warning: the output must be written to a directory above the one being searched, e.g. `../guppy_part1.fastq.gz` for the output file, otherwise find keeps matching its own output and you create an infinite loop.

```bash
# Run from the reads directory; note the output is written one level up.
srun find *.guppy/pass -name "*.fastq.gz" -exec cat {} > ../guppy_part1.fastq.gz \; &
```
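For the `run_guppy_SLURM.sh` edit above, the SLURM header and Guppy call might look roughly like this. This is a sketch, not the repository script: the partition name, memory request, and flowcell/kit values are placeholders, and the real script's argument handling may differ.

```bash
#!/bin/bash
#SBATCH --partition=main        # a) your local queue / partition (placeholder)
#SBATCH --cpus-per-task=10      # b) hold 10 cores to run Guppy on 8, as advised above
#SBATCH --mem=16G               # placeholder; size to your nodes

# Basecall one batch subdirectory, passed as the first argument.
# Output lands in <subdir>.guppy, matching the *.guppy/pass pattern
# used by the gather step. Flowcell/kit are examples; use your own.
guppy_basecaller --input_path "$1" --save_path "$1.guppy" \
    --cpu_threads_per_caller 8 --num_callers 1 \
    --flowcell FLO-MIN106 --kit SQK-LSK109
```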
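The split step (`batch_split_to_subdirs.sh`) conceptually does something like the following. This is a sketch, not the script itself: the `batch_N` directory names are illustrative, and `BATCH_SIZE` is the knob to tune per run and hardware.

```bash
#!/bin/bash
# Move FAST5 files into numbered subdirectories of BATCH_SIZE files each.
shopt -s nullglob
BATCH_SIZE=500          # tune to your run size and node count
n=0
batch=0
mkdir -p "batch_$batch"
for f in *.fast5; do
    if [ "$n" -ge "$BATCH_SIZE" ]; then
        n=0
        batch=$((batch + 1))
        mkdir -p "batch_$batch"
    fi
    mv "$f" "batch_$batch/"
    n=$((n + 1))
done
```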
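The run step then amounts to one `sbatch` submission per subdirectory, along these lines (a sketch; it assumes the SLURM script takes the subdirectory as its first argument, consistent with the header sketch above):

```bash
#!/bin/bash
# Submit one 8-core Guppy job per batch subdirectory, all concurrently.
for d in batch_*/; do
    sbatch run_guppy_SLURM.sh "${d%/}"
done
```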
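If your flowcell does carry multiple barcodes, a hedged per-barcode variant of the gather step could look like this; it assumes Guppy's usual `pass/barcodeNN` output layout, which you should verify against your own run:

```bash
# Concatenate pass reads per barcode, again writing one level up.
for bc in barcode01 barcode02 unclassified; do   # list your actual barcodes
    find *.guppy/pass/"$bc" -name "*.fastq.gz" -exec cat {} + > "../${bc}.fastq.gz"
done
```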
Done
If you have any questions, please raise an issue in this repository.