Scripts for generating high-accuracy single-molecule Nanopore reads with the MrHAMER (Multi-read Hairpin Mediated Error-correction Reaction) pipeline.
We recommend installing the following dependencies in a Python virtual environment or Conda environment so that the recommended versions are used. The pipeline is compatible with Ubuntu 16 and 18 LTS.
- Python 2.7.12
- Python 3.5.2
- Guppy basecaller v3.6.0
- Porechop v0.2.4
- Filtlong v0.2.0
- minimap2 v2.17-r954-dirty
- Racon v1.4.3
- samtools v1.9 (with htslib 1.9)
- Medaka v1.0.1
- Pomoxis v0.2.3
- CoVaMa v0.7
- MAFFT v7.471
- CliqueSNV v1.5.4
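A quick way to confirm that the command-line dependencies are reachable is to look them up on PATH. The following is a minimal sketch, not part of the MrHAMER scripts: the executable names are assumptions about how each tool is typically installed, the helper `find_missing_tools` is hypothetical, and the check does not verify version numbers.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = [
    "guppy_basecaller",
    "porechop",
    "filtlong",
    "minimap2",
    "racon",
    "samtools",
    "medaka_consensus",
    "mafft",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `find_missing_tools()` inside the activated environment returns an empty list when every listed executable was found.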
To install the MrHAMER scripts, run the following command:
git clone --recursive https://github.com/gallardo-seq/MrHAMER.git
Once the scripts have downloaded successfully, a folder named MrHAMER will appear in the current working directory.
Usage of MrHAMER is as follows:
- Combine all sequencing reads into a single FASTQ file.
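Combining the reads amounts to concatenating the basecalled FASTQ files. The following is a minimal sketch, assuming uncompressed `.fastq` files in one directory, each ending with a newline; the function name `combine_fastq` and any paths passed to it are hypothetical.

```python
import glob
import os
import shutil


def combine_fastq(input_dir, output_path):
    """Concatenate every *.fastq file in input_dir into output_path.

    Returns the number of files that were combined.
    """
    # List the inputs before the output file is created or truncated.
    fastq_files = sorted(glob.glob(os.path.join(input_dir, "*.fastq")))
    out_abs = os.path.abspath(output_path)
    combined = 0
    with open(output_path, "wb") as out_handle:
        for fastq_file in fastq_files:
            # Skip the output file itself if it lives in input_dir.
            if os.path.abspath(fastq_file) == out_abs:
                continue
            with open(fastq_file, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
            combined += 1
    return combined
```

For example, `combine_fastq("basecalled_reads", "combined.fastq")` writes one file that can be passed to Porechop as `[combined.fastq]`; the directory name here is only a placeholder.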
- Use Porechop to segment the concatemers based on the presence of the MrHAMER hairpin sequence. This requires a custom adapters.py file; a template is included in this repository.
porechop -i [combined.fastq] -o [porechop.output] -t [threads] --extra_middle_trim_bad_side 0 --extra_middle_trim_good_side 0
- Filter the Porechop output with Filtlong, keeping only reads of at least 4000 bases.
filtlong --min_length 4000 [porechop.output] > [filtlong.output]
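To see how many reads a length cutoff retains, read lengths can be tallied before and after filtering. The following is a minimal sketch, assuming a standard four-line-per-record FASTQ file; `count_reads_by_length` is a hypothetical helper and not part of Filtlong.

```python
def count_reads_by_length(fastq_path, min_length=4000):
    """Return (reads_at_or_above_min_length, total_reads) for a FASTQ file.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    kept = 0
    total = 0
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of each four-line record.
            if line_number % 4 == 1:
                total += 1
                if len(line.strip()) >= min_length:
                    kept += 1
    return kept, total
```

Running it on the Porechop output gives the number of reads expected to pass the cutoff; running it on the Filtlong output should report the same value for both numbers.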
- Demultiplex the reads processed with Porechop and filtered with Filtlong, keeping only single-molecule concatemers that contain at least a minimum number of repetitive units. This produces a folder of individual FASTQ files, each containing the multiple repetitive units used for error correction in the next step.
python2 ./qfilesplitterV3.1.py -i [filtlong.output] -o [output path] -b [min. number of repetitive units]
Usage: python qfilespliter.py [Arguments]
Arguments:
  -i  input file
  -o  output path
  -b  blocks size cutoff [optional]
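After demultiplexing, it can be useful to confirm how many repetitive units each output file holds. The following is a minimal sketch that only inspects the output folder and does not reproduce the splitting script. It assumes the files end in `.fastq` and use four lines per record; `units_per_file` and `files_below_cutoff` are hypothetical helpers.

```python
import glob
import os


def units_per_file(output_dir, pattern="*.fastq"):
    """Map each FASTQ file name in output_dir to its number of records."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, pattern))):
        with open(path) as handle:
            line_count = sum(1 for _ in handle)
        # Four lines per FASTQ record (assumption stated above).
        counts[os.path.basename(path)] = line_count // 4
    return counts


def files_below_cutoff(counts, min_units):
    """Return the file names whose record count is below min_units."""
    return sorted(name for name, n in counts.items() if n < min_units)
```

If the value passed to `-b` was applied as expected, `files_below_cutoff(units_per_file("[output path]"), min_units)` should return an empty list for the same cutoff.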
- Run parallel instances of minimap2 > racon > medaka to polish each FASTQ file, producing high-accuracy single-molecule sequences. This step has been optimized for a system running 40 threads.
python3 protocolV3.3.py -q [path to output folder from previous step] -r [path to reference sequence] -m r941_min_high_g360
Usage: python protocol.py [Arguments]
Arguments:
  -q         fastq files
  -r         reference
  -n         number of iterations [Default 1]
  -m         model for medaka [Default r941_min_high]
  -noMedaka  if the parameter is present exclude medaka from the process
  -noRacon   if the parameter is present exclude racon from the process
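For orientation, the minimap2 > racon > medaka chain for a single FASTQ file can be sketched as three commands. This is an illustrative sketch only and is not the code in protocolV3.3.py: the helper names, the intermediate file names, and the choice of the `map-ont` preset are assumptions, and the exact options the pipeline script passes to each tool may differ.

```python
import os
import subprocess


def build_polish_steps(reads, reference, workdir,
                       model="r941_min_high_g360", threads=1):
    """Return (command, stdout_path) pairs for polishing one FASTQ file.

    stdout_path is None when the tool writes its own output files.
    """
    paf = os.path.join(workdir, "overlaps.paf")        # hypothetical name
    racon_fasta = os.path.join(workdir, "racon.fasta")  # hypothetical name
    medaka_dir = os.path.join(workdir, "medaka")        # hypothetical name
    threads = str(threads)
    return [
        # Map the repetitive units to the reference (PAF on stdout).
        (["minimap2", "-x", "map-ont", "-t", threads, reference, reads], paf),
        # Polish the reference with the mapped units (FASTA on stdout).
        (["racon", "-t", threads, reads, paf, reference], racon_fasta),
        # Polish the Racon output with Medaka.
        (["medaka_consensus", "-i", reads, "-d", racon_fasta,
          "-o", medaka_dir, "-m", model, "-t", threads], None),
    ]


def run_polish_steps(steps):
    """Run each command in order, redirecting stdout where requested."""
    for command, stdout_path in steps:
        if stdout_path is None:
            subprocess.check_call(command)
        else:
            with open(stdout_path, "w") as out_handle:
                subprocess.check_call(command, stdout=out_handle)
```

Because each FASTQ file is polished independently, several such chains could be run concurrently, for example with `concurrent.futures.ThreadPoolExecutor`, as long as the total thread count stays within what the machine provides.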
- The high-accuracy single-molecule sequences are written to a new directory called "medaka_output" and concatenated into a single medaka_consensus.fasta file within that directory.
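A quick check of the final output is to count the consensus sequences and look at their lengths. The following is a minimal sketch of a FASTA reader, assuming a plain-text FASTA file; `fasta_record_lengths` is a hypothetical helper. It returns a list rather than a dictionary so that records sharing the same name are all kept.

```python
def fasta_record_lengths(fasta_path):
    """Return a list of (record_name, sequence_length) in file order."""
    records = []
    name = None
    length = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if name is not None:
                    records.append((name, length))
                fields = line[1:].split()
                name = fields[0] if fields else ""
                length = 0
            elif name is not None:
                length += len(line)
    if name is not None:
        records.append((name, length))
    return records
```

For example, `len(fasta_record_lengths("medaka_output/medaka_consensus.fasta"))` gives the number of single-molecule sequences in the concatenated file.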
A note about the reference sequence used in Step 5 (minimap2 > racon > medaka polishing): this pipeline is optimized for reference-based alignment. For a de novo approach, the outputs of Step 4 (demultiplexing) can be used with the "medaka smolecule" module, which uses SPOA to generate a reference assembly for each originating FASTQ file (https://github.com/nanoporetech/medaka).
For additional information, help, or bug reports, please send an email to [email protected]
This work was supported by the National Institute of Allergy and Infectious Diseases [U54AI150472 to BET and ALR, P30AI036214-26 to BET, SJL and DMS]; the National Human Genome Research Institute [R01HG009622 to BET]; the Scripps Translational Science Institute [UL1TR001114-03 to BET]; and the University of Texas System Rising STARs Award to ALR.