Cite update #2

Open · wants to merge 11 commits into `master`
**README.md** (95 changes: 73 additions & 22 deletions)
**fast5_fetcher** is a tool for fetching nanopore fast5 files to save time and simplify downstream analysis.


## **fast5_fetcher is now part of SquiggleKit located [here](https://github.com/Psy-Fer/SquiggleKit)**
### Please use and cite SquiggleKit as it is the most up to date


[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)

## Contents

<!--ts-->

- [Background](#background)
- [Requirements](#requirements)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [File structures](#file-structures)
- [1. Raw structure (not preferred)](#1-raw-structure-not-preferred)
- [2. Local basecalled structure](#2-local-basecalled-structure)
- [3. Parallel basecalled structure](#3-parallel-basecalled-structure)
- [Inputs](#inputs)
- [Instructions for use](#instructions-for-use)
- [Quick start](#quick-start)
- [fast5_fetcher.py](#fast5_fetcherpy)
- [Examples](#examples)
- [batch_tater.py](#batch_taterpy)
- [Acknowledgements](#acknowledgements)
- [Cite](#cite)
- [License](#license)
<!--te-->

# Background

Reducing the number of fast5 files per folder in a single experiment was a welcome addition to MinKnow. It also made manual basecalling on a cluster much easier, using array jobs where each folder is basecalled individually, producing its own `sequencing_summary.txt`, `reads.fastq`, and reads folder containing the newly basecalled fast5s. Tarring those fast5 files up into a single archive was needed to keep the sys admins at bay, who complained about our millions of individual files on their drives. This meant that whenever the fast5 files from an experiment, or many experiments, were needed, unpacking them was a significant hurdle in both time and disk space.

**fast5_fetcher** was built to address this bottleneck. By building an index file of the tarballs, and using the `sequencing_summary.txt` file to match readIDs with fast5 filenames, only the fast5 files you need are extracted, either temporarily in a pipeline or permanently, reducing space and simplifying downstream workflows.
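As a rough sketch of how such an index can be built (a minimal illustration, assuming your tarballs sit under a `fast5/` directory and that `name.index.gz` is the file you later pass to `-i`; adapt the paths to your own layout):

```
# List each tarball followed by every fast5 path inside it,
# then compress the listing for use as the index file
for f in fast5/*.tar; do
    echo $f       # the tarball itself
    tar -tf $f    # the files it contains
done | gzip > name.index.gz
```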

# Requirements

Following a self-imposed guideline, most things written here to handle nanopore data, or bioinformatics in general, use as few third-party libraries as possible, aiming for core libraries only, or else include all required files in the package.

In the case of `fast5_fetcher.py` and `batch_tater.py`, only core Python libraries are used, so as long as **Python 2.7+** is present, everything should work with no extra steps. (Python 3 compatibility is coming in the next big update.)

##### Operating system:

There is one catch: everything is written primarily for use with **Linux**. Since **MacOS** runs on Unix, there should be minimal issues so long as the GNU tools are installed (see below). **Windows 10**, however, may require more massaging to work with the new Linux integration.

# Getting Started

Download the repository:

git clone https://github.com/Psy-Fer/fast5_fetcher.git

If using MacOS and you do not already have Homebrew, install it from:

https://brew.sh/

then install gnu-tar with:

brew install gnu-tar
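Note that Homebrew installs GNU tar as `gtar` rather than replacing the system `tar`. If you want it picked up as `tar` on your `PATH` (an optional step, depending on your shell setup), something like this should work:

```
# Put Homebrew's gnu-tar ahead of the BSD tar that ships with MacOS
export PATH="$(brew --prefix)/opt/gnu-tar/libexec/gnubin:$PATH"
```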

### Quick start

Basic use on a local computer
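For example, a minimal local run might look like the following (a sketch; file names are placeholders, and the flags are as described in the full usage below):

```
# Fetch only the fast5s whose readIDs appear in reads.fastq.gz
python fast5_fetcher.py -q reads.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -o fast5_out/
```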
See examples below for use on an **HPC** using **SGE**.

#### Full usage

    usage: fast5_fetcher.py [-h] [-q FASTQ | -p PAF | -f FLAT] [--OSystem OSYSTEM]
                            [-s SEQ_SUM] [-i INDEX] [-o OUTPUT] [-t]
                            [-l TRIM_LIST] [-x PREFIX] [-z]

    fast_fetcher - extraction of specific nanopore fast5 files

    optional arguments:
      -h, --help            show this help message and exit
      -q FASTQ, --fastq FASTQ
                            fastq.gz for read ids
      -p PAF, --paf PAF     paf alignment file for read ids
      -f FLAT, --flat FLAT  flat file of read ids
      --OSystem OSYSTEM     running operating system - leave default unless doing
                            odd stuff
      -s SEQ_SUM, --seq_sum SEQ_SUM
                            sequencing_summary.txt.gz file
      -i INDEX, --index INDEX
                            index.gz file mapping fast5 files in tar archives
      -o OUTPUT, --output OUTPUT
                            output directory for extracted fast5s
      -t, --trim            trim files as if standalone experiment, (fq, SS)
      -l TRIM_LIST, --trim_list TRIM_LIST
                            list of file names to trim, comma separated. fastq
                            only needed for -p and -f modes
      -x PREFIX, --prefix PREFIX
                            trim file prefix, eg: barcode_01, output:
                            barcode_01.fastq, barcode_01_seq_sum.txt
      -z, --pppp            Print out tar commands in batches for further
                            processing

## Examples

CMD="qsub -cwd -V -pe smp 1 -N F5F -S /bin/bash -t 1-12 -l mem_requested=20G,h_v
echo $CMD && $CMD
```

## Trimming fastq and sequencing_summary files

By using the `-t, --trim` option, each barcode will also have its own sequencing_summary file for downstream analysis. This is particularly useful when each barcode is a different sample or experiment, as the output is as if each barcode were its own individual flowcell.

This method can also trim the fastq and sequencing_summary files when using the **paf** or **flat** methods. Use the prefix option to label the output names; otherwise generic defaults will be used.
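As a sketch (file names illustrative), trimming the outputs for a single barcode with the **flat** method might look like:

```
# -t trims the fq/SS outputs, -l names the fastq to trim, -x sets the output prefix
python fast5_fetcher.py -f barcode_01_ids.txt -s sequencing_summary.txt.gz -i name.index.gz \
    -o barcode_01/ -t -l barcode_01.fastq -x barcode_01
```

With `-x barcode_01`, the trimmed outputs would be named `barcode_01.fastq` and `barcode_01_seq_sum.txt`, per the usage above.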

## batch_tater.py

Potato scripting engaged

This is designed to run on the output files from `fast5_fetcher.py` when using option `-z`, which writes out file lists for each tarball containing reads you want to process. `batch_tater.py` then reads those lists, opens each individual tar file, and extracts the listed reads, meaning every tarball is only opened once.
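In outline, the two-step workflow looks something like this (a sketch; the file-list and master file names here are assumptions based on the SGE example below, and will depend on your run):

```
# Step 1: write per-tarball file lists instead of extracting directly
python fast5_fetcher.py -q reads.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -z

# Step 2: for each file list, open its tarball once and extract everything in it
python batch_tater.py tater_master.txt fast5/my_batch.f5list fast5_out/
```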

A recent test using the `-z` option on ~2.2Tb of data, across ~11/27 million files, took about 10min (1 CPU) to write and organise the file lists with fast5_fetcher.py, and about 20s per array job to extract and repackage with batch_tater.py.

This is best used when you want to do something all at once and filter your reads. Other approaches may be better when you are demultiplexing.

BLAH=fast5/${FILE}

mkdir ${TMPDIR}/fast5

time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/

echo "size of files:" >&2
du -shc ${TMPDIR}/fast5/ >&2
echo $CMD && $CMD

## Acknowledgements

I would like to thank the rest of my lab (Shaun Carswell, Kirston Barton, Kai Martin) in the Genomic Technologies team at the [Garvan Institute](https://www.garvan.org.au/) for their feedback on the development of this tool.

## Cite

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)

James M. Ferguson, & Martin A. Smith. (2018, September 12). Psy-Fer/fast5_fetcher: Initial release of fast5_fetcher (Version v1.0). Zenodo. <http://doi.org/10.5281/zenodo.1413903>

## License

**batch_tater.py** (30 changes: 20 additions & 10 deletions)

mkdir ${TMPDIR}/fast5

time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/

echo "size of files:" >&2
du -shc ${TMPDIR}/fast5/ >&2
Launch:

echo $CMD && $CMD


stats:

fastq: 27491304
mapped: 11740093
z mode time: 10min
batch_tater total time: 21min
per job time: ~28s
number of CPUs: 100
'''

# being lazy and using sys.argv...i mean, it is pretty lit
master = sys.argv[1]
tar_list = sys.argv[2]
save_path = sys.argv[3]

# this will probs need to be changed based on naming convention
# I think i was a little tired when I wrote this
list_name = tar_list.split('/')[-1]

PATH = 0

# not elegant, but gets it done
with open(master, 'r') as f:
    for l in f:
        l = l.strip('\n')
        l = l.split('\t')
        if l[0] == list_name:
            PATH = l[1]
            break

# for stats later and easy job relaunching
print >> sys.stderr, "extracting:", tar_list

# do the thing. That --transform hack is awesome. Blows away all the leading folders.
if PATH:
    cmd = "tar -xf {} --transform='s/.*\///' -C {} -T {}".format(
        # ... (format arguments and command execution not shown in this diff)

else:
    print >> sys.stderr, "PATH not found! check index nooblet"
    print >> sys.stderr, "inputs:", master, tar_list, save_path