Cite update #2

Open · wants to merge 11 commits into `master`
**README.md** (95 changes: 73 additions & 22 deletions)
**fast5_fetcher** is a tool for fetching nanopore fast5 files to save time and simplify downstream analysis.


## **fast5_fetcher is now part of SquiggleKit located [here](https://github.com/Psy-Fer/SquiggleKit)**
### Please use and cite SquiggleKit as it is the most up to date


[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)

## Contents

<!--ts-->

- [Background](#background)
- [Requirements](#requirements)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [File structures](#file-structures)
- [1. Raw structure (not preferred)](#1-raw-structure-not-preferred)
- [2. Local basecalled structure](#2-local-basecalled-structure)
- [3. Parallel basecalled structure](#3-parallel-basecalled-structure)
- [Inputs](#inputs)
- [Instructions for use](#instructions-for-use)
- [Quick start](#quick-start)
- [fast5_fetcher.py](#fast5_fetcherpy)
- [Examples](#examples)
- [batch_tater.py](#batch_taterpy)
- [Acknowledgements](#acknowledgements)
- [Cite](#cite)
- [License](#license)
<!--te-->

# Background

Reducing the number of fast5 files per folder in a single experiment was a welcome addition to MinKnow. It also made manual basecalling on a cluster much easier, using array jobs where each folder is basecalled individually, producing its own `sequencing_summary.txt`, `reads.fastq`, and reads folder containing the newly basecalled fast5s. Tarring those fast5 files up into a single archive was needed to keep the sys admins at bay, who complained about our millions of individual files on their drives. This meant that whenever the fast5 files from an experiment, or many experiments, were needed, unpacking them was a significant hurdle in both time and disk space.

**fast5_fetcher** was built to address this bottleneck. By building an index file of the tarballs, and using the `sequencing_summary.txt` file to match readIDs with fast5 filenames, only the fast5 files you need are extracted, either temporarily in a pipeline or permanently, reducing space and simplifying downstream workflows.
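As a rough sketch of how such an index can be built (a minimal illustration, assuming your tarballs sit under a `fast5/` directory and that `name.index.gz` is the file you later pass to `-i`; adapt the paths to your own layout):

```
# List each tarball followed by every fast5 path inside it,
# then compress the listing for use as the index file
for f in fast5/*.tar; do
    echo $f       # the tarball itself
    tar -tf $f    # the files it contains
done | gzip > name.index.gz
```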

# Requirements

Following a self-imposed guideline, most things written here to handle nanopore data, or bioinformatics in general, use as few third-party libraries as possible, aiming for core libraries only, or else include all required files in the package.

In the case of `fast5_fetcher.py` and `batch_tater.py`, only core Python libraries are used, so as long as **Python 2.7+** is present, everything should work with no extra steps. (Python 3 compatibility is coming in the next big update.)

##### Operating system:

There is one catch: everything is written primarily for use with **Linux**. Since **MacOS** runs on Unix, there should be minimal issues so long as the GNU tools are installed (see below). **Windows 10**, however, may require more massaging to work with the new Linux integration.

# Getting Started

Download the repository:

git clone https://github.com/Psy-Fer/fast5_fetcher.git

If using MacOS and you do not already have Homebrew, install it from:

https://brew.sh/

then install gnu-tar with:

brew install gnu-tar
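Note that Homebrew installs GNU tar as `gtar` rather than replacing the system `tar`. If you want it picked up as `tar` on your `PATH` (an optional step, depending on your shell setup), something like this should work:

```
# Put Homebrew's gnu-tar ahead of the BSD tar that ships with MacOS
export PATH="$(brew --prefix)/opt/gnu-tar/libexec/gnubin:$PATH"
```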

### Quick start

Basic use on a local computer
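For example, a minimal local run might look like the following (a sketch; file names are placeholders, and the flags are as described in the full usage below):

```
# Fetch only the fast5s whose readIDs appear in reads.fastq.gz
python fast5_fetcher.py -q reads.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -o fast5_out/
```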
See examples below for use on an **HPC** using **SGE**.

#### Full usage

    usage: fast5_fetcher.py [-h] [-q FASTQ | -p PAF | -f FLAT] [--OSystem OSYSTEM]
                            [-s SEQ_SUM] [-i INDEX] [-o OUTPUT] [-t]
                            [-l TRIM_LIST] [-x PREFIX] [-z]

    fast_fetcher - extraction of specific nanopore fast5 files

    optional arguments:
      -h, --help            show this help message and exit
      -q FASTQ, --fastq FASTQ
                            fastq.gz for read ids
      -p PAF, --paf PAF     paf alignment file for read ids
      -f FLAT, --flat FLAT  flat file of read ids
      --OSystem OSYSTEM     running operating system - leave default unless doing
                            odd stuff
      -s SEQ_SUM, --seq_sum SEQ_SUM
                            sequencing_summary.txt.gz file
      -i INDEX, --index INDEX
                            index.gz file mapping fast5 files in tar archives
      -o OUTPUT, --output OUTPUT
                            output directory for extracted fast5s
      -t, --trim            trim files as if standalone experiment, (fq, SS)
      -l TRIM_LIST, --trim_list TRIM_LIST
                            list of file names to trim, comma separated. fastq
                            only needed for -p and -f modes
      -x PREFIX, --prefix PREFIX
                            trim file prefix, eg: barcode_01, output:
                            barcode_01.fastq, barcode_01_seq_sum.txt
      -z, --pppp            Print out tar commands in batches for further
                            processing

## Examples

CMD="qsub -cwd -V -pe smp 1 -N F5F -S /bin/bash -t 1-12 -l mem_requested=20G,h_v
echo $CMD && $CMD
```

## Trimming fastq and sequencing_summary files

By using the `-t, --trim` option, each barcode will also have its own sequencing_summary file for downstream analysis. This is particularly useful when each barcode is a different sample or experiment, as the output is as if each barcode were its own individual flowcell.

This method can also trim the fastq and sequencing_summary files when using the **paf** or **flat** methods. Use the prefix option to label the output names; otherwise generic defaults will be used.
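As a sketch (file names illustrative), trimming the outputs for a single barcode with the **flat** method might look like:

```
# -t trims the fq/SS outputs, -l names the fastq to trim, -x sets the output prefix
python fast5_fetcher.py -f barcode_01_ids.txt -s sequencing_summary.txt.gz -i name.index.gz \
    -o barcode_01/ -t -l barcode_01.fastq -x barcode_01
```

With `-x barcode_01`, the trimmed outputs would be named `barcode_01.fastq` and `barcode_01_seq_sum.txt`, per the usage above.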

## batch_tater.py

Potato scripting engaged

This is designed to run on the output files from `fast5_fetcher.py` when using option `-z`, which writes out file lists for each tarball containing reads you want to process. `batch_tater.py` then reads those lists, opens each individual tar file, and extracts the listed reads, meaning every tarball is only opened once.
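In outline, the two-step workflow looks something like this (a sketch; the file-list and master file names here are assumptions based on the SGE example below, and will depend on your run):

```
# Step 1: write per-tarball file lists instead of extracting directly
python fast5_fetcher.py -q reads.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -z

# Step 2: for each file list, open its tarball once and extract everything in it
python batch_tater.py tater_master.txt fast5/my_batch.f5list fast5_out/
```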

A recent test using the `-z` option on ~2.2Tb of data, across ~11/27 million files, took about 10min (1 CPU) to write and organise the file lists with fast5_fetcher.py, and about 20s per array job to extract and repackage with batch_tater.py.

This is best used when you want to do something all at once and filter your reads. Other approaches may be better when you are demultiplexing.

BLAH=fast5/${FILE}

mkdir ${TMPDIR}/fast5

time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/

echo "size of files:" >&2
du -shc ${TMPDIR}/fast5/ >&2
echo $CMD && $CMD

## Acknowledgements

I would like to thank the rest of my lab (Shaun Carswell, Kirston Barton, Kai Martin) in the Genomic Technologies team at the [Garvan Institute](https://www.garvan.org.au/) for their feedback on the development of this tool.

## Cite

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)

James M. Ferguson, & Martin A. Smith. (2018, September 12). Psy-Fer/fast5_fetcher: Initial release of fast5_fetcher (Version v1.0). Zenodo. <http://doi.org/10.5281/zenodo.1413903>

## License

**batch_tater.py** (30 changes: 20 additions & 10 deletions)

mkdir ${TMPDIR}/fast5

time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/

echo "size of files:" >&2
du -shc ${TMPDIR}/fast5/ >&2
Launch:

echo $CMD && $CMD


stats:

fastq: 27491304
mapped: 11740093
z mode time: 10min
batch_tater total time: 21min
per job time: ~28s
number of CPUs: 100
'''

# being lazy and using sys.argv...i mean, it is pretty lit
master = sys.argv[1]
tar_list = sys.argv[2]
save_path = sys.argv[3]

# this will probs need to be changed based on naming convention
# I think i was a little tired when I wrote this
list_name = tar_list.split('/')[-1]

PATH = 0

# not elegant, but gets it done
with open(master, 'r') as f:
    for l in f:
        l = l.strip('\n')
        l = l.split('\t')
        if l[0] == list_name:
            PATH = l[1]
            break

# for stats later and easy job relaunching
print >> sys.stderr, "extracting:", tar_list

# do the thing. That --transform hack is awesome. Blows away all the leading folders.
if PATH:
    cmd = "tar -xf {} --transform='s/.*\///' -C {} -T {}".format(
        # ... (format arguments and command execution not shown in this diff)

else:
    print >> sys.stderr, "PATH not found! check index nooblet"
    print >> sys.stderr, "inputs:", master, tar_list, save_path