DOC Better AEMB docs

Include an all by all script
BigDataBiology · Oct 17, 2024 · 4b914a2 · 4b914a2
1 parent 604b6f8
commit 4b914a2
Show file tree

Hide file tree

Showing 3 changed files with 120 additions and 25 deletions.
diff --git a/docs/aemb.md b/docs/aemb.md
@@ -0,0 +1,118 @@
+# Running SemiBin with strobealign-aemb
+
+Strobealign-aemb is a fast abundance estimation method for metagenomic binning available [strobelign](https://github.com/ksahlin/strobealign) version 0.13 (or newer).
+For technical reasons, we currently (as of SemiBin 2.1) can only use this mode with a minimum of 5 samples.
+
+## Preparing the data for AEMB
+
+You will need to generate abundance files for each sample in an _all-by-all_ manner.
+This means that each sample will be mapped to the contigs of the other samples.
+
+We will illustrate the process with a single sample, but this **must be repeated for all samples**.
+
+1. Split the fasta files using the `split_contigs` subcommand:
+```bash
+mkdir -p aemb_output/sample1
+SemiBin2 split_contigs -i sample1_contigs.fna.gz -o aemb_output/sample1
+```
+
+2. Map reads using [strobealign-aemb](https://github.com/ksahlin/strobealign) to generate the abundance information. Note that version 0.13 (or newer) is required
+```bash
+strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read1.pair.1.fq.gz read1.pair.2.fq.gz -R 6 -t 8 > sample1_sample1.tsv
+strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read2.pair.1.fq.gz read2.pair.2.fq.gz -R 6 -t 8 > sample1_sample2.tsv
+strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read3.pair.1.fq.gz read3.pair.2.fq.gz -R 6 -t 8 > sample1_sample3.tsv
+strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read4.pair.1.fq.gz read4.pair.2.fq.gz -R 6 -t 8 > sample1_sample4.tsv
+strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read5.pair.1.fq.gz read5.pair.2.fq.gz -R 6 -t 8 > sample1_sample5.tsv
+```
+
+## Running SemiBin2
+
+Run SemiBin2 (this same as running BAM files, except using `-a` to pass in the abundance files instead of `-b` to pass in BAM/SAM):
+
+```bash
+SemiBin2 single_easy_bin -i contig.fa -a sample1_*.tsv -o aemb_output/sample1
+```
+
+This will generate the bins in the `aemb_output/sample1` directory.
+
+## A helper script to run all-by-all abundance estimates
+
+Here is a helper script using [Jug](https://jug.readthedocs.io/en/latest/) to automate the process for all samples.
+It is helpful if you are using Jug to parallelize the process, but if you remove the `@TaskGenerator` decorators, it will run sequentially.
+
+It expects the following directory structure:
+
+- `samples/` containing the assembled contigs in the format `sample1_assembled.fna.gz`, `sample2_assembled.fna.gz`, ...
+- `clean-reads/` containing the reads in the format `sample1.pair.1.fq.gz`, `sample1.pair.2.fq.gz`, `sample2.pair.1.fq.gz`, `sample2.pair.2.fq.gz`, ...
+- `aemb_output/` where the output will be written
+
+```python
+from jug import TaskGenerator
+from jug.utils import jug_execute
+import subprocess
+from os import makedirs, path
+import yaml
+
+samples = [
+        'sample0',
+        'sample1',
+        'sample2',
+        'sample3',
+        'sample4',
+        'sample5',
+        'sample6',
+        'sample7',
+        ]
+
+STROBEALIGN_THREADS = 8
+SEMIBIN_THREADS = STROBEALIGN_THREADS
+
+@TaskGenerator
+def generate_inputs(s):
+    contigs = f'samples/{s}_assembled.fna.gz'
+    if not path.exists(contigs):
+        raise IOError(f'Expected contig file {f} (for sample {s})')
+    out = f'aemb_output/{s}'
+    makedirs(out, exist_ok=True)
+    subprocess.check_call(
+            ['SemiBin2', 'split_contigs',
+             '-i', contigs,
+             '-o', out])
+    return out
+
+@TaskGenerator
+def cross_map(ref_out, ref_s, s):
+    f1 = f'clean-reads/{s}.pair.1.fq.gz'
+    f2 = f'clean-reads/{s}.pair.2.fq.gz'
+    if not path.exists(f1):
+        raise IOError(f'Expected reads file {f1} (for sample {s})')
+    if not path.exists(f2):
+        raise IOError(f'Expected reads file {f2} (for sample {s}). Note that {f1} does exist!')
+    ofile = f'aemb_output/{ref_s}/mapped_{s}.tsv'
+    with open(ofile, 'wb') as out:
+        subprocess.check_call(
+            ['strobealign',
+                '--aemb', f'{ref_out}/split_contigs.fna.gz',
+                f1, f2,
+                '-t', str(STROBEALIGN_THREADS),
+                '-R', '6'],
+            stdout=out)
+    return ofile
+
+
+for s in samples:
+    out = generate_inputs(s)
+    tsv = []
+    for s2 in samples:
+        tsv.append(cross_map(out, s, s2))
+
+    sb = jug_execute(
+        ['SemiBin2', 'single_easy_bin',
+            '--threads', str(SEMIBIN_THREADS),
+            '-i', f'samples/{s}_assembled.fna.gz',
+            '-a'] + tsv + [
+                '-o', f'aemb_output/{s}'])
+
+```
+
+
diff --git a/docs/usage.md b/docs/usage.md
@@ -422,28 +422,4 @@ See the comment above about how you can bypass most of the computation if you ha
 
 ## Running SemiBin with strobealign-aemb
 
-Strobealign-aemb is a fast abundance estimation method for metagenomic binning. 
-As strobealign-aemb can not provide the mapping information for every position of the contig, so we can not run SemiBin2 with strobealign-aemb in binning modes where samples used smaller 5 and need to split the contigs to generate the must-link constratints. 
-
-
-1. Split the fasta files using the `split_contigs` subcommand:
-```bash
-SemiBin2 split_contigs -i contig.fa -o output
-```
-2. Map reads using [strobealign-aemb](https://github.com/ksahlin/strobealign) to generate the abundance information. Note that version 0.13 (or newer) is required
-```bash
-strobealign --aemb output/split_contigs.fna.gz read1_1.fq read1_2.fq -R 6 > sample1.tsv
-strobealign --aemb output/split_contigs.fna.gz read2_1.fq read2_2.fq -R 6 > sample2.tsv
-strobealign --aemb output/split_contigs.fna.gz read3_1.fq read3_2.fq -R 6 > sample3.tsv
-strobealign --aemb output/split_contigs.fna.gz read4_1.fq read4_2.fq -R 6 > sample4.tsv
-strobealign --aemb output/split_contigs.fna.gz read5_1.fq read5_2.fq -R 6 > sample5.tsv
-```
-3. Run SemiBin2 (same as running SemiBin with BAM files, except using `-a` to pass in the abundance files instead of `-b` to pass in BAM/SAM):
-
-```bash
-SemiBin2 generate_sequence_features_single -i contig.fa -a *.tsv -o output
-SemiBin2 generate_sequence_features_multi -i contig.fa -a *.tsv -s : -o output
-SemiBin2 single_easy_bin -i contig.fa -a *.tsv -o output
-SemiBin2 multi_easy_bin i contig.fa -a *.tsv -s : -o output
-```
-
+This has its own [dedicated page](aemb).
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -15,6 +15,7 @@ nav:
         - 'SemiBin2': semibin2.md
         - 'Generating inputs to SemiBin': generate.md
         - 'Output': output.md
+        - 'Running AEMB': aemb.md
         - "Subcommand reference": subcommands.md
         - "What's New": whatsnew.md
         - "Training SemiBin models": training.md