Skip to content

Commit

Permalink
DOC Better AEMB docs
Browse files Browse the repository at this point in the history
Include an all by all script
  • Loading branch information
luispedro committed Oct 17, 2024
1 parent 604b6f8 commit 4b914a2
Show file tree
Hide file tree
Showing 3 changed files with 120 additions and 25 deletions.
118 changes: 118 additions & 0 deletions docs/aemb.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
# Running SemiBin with strobealign-aemb

Strobealign-aemb is a fast abundance estimation method for metagenomic binning available [strobelign](https://github.com/ksahlin/strobealign) version 0.13 (or newer).
For technical reasons, we currently (as of SemiBin 2.1) can only use this mode with a minimum of 5 samples.

## Preparing the data for AEMB

You will need to generate abundance files for each sample in an _all-by-all_ manner.
This means that each sample will be mapped to the contigs of the other samples.

We will illustrate the process with a single sample, but this **must be repeated for all samples**.

1. Split the fasta files using the `split_contigs` subcommand:
```bash
mkdir -p aemb_output/sample1
SemiBin2 split_contigs -i sample1_contigs.fna.gz -o aemb_output/sample1
```

2. Map reads using [strobealign-aemb](https://github.com/ksahlin/strobealign) to generate the abundance information. Note that version 0.13 (or newer) is required
```bash
strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read1.pair.1.fq.gz read1.pair.2.fq.gz -R 6 -t 8 > sample1_sample1.tsv
strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read2.pair.1.fq.gz read2.pair.2.fq.gz -R 6 -t 8 > sample1_sample2.tsv
strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read3.pair.1.fq.gz read3.pair.2.fq.gz -R 6 -t 8 > sample1_sample3.tsv
strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read4.pair.1.fq.gz read4.pair.2.fq.gz -R 6 -t 8 > sample1_sample4.tsv
strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read5.pair.1.fq.gz read5.pair.2.fq.gz -R 6 -t 8 > sample1_sample5.tsv
```

## Running SemiBin2

Run SemiBin2 (this same as running BAM files, except using `-a` to pass in the abundance files instead of `-b` to pass in BAM/SAM):

```bash
SemiBin2 single_easy_bin -i contig.fa -a sample1_*.tsv -o aemb_output/sample1
```

This will generate the bins in the `aemb_output/sample1` directory.

## A helper script to run all-by-all abundance estimates

Here is a helper script using [Jug](https://jug.readthedocs.io/en/latest/) to automate the process for all samples.
It is helpful if you are using Jug to parallelize the process, but if you remove the `@TaskGenerator` decorators, it will run sequentially.

It expects the following directory structure:

- `samples/` containing the assembled contigs in the format `sample1_assembled.fna.gz`, `sample2_assembled.fna.gz`, ...
- `clean-reads/` containing the reads in the format `sample1.pair.1.fq.gz`, `sample1.pair.2.fq.gz`, `sample2.pair.1.fq.gz`, `sample2.pair.2.fq.gz`, ...
- `aemb_output/` where the output will be written

```python
from jug import TaskGenerator
from jug.utils import jug_execute
import subprocess
from os import makedirs, path
import yaml

samples = [
'sample0',
'sample1',
'sample2',
'sample3',
'sample4',
'sample5',
'sample6',
'sample7',
]

STROBEALIGN_THREADS = 8
SEMIBIN_THREADS = STROBEALIGN_THREADS

@TaskGenerator
def generate_inputs(s):
contigs = f'samples/{s}_assembled.fna.gz'
if not path.exists(contigs):
raise IOError(f'Expected contig file {f} (for sample {s})')
out = f'aemb_output/{s}'
makedirs(out, exist_ok=True)
subprocess.check_call(
['SemiBin2', 'split_contigs',
'-i', contigs,
'-o', out])
return out

@TaskGenerator
def cross_map(ref_out, ref_s, s):
f1 = f'clean-reads/{s}.pair.1.fq.gz'
f2 = f'clean-reads/{s}.pair.2.fq.gz'
if not path.exists(f1):
raise IOError(f'Expected reads file {f1} (for sample {s})')
if not path.exists(f2):
raise IOError(f'Expected reads file {f2} (for sample {s}). Note that {f1} does exist!')
ofile = f'aemb_output/{ref_s}/mapped_{s}.tsv'
with open(ofile, 'wb') as out:
subprocess.check_call(
['strobealign',
'--aemb', f'{ref_out}/split_contigs.fna.gz',
f1, f2,
'-t', str(STROBEALIGN_THREADS),
'-R', '6'],
stdout=out)
return ofile


for s in samples:
out = generate_inputs(s)
tsv = []
for s2 in samples:
tsv.append(cross_map(out, s, s2))

sb = jug_execute(
['SemiBin2', 'single_easy_bin',
'--threads', str(SEMIBIN_THREADS),
'-i', f'samples/{s}_assembled.fna.gz',
'-a'] + tsv + [
'-o', f'aemb_output/{s}'])

```


26 changes: 1 addition & 25 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -422,28 +422,4 @@ See the comment above about how you can bypass most of the computation if you ha

## Running SemiBin with strobealign-aemb

Strobealign-aemb is a fast abundance estimation method for metagenomic binning.
As strobealign-aemb can not provide the mapping information for every position of the contig, so we can not run SemiBin2 with strobealign-aemb in binning modes where samples used smaller 5 and need to split the contigs to generate the must-link constratints.


1. Split the fasta files using the `split_contigs` subcommand:
```bash
SemiBin2 split_contigs -i contig.fa -o output
```
2. Map reads using [strobealign-aemb](https://github.com/ksahlin/strobealign) to generate the abundance information. Note that version 0.13 (or newer) is required
```bash
strobealign --aemb output/split_contigs.fna.gz read1_1.fq read1_2.fq -R 6 > sample1.tsv
strobealign --aemb output/split_contigs.fna.gz read2_1.fq read2_2.fq -R 6 > sample2.tsv
strobealign --aemb output/split_contigs.fna.gz read3_1.fq read3_2.fq -R 6 > sample3.tsv
strobealign --aemb output/split_contigs.fna.gz read4_1.fq read4_2.fq -R 6 > sample4.tsv
strobealign --aemb output/split_contigs.fna.gz read5_1.fq read5_2.fq -R 6 > sample5.tsv
```
3. Run SemiBin2 (same as running SemiBin with BAM files, except using `-a` to pass in the abundance files instead of `-b` to pass in BAM/SAM):

```bash
SemiBin2 generate_sequence_features_single -i contig.fa -a *.tsv -o output
SemiBin2 generate_sequence_features_multi -i contig.fa -a *.tsv -s : -o output
SemiBin2 single_easy_bin -i contig.fa -a *.tsv -o output
SemiBin2 multi_easy_bin i contig.fa -a *.tsv -s : -o output
```

This has its own [dedicated page](aemb).
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ nav:
- 'SemiBin2': semibin2.md
- 'Generating inputs to SemiBin': generate.md
- 'Output': output.md
- 'Running AEMB': aemb.md
- "Subcommand reference": subcommands.md
- "What's New": whatsnew.md
- "Training SemiBin models": training.md
Expand Down

0 comments on commit 4b914a2

Please sign in to comment.