diff --git a/docs/aemb.md b/docs/aemb.md new file mode 100644 index 0000000..60553f4 --- /dev/null +++ b/docs/aemb.md @@ -0,0 +1,118 @@ +# Running SemiBin with strobealign-aemb + +Strobealign-aemb is a fast abundance estimation method for metagenomic binning available [strobelign](https://github.com/ksahlin/strobealign) version 0.13 (or newer). +For technical reasons, we currently (as of SemiBin 2.1) can only use this mode with a minimum of 5 samples. + +## Preparing the data for AEMB + +You will need to generate abundance files for each sample in an _all-by-all_ manner. +This means that each sample will be mapped to the contigs of the other samples. + +We will illustrate the process with a single sample, but this **must be repeated for all samples**. + +1. Split the fasta files using the `split_contigs` subcommand: +```bash +mkdir -p aemb_output/sample1 +SemiBin2 split_contigs -i sample1_contigs.fna.gz -o aemb_output/sample1 +``` + +2. Map reads using [strobealign-aemb](https://github.com/ksahlin/strobealign) to generate the abundance information. Note that version 0.13 (or newer) is required +```bash +strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read1.pair.1.fq.gz read1.pair.2.fq.gz -R 6 -t 8 > sample1_sample1.tsv +strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read2.pair.1.fq.gz read2.pair.2.fq.gz -R 6 -t 8 > sample1_sample2.tsv +strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read3.pair.1.fq.gz read3.pair.2.fq.gz -R 6 -t 8 > sample1_sample3.tsv +strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read4.pair.1.fq.gz read4.pair.2.fq.gz -R 6 -t 8 > sample1_sample4.tsv +strobealign --aemb aemb_output/sample1/split_contigs.fna.gz read5.pair.1.fq.gz read5.pair.2.fq.gz -R 6 -t 8 > sample1_sample5.tsv +``` + +## Running SemiBin2 + +Run SemiBin2 (this same as running BAM files, except using `-a` to pass in the abundance files instead of `-b` to pass in BAM/SAM): + +```bash +SemiBin2 single_easy_bin -i contig.fa -a sample1_*.tsv -o aemb_output/sample1 +``` + +This will generate the bins in the `aemb_output/sample1` directory. + +## A helper script to run all-by-all abundance estimates + +Here is a helper script using [Jug](https://jug.readthedocs.io/en/latest/) to automate the process for all samples. +It is helpful if you are using Jug to parallelize the process, but if you remove the `@TaskGenerator` decorators, it will run sequentially. + +It expects the following directory structure: + +- `samples/` containing the assembled contigs in the format `sample1_assembled.fna.gz`, `sample2_assembled.fna.gz`, ... +- `clean-reads/` containing the reads in the format `sample1.pair.1.fq.gz`, `sample1.pair.2.fq.gz`, `sample2.pair.1.fq.gz`, `sample2.pair.2.fq.gz`, ... +- `aemb_output/` where the output will be written + +```python +from jug import TaskGenerator +from jug.utils import jug_execute +import subprocess +from os import makedirs, path +import yaml + +samples = [ + 'sample0', + 'sample1', + 'sample2', + 'sample3', + 'sample4', + 'sample5', + 'sample6', + 'sample7', + ] + +STROBEALIGN_THREADS = 8 +SEMIBIN_THREADS = STROBEALIGN_THREADS + +@TaskGenerator +def generate_inputs(s): + contigs = f'samples/{s}_assembled.fna.gz' + if not path.exists(contigs): + raise IOError(f'Expected contig file {f} (for sample {s})') + out = f'aemb_output/{s}' + makedirs(out, exist_ok=True) + subprocess.check_call( + ['SemiBin2', 'split_contigs', + '-i', contigs, + '-o', out]) + return out + +@TaskGenerator +def cross_map(ref_out, ref_s, s): + f1 = f'clean-reads/{s}.pair.1.fq.gz' + f2 = f'clean-reads/{s}.pair.2.fq.gz' + if not path.exists(f1): + raise IOError(f'Expected reads file {f1} (for sample {s})') + if not path.exists(f2): + raise IOError(f'Expected reads file {f2} (for sample {s}). Note that {f1} does exist!') + ofile = f'aemb_output/{ref_s}/mapped_{s}.tsv' + with open(ofile, 'wb') as out: + subprocess.check_call( + ['strobealign', + '--aemb', f'{ref_out}/split_contigs.fna.gz', + f1, f2, + '-t', str(STROBEALIGN_THREADS), + '-R', '6'], + stdout=out) + return ofile + + +for s in samples: + out = generate_inputs(s) + tsv = [] + for s2 in samples: + tsv.append(cross_map(out, s, s2)) + + sb = jug_execute( + ['SemiBin2', 'single_easy_bin', + '--threads', str(SEMIBIN_THREADS), + '-i', f'samples/{s}_assembled.fna.gz', + '-a'] + tsv + [ + '-o', f'aemb_output/{s}']) + +``` + + diff --git a/docs/usage.md b/docs/usage.md index dc98766..a0e1ea6 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -422,28 +422,4 @@ See the comment above about how you can bypass most of the computation if you ha ## Running SemiBin with strobealign-aemb -Strobealign-aemb is a fast abundance estimation method for metagenomic binning. -As strobealign-aemb can not provide the mapping information for every position of the contig, so we can not run SemiBin2 with strobealign-aemb in binning modes where samples used smaller 5 and need to split the contigs to generate the must-link constratints. - - -1. Split the fasta files using the `split_contigs` subcommand: -```bash -SemiBin2 split_contigs -i contig.fa -o output -``` -2. Map reads using [strobealign-aemb](https://github.com/ksahlin/strobealign) to generate the abundance information. Note that version 0.13 (or newer) is required -```bash -strobealign --aemb output/split_contigs.fna.gz read1_1.fq read1_2.fq -R 6 > sample1.tsv -strobealign --aemb output/split_contigs.fna.gz read2_1.fq read2_2.fq -R 6 > sample2.tsv -strobealign --aemb output/split_contigs.fna.gz read3_1.fq read3_2.fq -R 6 > sample3.tsv -strobealign --aemb output/split_contigs.fna.gz read4_1.fq read4_2.fq -R 6 > sample4.tsv -strobealign --aemb output/split_contigs.fna.gz read5_1.fq read5_2.fq -R 6 > sample5.tsv -``` -3. Run SemiBin2 (same as running SemiBin with BAM files, except using `-a` to pass in the abundance files instead of `-b` to pass in BAM/SAM): - -```bash -SemiBin2 generate_sequence_features_single -i contig.fa -a *.tsv -o output -SemiBin2 generate_sequence_features_multi -i contig.fa -a *.tsv -s : -o output -SemiBin2 single_easy_bin -i contig.fa -a *.tsv -o output -SemiBin2 multi_easy_bin i contig.fa -a *.tsv -s : -o output -``` - +This has its own [dedicated page](aemb). diff --git a/mkdocs.yml b/mkdocs.yml index 9471877..be6a654 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -15,6 +15,7 @@ nav: - 'SemiBin2': semibin2.md - 'Generating inputs to SemiBin': generate.md - 'Output': output.md + - 'Running AEMB': aemb.md - "Subcommand reference": subcommands.md - "What's New": whatsnew.md - "Training SemiBin models": training.md