DOC Update docs for SemiBin2
luispedro committed Oct 19, 2023
1 parent 537e34c commit 07ddb06
Showing 5 changed files with 119 additions and 113 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -132,7 +132,7 @@ The `single_easy_bin` command can be used to produce results in a single step.
For example:

```bash
-SemiBin \
+SemiBin2 \
single_easy_bin \
--input-fasta contig.fa \
--input-bam mapped_reads.sorted.bam \
@@ -143,7 +143,7 @@ SemiBin \
Alternatively, you can train a new model for that sample, by not passing in the `--environment` flag:

```bash
-SemiBin \
+SemiBin2 \
single_easy_bin \
--input-fasta contig.fa \
--input-bam mapped_reads.sorted.bam \
@@ -206,15 +206,15 @@ CAAATACGAATGATTCTTTATTAGATTATCTTAATAAGAATATC
You can use this to get the combined contig:

```bash
-SemiBin concatenate_fasta -i contig*.fa -o output
+SemiBin2 concatenate_fasta -i contig*.fa -o output
```

If either the sample or the contig names use the default separator (`:`), you will need to change it with the `--separator`/`-s` argument.
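
Whether the default separator clashes with existing contig names can be checked before concatenating. The following is a minimal sketch, assuming standard FASTA files whose header lines start with `>`; `has_separator` is a hypothetical helper (not part of SemiBin), and the file names in the usage comment are placeholders.

```bash
# has_separator FILE SEP
# Succeeds (exit status 0) if any FASTA header line in FILE contains SEP.
# Hypothetical helper for illustration; it is not provided by SemiBin.
has_separator() {
    # Header lines start with '>'; -F makes grep treat SEP as a literal string.
    grep '^>' "$1" | grep -qF -- "$2"
}

# Usage sketch (placeholder file names):
# for f in contig*.fa; do
#     if has_separator "$f" ":"; then
#         echo "$f: contig names contain ':', choose another separator" >&2
#     fi
# done
```

If the check reports a clash, a character that does not appear in any sample or contig name is a reasonable choice for the replacement separator.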

After mapping samples (individually) to the combined FASTA file, you can get the results with one line of code:

```bash
-SemiBin multi_easy_bin -i concatenated.fa -b *.sorted.bam -o output
+SemiBin2 multi_easy_bin -i concatenated.fa -b *.sorted.bam -o output
```

## Output
17 changes: 6 additions & 11 deletions docs/index.md
@@ -4,8 +4,7 @@ If you use this software in a publication please cite:

> Pan, S.; Zhu, C.; Zhao, XM.; Coelho, LP. [A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments](https://doi.org/10.1038/s41467-022-29843-y). *Nat Commun* **13,** 2326 (2022). [https://doi.org/10.1038/s41467-022-29843-y](https://doi.org/10.1038/s41467-022-29843-y)
-The self-supervised approach and the algorithms used for long-read datasets (as
-well as their benchmarking) are described in
+The self-supervised approach and the algorithms used for long-read datasets (as well as their benchmarking) are described in

> Pan, S.; Zhao, XM; Coelho, LP. [SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing](https://doi.org/10.1101/2023.01.09.523201). *bioRxiv preprint* 2023.01.09.523201; [https://doi.org/10.1101/2023.01.09.523201](https://doi.org/10.1101/2023.01.09.523201)
@@ -15,11 +14,8 @@ It supports single sample, co-assembly, and multi-samples binning modes.

## SemiBin2

-The functionality of SemiBin2 is available already since version 1.4!
-
-- To use the self-supervised learning mode, use options `--self-supervised`
-- If you are using long-reads, use option `--sequencing-type=long_read`
-
+When you install the SemiBin package you get both the newer `SemiBin2` command and the older `SemiBin` command.
+It is recommended that you use the newer one exclusively for new projects and the old one only for backwards compatibility.
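
A wrapper script that must run against either an older or a newer installation can choose the command at run time. The following is a minimal sketch, assuming only that the `SemiBin2` script is on the `PATH` when a recent version of the package is installed; `first_available` and `SEMIBIN_CMD` are hypothetical names, not part of SemiBin.

```bash
# first_available NAME...
# Print the first of the given command names that is found on PATH.
# Hypothetical helper for illustration; it is not provided by SemiBin.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer the newer command and fall back to the older one.
# The two interfaces differ slightly, so a real wrapper may need to adapt its arguments.
SEMIBIN_CMD=$(first_available SemiBin2 SemiBin) || echo "SemiBin is not installed" >&2
```

For new projects, calling `SemiBin2` directly is simpler; the fallback only matters where an old installation has to be supported.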

## Install

@@ -47,19 +43,18 @@ If your assembled contigs are in a file called `S1.fa` (contig file in FASTA for
**1. Using a pre-trained model.** This is the fastest option and should work the best if you have metagenomes from one of our prebuilt habitats (alternatively, you can use the `global` "habitat" which combines all of them).

```bash
-SemiBin single_easy_bin \
+SemiBin2 single_easy_bin \
--environment human_gut \
-i S1.fa \
-b S1.sorted.bam \
-o output
```

**2. Learn a new model.** Alternatively, you can learn a new model for your data.
-The main disadvantage is that this approach will take a lot more time and use a lot more memory.
-While using a pre-trained model should take a few minutes and use 4-6GB of RAM, training a new model may take several hours and use 40GB of RAM.
+The main disadvantage is that this approach will take longer:

```bash
-SemiBin single_easy_bin \
+SemiBin2 single_easy_bin \
--environment human_gut \
-i S1.fa \
-b S1.sorted.bam \
9 changes: 3 additions & 6 deletions docs/semibin2.md
@@ -1,11 +1,8 @@
# SemiBin2

-Starting with version 1.5 (officially _SemiBin2 beta_), installing the SemiBin
-package installs two scripts: `SemiBin` and `SemiBin2`.
-
-They have the same functionality, but slightly different interfaces. The exact
-interface to `SemiBin2` should be considered as unstable (while we will strive
-to maintain backwards compatibility if you call the `SemiBin` script and will freeze `SemiBin2` when version 2.0 is released).
+Starting with version 1.5 (officially _SemiBin2 beta_), installing the SemiBin package installs two scripts: `SemiBin` and `SemiBin2`.
+They have the same functionality, but slightly different interfaces.
+As of version 2.0.0, the older `SemiBin` command is _not recommended_ (except for backwards compatibility) and newer projects should use `SemiBin2`.

## Upgrading to SemiBin2

16 changes: 10 additions & 6 deletions docs/subcommands.md
@@ -6,7 +6,7 @@ This page exhaustively lists all the subcommands and their options.
SemiBin works using a _subcommand_ interface.
Most uses are covered by either the `single_easy_bin` or `multi_easy_bin` subcommands, but you can use the other subcommands for more control.

-[![Overview of SemiBin subcommands](SemiBin.png)](SemiBin.png)
+[![Overview of SemiBin2 subcommands](SemiBin.png)](SemiBin.png)

### single_easy_bin

@@ -47,12 +47,12 @@ Starting in version 1.3, self-supervised learning is also supported, which shoul
* `--write-pre-reclustering-bins`/`--no-write-pre-reclustering-bins`: Whether to write pre-reclustering bins (defaults to true in SemiBin1; and false in SemiBin2).
* `--engine`: device used to train the model (`auto`/`gpu`/`cpu`); `auto` (default) means that SemiBin will attempt to detect and use a GPU and fall back to the CPU if no GPU is found.
* `--tmpdir`: set temporary directory.
-* `-r/--reference-db-data-dir`: GTDB reference directory (Default: `$HOME/.cache/SemiBin/mmseqs2-GTDB`). SemiBin will lazily download GTDB if it is not found there. Note that a lot of disk space is used
+* `-r/--reference-db-data-dir`: GTDB reference directory (Default: `$HOME/.cache/SemiBin/mmseqs2-GTDB`). This is only useful if you are using the deprecated semi-supervised mode. In that case, SemiBin will lazily download GTDB if it is not found there. Note that a lot of disk space is used.

#### Optional arguments to set internal parameters

* `--random-seed`: Random seed to reproduce results.
-* `--orf-finder` : gene predictor used to estimate the number of bins. Must be one of `prodigal` (default since `v0.7`), `fast-naive` (available since `v1.5`, this is a very fast internal implementation), or `fraggenescan` (which is faster, but cannot be installed in all platforms).
+* `--orf-finder` : gene predictor used to estimate the number of bins. Must be one of `prodigal` (default since `v0.7`), `fast-naive` (available since `v1.5`, this is a very fast internal implementation, default if using `SemiBin2`), or `fraggenescan` (which is faster than `prodigal`, but cannot be installed on all platforms and is still not as fast as the `fast-naive` method).


#### Optional arguments to bypass internal steps
@@ -102,6 +102,10 @@ The command `multi_easy_bin` requires the combined contig file from several samp

### generate_cannot_links

+:::{warning}
+This is only useful for using the older (deprecated) semi-supervised approach
+:::
+
Run the contig annotations using mmseqs with GTDB and generate `cannot-link` file used in the semi-supervised deep learning model training.

The subcommand `generate_cannot_links` requires the contig file as inputs and outputs the `cannot-link` constraints.
@@ -165,7 +169,7 @@ These are the same as for `multi_easy_bin`.
* `-p/--processes/-t/--threads`, `--ratio`, `--min-len`, `--ml-threshold` and `--tmpdir` are the same as for `single_easy_bin`.
* `-s/--separator` is the same as for `multi_easy_bin`.

-### train (train_semi in SemiBin2)
+### train (`train_semi` in SemiBin2)

The `train` (`train_semi` in `SemiBin2`) subcommand requires the contig file and outputs from the `generate_sequence_features_single`, `generate_sequence_features_multi` and `generate_cannot_links` subcommand as inputs (`data.csv`, `data_split.csv` and `cannot.txt`) and outputs the trained model.

@@ -216,9 +220,9 @@ The `train_self` subcommand requires the contig file and outputs from the `gener

These have the same meaning as for `single_easy_bin`

-### bin
+### bin_short

-The `bin` subcommand requires the contig file and output (files `data.csv`, `model.h5`) from the `generate_sequence_features_single`, `generate_sequence_features_multi` and `train` subcommand as inputs and output the final bins in the `output_recluster_bins` directory.
+The `bin_short` subcommand (`bin` is an accepted alias, for backwards compatibility) requires the contig file and output (files `data.csv`, `model.h5`) from the `generate_sequence_features_single`, `generate_sequence_features_multi` and `train` subcommands as inputs and outputs the final bins in the `output_recluster_bins` directory.

#### Required arguments
