DOC Update docs for SemiBin2

BigDataBiology · Oct 19, 2023 · 07ddb06
1 parent 537e34c
commit 07ddb06
Showing 5 changed files with 119 additions and 113 deletions.
diff --git a/README.md b/README.md
@@ -132,7 +132,7 @@ The `single_easy_bin` command can be used to produce results in a single step.
 For example:
 
 ```bash
-SemiBin \
+SemiBin2 \
     single_easy_bin \
     --input-fasta contig.fa \
     --input-bam mapped_reads.sorted.bam \
@@ -143,7 +143,7 @@ SemiBin \
 Alternatively, you can train a new model for that sample, by not passing in the `--environment` flag:
 
 ```bash
-SemiBin \
+SemiBin2 \
     single_easy_bin \
     --input-fasta contig.fa \
     --input-bam mapped_reads.sorted.bam \
@@ -206,15 +206,15 @@ CAAATACGAATGATTCTTTATTAGATTATCTTAATAAGAATATC
 You can use this to get the combined contig:
 
 ```bash
-SemiBin concatenate_fasta -i contig*.fa -o output
+SemiBin2 concatenate_fasta -i contig*.fa -o output
 ```
 
 If either the sample or the contig names use the default separator (`:`), you will need to change it with the `--separator`/`-s` argument.
 
 After mapping samples (individually) to the combined FASTA file, you can get the results with one line of code:
 
 ```bash
-SemiBin multi_easy_bin -i concatenated.fa -b *.sorted.bam -o output
+SemiBin2 multi_easy_bin -i concatenated.fa -b *.sorted.bam -o output
 ```
 
 ## Output

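The README changes above are a mechanical rename of each `SemiBin` invocation to `SemiBin2`. A minimal sketch of how remaining old-style invocations could be located in other files follows. The function name `find_old_invocations` is hypothetical (it is not part of SemiBin), and the pattern only looks at how a line begins, so prose lines that happen to start with the word `SemiBin` will be reported as well.

```shell
# Hypothetical helper (not part of SemiBin): list lines that still start with
# the old `SemiBin` command. A line matches when, after optional indentation,
# `SemiBin` is followed by whitespace, a line-continuation backslash, or the
# end of the line. `SemiBin2` does not match because of the trailing `2`.
find_old_invocations() {
    grep -nE '^[[:space:]]*SemiBin([[:space:]]|\\|$)' "$@"
}
```

For example, `find_old_invocations README.md docs/*.md` prints each matching line with its line number (prefixed by the file name when several files are given). `grep` exits with status 1 when nothing matches, which makes the function usable as a check in a script.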
diff --git a/docs/index.md b/docs/index.md
@@ -4,8 +4,7 @@ If you use this software in a publication please cite:
 
 >  Pan, S.; Zhu, C.; Zhao, XM.; Coelho, LP. [A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments](https://doi.org/10.1038/s41467-022-29843-y). *Nat Commun* **13,** 2326 (2022). [https://doi.org/10.1038/s41467-022-29843-y](https://doi.org/10.1038/s41467-022-29843-y)
 
-The self-supervised approach and the algorithms used for long-read datasets (as
-well as their benchmarking) are described in
+The self-supervised approach and the algorithms used for long-read datasets (as well as their benchmarking) are described in
 
 > Pan, S.; Zhao, XM; Coelho, LP. [SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing](https://doi.org/10.1101/2023.01.09.523201). *bioRxiv preprint* 2023.01.09.523201; [https://doi.org/10.1101/2023.01.09.523201](https://doi.org/10.1101/2023.01.09.523201)
 
@@ -15,11 +14,8 @@ It supports single sample, co-assembly, and multi-samples binning modes.
 
 ## SemiBin2
 
-The functionality of SemiBin2 is available already since version 1.4!
-
-- To use the self-supervised learning mode, use options `--self-supervised`
-- If you are using long-reads, use option `--sequencing-type=long_read`
-
+When you install the SemiBin package you get both the newer `SemiBin2` command and the older `SemiBin` command.
+We recommend using the newer one exclusively for new projects and the older one only for backwards compatibility.
 
 ## Install
 
@@ -47,19 +43,18 @@ If your assembled contigs are in a file called `S1.fa` (contig file in FASTA for
 **1. Using a pre-trained model.** This is the fastest option and should work the best if you have metagenomes from one of our prebuilt habitats (alternatively, you can use the `global` "habitat" which combines all of them).
 
 ```bash
-SemiBin single_easy_bin \
+SemiBin2 single_easy_bin \
         --environment human_gut \
         -i S1.fa \
         -b S1.sorted.bam \
         -o output
 ```
 
 **2. Learn a new model.** Alternatively, you can learn a new model for your data.
-The main disadvantage is that this approach will take a lot more time and use a lot more memory.
-While using a pre-trained model should take a few minutes and use 4-6GB of RAM, training a new model may take several hours and use 40GB of RAM.
+The main disadvantage is that this approach will take longer:
 
 ```bash
-SemiBin single_easy_bin \
+SemiBin2 single_easy_bin \
         --environment human_gut \
         -i S1.fa \
         -b S1.sorted.bam \

diff --git a/docs/semibin2.md b/docs/semibin2.md
@@ -1,11 +1,8 @@
 # SemiBin2
 
-Starting with version 1.5 (officially _SemiBin2 beta_), installing the SemiBin
-package installs two scripts: `SemiBin` and `SemiBin2`.
-
-They have the same functionality, but slightly different interfaces. The exact
-interface to `SemiBin2` should be considered as unstable (while we will strive
-to maintain backwards compatibility if you call the `SemiBin` script and will freeze `SemiBin2` when version 2.0 is released).
+Starting with version 1.5 (officially _SemiBin2 beta_), installing the SemiBin package installs two scripts: `SemiBin` and `SemiBin2`.
+They have the same functionality, but slightly different interfaces.
+As of version 2.0.0, the older `SemiBin` command is _not recommended_ (except for backwards compatibility) and newer projects should use `SemiBin2`.
 
 ## Upgrading to SemiBin2
 

diff --git a/docs/subcommands.md b/docs/subcommands.md
@@ -6,7 +6,7 @@ This page exhaustively lists all the subcommands and their options.
 SemiBin works using a _subcommand_ interface.
 Most uses are covered by either the `single_easy_bin` or `multi_easy_bin` subcommands, but you can use the other subcommands for more control.
 
-[![Overview of SemiBin subcommands](SemiBin.png)](SemiBin.png)
+[![Overview of SemiBin2 subcommands](SemiBin.png)](SemiBin.png)
 
 ### single_easy_bin
 
@@ -47,12 +47,12 @@ Starting in version 1.3, self-supervised learning is also supported, which shoul
 * `--write-pre-reclustering-bins`/`--no-write-pre-reclustering-bins`: Whether to write pre-reclustering bins (defaults to true in SemiBin1; and false in SemiBin2).
 * `--engine`: device used to train the model (`auto`/`gpu`/`cpu`); `auto` (default) means that SemiBin will attempt to detect and use a GPU and fall back to the CPU if no GPU is found.
 * `--tmpdir`: set temporary directory.
-* `-r/--reference-db-data-dir`: GTDB reference directory (Default: `$HOME/.cache/SemiBin/mmseqs2-GTDB`). SemiBin will lazily download GTDB if it is not found there. Note that a lot of disk space is used
+* `-r/--reference-db-data-dir`: GTDB reference directory (Default: `$HOME/.cache/SemiBin/mmseqs2-GTDB`). This is only useful if you are using the deprecated semi-supervised mode. In that case, SemiBin will lazily download GTDB if it is not found there. Note that a lot of disk space is used.
 
 #### Optional arguments to set internal parameters
 
 * `--random-seed`: Random seed to reproduce results.
-* `--orf-finder` : gene predictor used to estimate the number of bins. Must be one of `prodigal` (default since `v0.7`), `fast-naive` (available since `v1.5`, this is a very fast internal implementation), or `fraggenescan` (which is faster, but cannot be installed in all platforms).
+* `--orf-finder`: gene predictor used to estimate the number of bins. Must be one of `prodigal` (default since `v0.7`), `fast-naive` (available since `v1.5`, this is a very fast internal implementation, default if using `SemiBin2`), or `fraggenescan` (which is faster than `prodigal`, but cannot be installed on all platforms and is still not as fast as the `fast-naive` method).
 
 
 #### Optional arguments to bypass internal steps
@@ -102,6 +102,10 @@ The command `multi_easy_bin` requires the combined contig file from several samp
 
 ### generate_cannot_links
 
+:::{warning}
+This is only useful for using the older (deprecated) semi-supervised approach
+:::
+
 Run the contig annotations using mmseqs with GTDB and generate `cannot-link` file used in the semi-supervised deep learning model training.
 
 The subcommand `generate_cannot_links` requires the contig file as inputs and outputs the `cannot-link` constraints.
@@ -165,7 +169,7 @@ These are the same as for `multi_easy_bin`.
 * `-p/--processes/-t/--threads`, `--ratio`, `--min-len`, `--ml-threshold` and `--tmpdir` are the same as for `single_easy_bin`.
 * `-s/--separator` are the same as for `multi_easy_bin`.
 
-### train (train_semi in SemiBin2)
+### train (`train_semi` in SemiBin2)
 
 The `train` (`train_semi` in `SemiBin2`) subcommand requires the contig file and outputs from the `generate_sequence_features_single`, `generate_sequence_features_multi` and `generate_cannot_links` subcommand as inputs (`data.csv`, `data_split.csv` and `cannot.txt`) and outputs the trained model.
 
@@ -216,9 +220,9 @@ The `train_self` subcommand requires the contig file and outputs from the `gener
 
 These have the same meaning as for `single_easy_bin`
 
-### bin
+### bin_short
 
-The `bin` subcommand requires the contig file and output (files `data.csv`, `model.h5`) from the `generate_sequence_features_single`, `generate_sequence_features_multi` and `train` subcommand as inputs and output the final bins in the `output_recluster_bins` directory.
+The `bin_short` subcommand (`bin` is an accepted alias, for backwards compatibility) requires the contig file and output (files `data.csv`, `model.h5`) from the `generate_sequence_features_single`, `generate_sequence_features_multi` and `train` subcommands as inputs and outputs the final bins in the `output_recluster_bins` directory.
 
 #### Required arguments