Updated the docs

kharchenkolab · Sep 24, 2024 · 9e163c4
1 parent 0fbad4b

Showing 5 changed files with 32 additions and 15 deletions.
diff --git a/docs/Manifest.toml b/docs/Manifest.toml
@@ -1,6 +1,6 @@
 # This file is machine-generated - editing it directly is not advised
 
-julia_version = "1.10.4"
+julia_version = "1.10.5"
 manifest_format = "2.0"
 project_hash = "cfe419f0a5bd446f85d5c51ed1644192d6c33e05"
 
@@ -106,9 +106,9 @@ uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
 
 [[deps.Baysor]]
 deps = ["Base64", "CSV", "CairoMakie", "CategoricalArrays", "ColorSchemes", "Colors", "Comonicon", "Compat", "Configurations", "DataFrames", "Dates", "Deneb", "Distances", "Distributions", "FastRandPCA", "FileIO", "FiniteDiff", "GeometricalPredicates", "Graphs", "HDF5", "ImageCore", "ImageIO", "ImageMagick", "ImageMorphology", "JSON", "LazyModules", "LibGit2", "LinearAlgebra", "Logging", "MAT", "Makie", "MultivariateStats", "NearestNeighbors", "OrderedCollections", "Parquet", "Pipe", "Pkg", "ProgressMeter", "Random", "SparseArrays", "StaticArrays", "Statistics", "StatsBase", "Test", "UMAP", "UUIDs", "VoronoiDelaunay"]
-path = "/home/vpetukhov/.julia/dev/Baysor"
+path = "/Users/vpetukhov/.julia/dev/Baysor"
 uuid = "cc9f9468-1fbe-11e9-0acf-e9460511877c"
-version = "0.6.2"
+version = "0.7.0"
 
 [[deps.BufferedStreams]]
 git-tree-sha1 = "6863c5b7fc997eadcabdbaf6c5f201dc30032643"
@@ -1869,7 +1869,7 @@ version = "0.15.2+0"
 [[deps.libblastrampoline_jll]]
 deps = ["Artifacts", "Libdl"]
 uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
-version = "5.8.0+1"
+version = "5.11.0+0"
 
 [[deps.libfdk_aac_jll]]
 deps = ["Artifacts", "JLLWrappers", "Libdl"]

diff --git a/docs/src/preview.md b/docs/src/preview.md
@@ -2,6 +2,12 @@
 
 As a full run takes some time, it can be useful to run a quick preview to get meaning from the data and to get some guesses about the parameters of the full run. The only output of this step is `preview.html`, which visualizes the dataset and provides some diagnostics. Be careful, as this can get too large to be rendered for large datasets, so it's better to run on a subset of the data.
 
+```bash
+baysor preview <args> [options]
+```
+
+CLI parameters:
+
 ```@docs
 Baysor.CLI.preview
 ```
diff --git a/docs/src/run.md b/docs/src/run.md
@@ -5,7 +5,7 @@ Baysor can be used in several ways:
 A minimal command for cell segmentation:
 
 ```bash
-baysor run [-s SCALE -x X_COL -y Y_COL -z Z_COL --gene GENE_COL] -c config.toml MOLECULES_FILE [PRIOR_SEGMENTATION]
+baysor run [-s SCALE -x X_COL -y Y_COL -z Z_COL --gene GENE_COL -c config.toml -o OUTPUT_PATH] MOLECULES_FILE [PRIOR_SEGMENTATION]
 ```
 
 ## Dataset preview
@@ -22,6 +22,6 @@ baysor preview [-x X_COL -y Y_COL --gene GENE_COL -c config.toml -o OUTPUT_PATH]
 Many analyses don't require segmentation, and can be run on local neighborhoods instead. In the paper, we call them Neighborhood Composition Vectors (NCVs). To obtain them from Baysor, you may run `baysor segfree`. For more information, see `baysor segfree --help`. Minimal command:
 
 ```bash
-baysor segfree [-k K_NEIGHBORS -n NCVS_TO_SAVE -x X_COL -y Y_COL --gene GENE_COL -c config.toml -o OUTPUT_PATH] MOLECULES_FILE
+baysor segfree [-k K_NEIGHBORS -x X_COL -y Y_COL --gene GENE_COL -c config.toml -o OUTPUT_PATH] MOLECULES_FILE
 ```
 
diff --git a/docs/src/segfree.md b/docs/src/segfree.md
@@ -2,6 +2,12 @@
 
 Many analyses don't require segmentation, and can be run on local neighborhoods instead. In the paper, we call them Neighborhood Composition Vectors (NCVs). To obtain them from Baysor, you may run `baysor segfree`. This would output a [loom file](https://linnarssonlab.org/loompy/format/index.html) with NCVs of size `k` in (`/matrix`), storing `ncv_color` and `confidence` as column attributes (`/col_attrs/`).
 
+```bash
+baysor segfree <args> [options]
+```
+
+CLI parameters:
+
 ```@docs
 Baysor.CLI.segfree
 ```
diff --git a/docs/src/segmentation.md b/docs/src/segmentation.md
@@ -4,6 +4,12 @@
 
 To run the algorithm on your data, use the following command:
 
+```bash
+baysor run <args> [options] [flags]
+```
+
+CLI parameters:
+
 ```@docs
 Baysor.CLI.run
 ```
@@ -49,7 +55,8 @@ Please, notice that it's highly recommended to set `--n-clusters=1`, so molecule
 
 ## Outputs
 
-Segmentation results:
+### Segmentation results
+
 - **segmentation\_counts.loom** or **segmentation\_counts.tsv** (depends on `--count-matrix-format`): count matrix with segmented stats. In the case of loom format, column attributes also contain the same info as **segmentation\_cell\_stats.csv**.
 - **segmentation.csv**: segmentation info per molecule:
   - `confidence`: probability of a molecule to be real (i.e. not noise)
@@ -68,15 +75,17 @@ Segmentation results:
   - `max_cluster_frac` *(only if `n-clusters > 1`)*: fraction of the molecules coming from the most popular cluster. Cells with low `max_cluster_frac` are often doublets.
   - `lifespan`: number of iterations the given component exists. The maximal `lifespan` is clipped proportionally to the total number of iterations. Components with a short lifespan likely correspond to noise.
 
-Visualization:
+### Visualization
+
 - **segmentation\_polygons\_2d/3d.json**: polygons used for visualization in GeoJSON format. In the case of 3D segmentation, `2d.json` file contains polygons for all molecules pulled across the z-stack. And 3D shows polygons per z-slice. In case of continuous z, it's binned into 20 uniform bins. Depending on the format, it can contain `GeometryCollection` or `FeatureCollection`:
     - For 3D, the file contains an array of dictionaries (one per z-slice), each of which representing a `Collection`. For 2D data it's just a single dictionary with a `Collection`.
     - Each `GeometryCollection` has a field `geometries`, which is an array of polygons with `cell` field set to cell ids and `coordinates` set to its coordinates.
     - `FeatureCollection` is the format, compatible with [10x SpaceRanger](https://www.10xgenomics.com/support/software/xenium-ranger/1.7/analysis/inputs/XR-input-overview#compat-files). It contains a list of `Feature`s with cell ids saved in the `id` field and coordinates in `geometry/coordinates`.
 - **segmentation\_diagnostics.html**: visualization of the algorithm QC. *Shown only when `-p` is set.*
 - **segmentation\_borders.html**: visualization of cell borders for the dataset colored by local gene expression composition (first part) and molecule clusters (second part). *Shown only when `-p` is set.*
 
-Other:
+### Other
+
 - **segmentation\_params.dump.toml**: aggregated parameters from the config and CLI
 
 ## Choice of parameters
@@ -86,11 +95,7 @@ Most important parameters:
 - `scale` is the most sensitive parameter, which specifies the expected radius of a cell. It doesn't have to be precise, but the wrong setup can lead to over- or under-segmentation. This parameter is inferred automatically if cell centers are provided.
 - `min-molecules-per-cell` is the number of molecules, required for a cell to be considered as real. It really depends on the protocol. For instance, for ISS it's fine to set it to 3, while for MERFISH it can require hundreds of molecules.
 
-Some other sensitive parameters (normally, shouldn't be changed):
-
-- `new-component-weight` is proportional to the probability of generating a new cell for a molecule, instead of assigning it to one of the existing cells. More precisely, the probability to assign a molecule to a particular cell linearly depends on the number of molecules, already assigned to this cell. And this parameter is used as the number of molecules for a cell, which is just generated for this new molecule. The algorithm is robust to small changes in this parameter. And normally values in the range of 0.1-0.9 should work fine. Smaller values would lead to slower convergence of the algorithm, while larger values force the emergence of a large number of small cells on each iteration, which can produce noise in the result. In general, the default value should work well.
-
 Run parameters:
 
-- `--config.segmentation.n-cells-init` expected number of cells in data. This parameter influence only the convergence speed of the algorithm. It's better to set larger values than smaller ones.
-- `--config.segmentation.iters` number of iterations for the algorithm. **At the moment, no convergence criteria are implemented, so it will work exactly `iters` iterations**. Thus, too small values would lead to non-convergence of the algorithm, while larger ones would just increase working time. Optimal values can be estimated by the convergence plots, produced among the results.
+- `--config.segmentation.n-cells-init` expected number of cells in data. This parameter influences the convergence speed of the algorithm, as well as peak memory usage. Setting this value too small would lead to under-segmentation.
+- `--config.segmentation.iters` number of iterations for the algorithm. **At the moment, no convergence criteria are implemented, so it will work exactly `iters` iterations**. Thus, too small values would lead to non-convergence of the algorithm, while larger ones would just increase working time. Optimal values can be estimated by the convergence plots in the diagnostics report.
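
Note on the polygon outputs documented in the `docs/src/segmentation.md` hunk: the two layouts described there (`GeometryCollection` with `geometries`, `cell` and `coordinates`; `FeatureCollection` with `id` and `geometry/coordinates`) could be read with a sketch along these lines. It is a minimal illustration, not Baysor code. The helper names and the miniature sample data are hypothetical, and the `features` key is assumed from the GeoJSON convention rather than stated in the docs.

```python
import json


def iter_cell_polygons(collection):
    """Yield (cell_id, coordinates) pairs from one collection dictionary."""
    if "geometries" in collection:
        # GeometryCollection layout: polygons carry a `cell` field.
        for geom in collection["geometries"]:
            yield geom["cell"], geom["coordinates"]
    elif "features" in collection:
        # FeatureCollection layout; the `features` key is an assumption
        # taken from the GeoJSON convention.
        for feat in collection["features"]:
            yield feat["id"], feat["geometry"]["coordinates"]
    else:
        raise ValueError("unrecognized collection layout")


def polygons_per_slice(parsed):
    """Return a list of {cell_id: coordinates} dicts, one per z-slice.

    2D output is a single dictionary; 3D output is an array of
    dictionaries, so a single dictionary is wrapped into a list.
    """
    slices = parsed if isinstance(parsed, list) else [parsed]
    return [dict(iter_cell_polygons(c)) for c in slices]


# Hypothetical miniature inputs, not real Baysor output.
sample_geometry = json.loads("""
{"type": "GeometryCollection",
 "geometries": [
   {"type": "Polygon", "cell": "cell-1",
    "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}
 ]}
""")

sample_feature = json.loads("""
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "id": "cell-2",
    "geometry": {"type": "Polygon",
                 "coordinates": [[[2, 2], [3, 2], [3, 3], [2, 2]]]}}
 ]}
""")

flat = polygons_per_slice(sample_geometry)
stacked = polygons_per_slice([sample_geometry, sample_feature])
print(len(flat), len(stacked))
```

For a real run, the result of `json.load` on the polygons file would be passed to `polygons_per_slice` in place of the sample dictionaries. Detecting the layout by key rather than by a `type` value avoids relying on a field the docs do not mention.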