Commit

Update documentation, closes #582

davidmezzetti committed Oct 27, 2023
1 parent 4447685 commit 3847e0f

Showing 7 changed files with 77 additions and 21 deletions.
9 changes: 7 additions & 2 deletions docs/embeddings/configuration/ann.md
@@ -21,19 +21,24 @@ faiss:
x = 4 * sqrt(embeddings count)
nprobe: search probe setting (int) - defaults to x/16 (as defined above)
for larger indexes
nflip: same as nprobe - only used with binary hash indexes
quantize: store vectors with x-bit precision vs 32-bit (bool|int)
true sets 8-bit precision, false disables, int sets specified
precision
mmap: load as on-disk index (boolean) - trade query response time for a
smaller RAM footprint, defaults to false
sample: percent of data to use for model training (0.0 - 1.0)
reduces indexing time for larger (>1M+ row) indexes, defaults to 1.0
```

Faiss supports both floating point and binary indexes. Floating point indexes are the default. Binary indexes are used when indexing scalar-quantized datasets.
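
For illustration, a minimal configuration sketch in Python. The model path and parameter values are assumptions, not recommendations.

```python
from txtai.embeddings import Embeddings

# Sketch: faiss backend with 8-bit quantization, an on-disk index
# and 50% of the data used for index model training
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    backend="faiss",
    faiss={"quantize": 8, "mmap": True, "sample": 0.5}
)
```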

See the following Faiss documentation links for more information.

- [Guidelines for choosing an index](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index)
- [Index configuration summary](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes)
- [Index Factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory)
- [Binary Indexes](https://github.com/facebookresearch/faiss/wiki/Binary-indexes)
- [Search Tuning](https://github.com/facebookresearch/faiss/wiki/Faster-search)

### hnsw
2 changes: 1 addition & 1 deletion docs/embeddings/configuration/database.md
@@ -9,7 +9,7 @@ content: boolean|sqlite|duckdb|client|url|custom
Enables content storage. When true, the default storage engine, `sqlite`, is used to save metadata alongside embeddings vectors.

Client-server connections are supported with either `client` or a full connection URL. When set to `client`, the CLIENT_URL environment variable must be set to the full connection URL. See the [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls) documentation for more information on how to construct connection strings for client-server databases.

Add custom storage engines by setting this parameter to a fully resolvable class string.
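
A sketch of the common configurations follows. The connection URL is hypothetical.

```python
from txtai.embeddings import Embeddings

# Default content storage engine (sqlite)
embeddings = Embeddings(content=True)

# Client-server database via a full connection URL (hypothetical URL)
embeddings = Embeddings(content="postgresql+psycopg2://user:pass@localhost/txtai")
```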

20 changes: 15 additions & 5 deletions docs/embeddings/configuration/vectors.md
@@ -78,13 +78,15 @@ encodebatch: int

Sets the encode batch size. This parameter controls the underlying vector model batch size. This often corresponds to a GPU batch size, which controls GPU memory usage.
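
A sketch of setting both batch parameters. The values are illustrative assumptions, not tuned recommendations.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    batch=1024,     # rows buffered per vectorization call
    encodebatch=32  # underlying model batch size, drives GPU memory usage
)
```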

## quantize
```yaml
quantize: int|bool
```

Enables scalar quantization at the specified precision. Supports 1-bit through 8-bit quantization. Scalar quantization transforms continuous floating point values to discrete unsigned integers.

This parameter supports booleans for backward compatibility. When set to true/false, this flag sets [faiss.quantize](../ann/#faiss).
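
A sketch of enabling scalar quantization. The precision value is an illustrative assumption.

```python
from txtai.embeddings import Embeddings

# 4-bit scalar quantization of vectors
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    quantize=4
)
```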

## instructions
```yaml
@@ -95,4 +97,12 @@ instructions:

Instruction-based models use prefixes to modify how embeddings are computed. This is especially useful with asymmetric search, where the query and indexed data have vastly different lengths. In other words, short queries paired with long documents.

[E5-base](https://huggingface.co/intfloat/e5-base) is an example of a model that accepts instructions. It takes `query: ` and `passage: ` prefixes and uses those to generate embeddings that work well for asymmetric search.
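
A sketch of setting instruction prefixes for this model, assuming `query` and `passage` keys as shown.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(
    path="intfloat/e5-base",
    instructions={"query": "query: ", "passage": "passage: "}
)
```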

## tokenize
```yaml
tokenize: boolean
```

Enables string tokenization (defaults to false). This method applies tokenization rules that only work with English language text and may increase the quality of
English language sentence embeddings in some situations.
2 changes: 1 addition & 1 deletion docs/embeddings/index.md
@@ -3,7 +3,7 @@
![embeddings](../images/embeddings.png#only-light)
![embeddings](../images/embeddings-dark.png#only-dark)

An embeddings database is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.

The following code snippet shows how to build and search an embeddings index.
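
A minimal sketch, assuming a sentence-transformers vector model and sample data.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2")

# Index (id, data, tags) tuples
embeddings.index([(0, "US tops 5 million confirmed virus cases", None)])

# Find the best match for a natural language query
print(embeddings.search("public health story", 1))
```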

22 changes: 17 additions & 5 deletions docs/embeddings/indexing.md
@@ -3,7 +3,7 @@
![indexing](../images/indexing.png#only-light)
![indexing](../images/indexing-dark.png#only-dark)

This section gives an in-depth overview of how to index data with txtai. We'll cover vectorization, indexing, updating and deleting data, and the various components of an embeddings database.

## Vectorization

@@ -13,6 +13,8 @@ The [batch](../configuration/vectors#batch) and [encodebatch](../configuration/v

Data is buffered to temporary storage during indexing as embeddings vectors can be quite large (for example 768 dimensions of float32 is 768 * 4 = 3072 bytes per vector). Once vectorization is complete, a mmapped array is created with all vectors for [Approximate Nearest Neighbor (ANN)](../configuration/vectors#backend) indexing.

The terms `ANN` and `dense vector index` are used interchangeably throughout txtai's documentation.

## Setting a backend

As mentioned above, computed vectors are stored in an ANN. There are various index [backends](../configuration/ann#backend) that can be configured. Faiss is the default backend.
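
A sketch of swapping backends, assuming `hnsw` is an available backend name.

```python
from txtai.embeddings import Embeddings

# Use Hnswlib in place of the default faiss backend
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    backend="hnsw"
)
```
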
@@ -104,7 +106,19 @@ embeddings.graph.showpath(id1, id2)

Graphs are persisted alongside an embeddings index. Each save and load will also save and load the graph.
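
A sketch, assuming an existing embeddings instance with a graph enabled; the path is illustrative.

```python
# Saving an index also saves the associated graph
embeddings.save("./index")

# Loading restores both the index and the graph
embeddings.load("./index")
```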

## Sparse vectors

Scoring instances can create a standalone [keyword](../configuration/general#keyword) or sparse index (BM25, TF-IDF). This enables [hybrid](../configuration/general/#hybrid) search when there is an available dense vector index.

The terms `sparse vector index`, `keyword index`, `terms index` and `scoring index` are used interchangeably throughout txtai's documentation.

See [this link](../../examples/#semantic-search) to learn more.
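
A sketch of enabling a sparse index alongside a dense index, assuming the `hybrid` and `keyword` flags.

```python
from txtai.embeddings import Embeddings

# Dense + sparse indexes, enables hybrid search
embeddings = Embeddings(content=True, hybrid=True)

# Standalone sparse keyword index
embeddings = Embeddings(content=True, keyword=True)
```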

## Subindexes

An embeddings instance can optionally have associated [subindexes](../configuration/general/#indexes), which are also embeddings databases. This enables indexing additional fields, vector models and much more.
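
A sketch of attaching a subindex; the index name and parameters are hypothetical.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(
    content=True,
    indexes={
        # Hypothetical subindex with an alternate vector model
        "subindex1": {"path": "sentence-transformers/all-MiniLM-L6-v2"}
    }
)
```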

## Word vectors

When using [word vector backed models](../configuration/vectors#words) with scoring set, a separate call is required before calling `index` as follows:

@@ -113,6 +127,4 @@ embeddings.score(rows)
embeddings.index(rows)
```

Both calls are required to support generator-backed iteration of data with word vector models.
7 changes: 6 additions & 1 deletion docs/embeddings/methods.md
@@ -3,14 +3,19 @@
::: txtai.embeddings.Embeddings
selection:
filters:
- "!columns"
- "!createann"
- "!createcloud"
- "!createdatabase"
- "!creategraph"
- "!createindexes"
- "!createscoring"
- "!checkarchive"
- "!configure"
- "!defaultallowed"
- "!defaults"
- "!initindex"
- "!loadconfig"
- "!loadquery"
- "!loadvectors"
- "!normalize"
- "!saveconfig"
36 changes: 30 additions & 6 deletions docs/embeddings/query.md
@@ -35,24 +35,30 @@ SELECT id, text, score FROM txtai WHERE similar('feel good story')
SELECT id, text, score FROM txtai WHERE similar('feel good story')
```

The similar clause takes the following arguments:

```sql
similar("query", "number of candidates")
similar("query", "number of candidates", "index", "weights")
```

| Argument | Description |
| --------------------- | ---------------------------------------|
| query | natural language query to run |
| number of candidates | number of candidate results to return |
| index | target index name |
| weights | hybrid score weights |

The txtai query layer joins results from two separate components, a relational store and a similarity index. With a similar clause, a similarity search is run and those ids are fed to the underlying database query.

The number of candidates should be larger than the desired number of results when additional filter clauses are applied, ensuring that `limit` results are still returned after filtering. If the number of candidates is not specified, it defaults as follows:

- For a single query filter clause, the default is the query limit
- With multiple filtering clauses, the default is 10x the query limit
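
A sketch combining a similar clause with a score filter, assuming an existing embeddings database with content enabled; the candidate count and threshold are illustrative.

```python
# Request 50 candidates so 10 rows remain after the score filter
embeddings.search(
    "select id, text, score from txtai "
    "where similar('feel good story', 50) and score >= 0.15 "
    "limit 10"
)
```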

The index name is only applicable when [subindexes](../configuration/general/#indexes) are enabled. This specifies the index to use for the query.

Weights sets the hybrid score weights when an index has both a sparse and dense index.

### Dynamic columns

Content can be indexed in multiple ways when content storage is enabled. [Remember that input documents](../#index) take the form of `(id, data, tags)` tuples. If data is a string, then content is primarily filtered with similar clauses. If data is a dictionary, then all fields in the dictionary are indexed and searchable.
@@ -161,10 +167,28 @@ entry >= date('now', '-1 day')

This requires setting a [query translation model](../configuration/database#query). The default query translation model is [t5-small-txtsql](https://huggingface.co/NeuML/t5-small-txtsql) but this can easily be finetuned to handle different use cases.

## Hybrid search

When an embeddings database has both a sparse and dense index, both indexes will be queried and the results will be equally weighted unless otherwise specified.

```python
embeddings.search("query", weights=0.5)
embeddings.search("select id, text, score from txtai where similar('query', 0.5)")
```

## Subindexes

Subindexes can be queried as follows:

```python
embeddings.search("query", index="subindex1")
embeddings.search("select id, text, score from txtai where similar('query', 'subindex1')")
```

## Combined index architecture

txtai has multiple storage and indexing components. Content is stored in an underlying database along with an approximate nearest neighbor (ANN) index, keyword index and graph network. These components combine to deliver similarity search alongside traditional structured search.

The ANN index stores ids and vectors for each input element. When a natural language query is run, the query is translated into a vector and a similarity query finds the best matching ids. When a database is added into the mix, an additional step is executed. This step takes those ids and effectively inserts them as part of the underlying database query. The same steps apply with keyword indexes except a term frequency index is used to find the best matching ids.

Dynamic columns are supported via the underlying engine. For SQLite, data is stored as JSON and dynamic columns are converted into `json_extract` clauses. Client-server databases are supported via [SQLAlchemy](https://docs.sqlalchemy.org/en/20/dialects/) and dynamic columns are supported provided the underlying engine has [JSON](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.JSON) support.
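
A sketch of a dynamic column in action; the field name and data are hypothetical.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(content=True)

# "flag" is a hypothetical dynamic field stored with the document
embeddings.index([(0, {"text": "Sample text", "flag": 1}, None)])

# On SQLite, the flag reference is translated to a json_extract clause
print(embeddings.search("select text, flag from txtai where flag = 1"))
```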
