Commit

Update documentation, closes #582

davidmezzetti committed Oct 27, 2023
1 parent 4447685 commit 3847e0f

Showing 7 changed files with 77 additions and 21 deletions.
9 changes: 7 additions & 2 deletions docs/embeddings/configuration/ann.md
@@ -21,19 +21,24 @@ faiss:
x = 4 * sqrt(embeddings count)
nprobe: search probe setting (int) - defaults to x/16 (as defined above)
for larger indexes
nflip: same as nprobe - only used with binary hash indexes
quantize: store vectors with x-bit precision vs 32-bit (bool|int)
true sets 8-bit precision, false disables, int sets specified
precision
mmap: load as on-disk index (boolean) - trade query response time for a
smaller RAM footprint, defaults to false
sample: percent of data to use for model training (0.0 - 1.0)
reduces indexing time for larger (>1M+ row) indexes, defaults to 1.0
```

Faiss supports both floating point and binary indexes. Floating point indexes are the default. Binary indexes are used when indexing scalar-quantized datasets.
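
For illustration, a minimal configuration sketch in Python. The model path and parameter values are assumptions, not recommendations.

```python
from txtai.embeddings import Embeddings

# Sketch: faiss backend with 8-bit quantization, an on-disk index
# and 50% of the data used for index model training
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    backend="faiss",
    faiss={"quantize": 8, "mmap": True, "sample": 0.5}
)
```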

See the following Faiss documentation links for more information.

- [Guidelines for choosing an index](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index)
- [Index configuration summary](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes)
- [Index Factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory)
- [Binary Indexes](https://github.com/facebookresearch/faiss/wiki/Binary-indexes)
- [Search Tuning](https://github.com/facebookresearch/faiss/wiki/Faster-search)

### hnsw
2 changes: 1 addition & 1 deletion docs/embeddings/configuration/database.md
@@ -9,7 +9,7 @@ content: boolean|sqlite|duckdb|client|url|custom
Enables content storage. When true, the default storage engine, `sqlite`, is used to save metadata alongside embeddings vectors.

Client-server connections are supported with either `client` or a full connection URL. When set to `client`, the CLIENT_URL environment variable must be set to the full connection URL. See the [SQLAlchemy](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls) documentation for more information on how to construct connection strings for client-server databases.

Add custom storage engines by setting this parameter to a fully resolvable class string.
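
A sketch of the common configurations follows. The connection URL is hypothetical.

```python
from txtai.embeddings import Embeddings

# Default content storage engine (sqlite)
embeddings = Embeddings(content=True)

# Client-server database via a full connection URL (hypothetical URL)
embeddings = Embeddings(content="postgresql+psycopg2://user:pass@localhost/txtai")
```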

20 changes: 15 additions & 5 deletions docs/embeddings/configuration/vectors.md
@@ -78,13 +78,15 @@ encodebatch: int

Sets the encode batch size. This parameter controls the underlying vector model batch size. This often corresponds to a GPU batch size, which controls GPU memory usage.
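
A sketch of setting both batch parameters. The values are illustrative assumptions, not tuned recommendations.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    batch=1024,     # rows buffered per vectorization call
    encodebatch=32  # underlying model batch size, drives GPU memory usage
)
```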

## quantize
```yaml
quantize: int|bool
```

Enables scalar quantization at the specified precision. Supports 1-bit through 8-bit quantization. Scalar quantization transforms continuous floating point values to discrete unsigned integers.

This parameter supports booleans for backward compatibility. When set to true/false, this flag sets [faiss.quantize](../ann/#faiss).
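
A sketch of enabling scalar quantization. The precision value is an illustrative assumption.

```python
from txtai.embeddings import Embeddings

# 4-bit scalar quantization of vectors
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    quantize=4
)
```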

## instructions
```yaml
@@ -95,4 +97,12 @@ instructions:

Instruction-based models use prefixes to modify how embeddings are computed. This is especially useful with asymmetric search, where the query and indexed data have vastly different lengths. In other words, short queries paired with long documents.

[E5-base](https://huggingface.co/intfloat/e5-base) is an example of a model that accepts instructions. It takes `query: ` and `passage: ` prefixes and uses those to generate embeddings that work well for asymmetric search.
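
A sketch of setting instruction prefixes for this model, assuming `query` and `passage` keys as shown.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(
    path="intfloat/e5-base",
    instructions={"query": "query: ", "passage": "passage: "}
)
```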

## tokenize
```yaml
tokenize: boolean
```

Enables string tokenization (defaults to false). This method applies tokenization rules that only work with English language text and may increase the quality of
English language sentence embeddings in some situations.
2 changes: 1 addition & 1 deletion docs/embeddings/index.md
@@ -3,7 +3,7 @@
![embeddings](../images/embeddings.png#only-light)
![embeddings](../images/embeddings-dark.png#only-dark)

An embeddings database is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.

The following code snippet shows how to build and search an embeddings index.
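
A minimal sketch, assuming a sentence-transformers vector model and sample data.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2")

# Index (id, data, tags) tuples
embeddings.index([(0, "US tops 5 million confirmed virus cases", None)])

# Find the best match for a natural language query
print(embeddings.search("public health story", 1))
```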

22 changes: 17 additions & 5 deletions docs/embeddings/indexing.md
@@ -3,7 +3,7 @@
![indexing](../images/indexing.png#only-light)
![indexing](../images/indexing-dark.png#only-dark)

This section gives an in-depth overview of how to index data with txtai. We'll cover vectorization, indexing, updating and deleting data, and the various components of an embeddings database.

## Vectorization

@@ -13,6 +13,8 @@ The [batch](../configuration/vectors#batch) and [encodebatch](../configuration/v

Data is buffered to temporary storage during indexing as embeddings vectors can be quite large (for example 768 dimensions of float32 is 768 * 4 = 3072 bytes per vector). Once vectorization is complete, a mmapped array is created with all vectors for [Approximate Nearest Neighbor (ANN)](../configuration/vectors#backend) indexing.

The terms `ANN` and `dense vector index` are used interchangeably throughout txtai's documentation.

## Setting a backend

As mentioned above, computed vectors are stored in an ANN. There are various index [backends](../configuration/ann#backend) that can be configured. Faiss is the default backend.
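
A sketch of swapping backends, assuming `hnsw` is an available backend name.

```python
from txtai.embeddings import Embeddings

# Use Hnswlib in place of the default faiss backend
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",  # assumed vector model
    backend="hnsw"
)
```
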
@@ -104,7 +106,19 @@ embeddings.graph.showpath(id1, id2)

Graphs are persisted alongside an embeddings index. Each save and load will also save and load the graph.
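
A sketch, assuming an existing embeddings instance with a graph enabled; the path is illustrative.

```python
# Saving an index also saves the associated graph
embeddings.save("./index")

# Loading restores both the index and the graph
embeddings.load("./index")
```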

## Sparse vectors

Scoring instances can create a standalone [keyword](../configuration/general#keyword) or sparse index (BM25, TF-IDF). This enables [hybrid](../configuration/general/#hybrid) search when there is an available dense vector index.

The terms `sparse vector index`, `keyword index`, `terms index` and `scoring index` are used interchangeably throughout txtai's documentation.

See [this link](../../examples/#semantic-search) to learn more.
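
A sketch of enabling a sparse index alongside a dense index, assuming the `hybrid` and `keyword` flags.

```python
from txtai.embeddings import Embeddings

# Dense + sparse indexes, enables hybrid search
embeddings = Embeddings(content=True, hybrid=True)

# Standalone sparse keyword index
embeddings = Embeddings(content=True, keyword=True)
```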

## Subindexes

An embeddings instance can optionally have associated [subindexes](../configuration/general/#indexes), which are also embeddings databases. This enables indexing additional fields, vector models and much more.
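
A sketch of attaching a subindex; the index name and parameters are hypothetical.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(
    content=True,
    indexes={
        # Hypothetical subindex with an alternate vector model
        "subindex1": {"path": "sentence-transformers/all-MiniLM-L6-v2"}
    }
)
```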

## Word vectors

When using [word vector backed models](../configuration/vectors#words) with scoring set, a separate call is required before calling `index` as follows:

@@ -113,6 +127,4 @@ embeddings.score(rows)
embeddings.index(rows)
```

Both calls are required to support generator-backed iteration of data with word vector models.
7 changes: 6 additions & 1 deletion docs/embeddings/methods.md
@@ -3,14 +3,19 @@
::: txtai.embeddings.Embeddings
selection:
filters:
- "!columns"
- "!createann"
- "!createcloud"
- "!createdatabase"
- "!creategraph"
- "!createindexes"
- "!createscoring"
- "!checkarchive"
- "!configure"
- "!defaultallowed"
- "!defaults"
- "!initindex"
- "!loadconfig"
- "!loadquery"
- "!loadvectors"
- "!normalize"
- "!saveconfig"
36 changes: 30 additions & 6 deletions docs/embeddings/query.md
@@ -35,24 +35,30 @@ SELECT id, text, score FROM txtai WHERE similar('feel good story')
SELECT id, text, score FROM txtai WHERE similar('feel good story')
```

The similar clause takes the following arguments:

```sql
similar("query", "number of candidates")
similar("query", "number of candidates", "index", "weights")
```

| Argument | Description |
| --------------------- | ---------------------------------------|
| query | natural language query to run |
| number of candidates | number of candidate results to return |
| index | target index name |
| weights | hybrid score weights |

The txtai query layer joins results from two separate components, a relational store and a similarity index. With a similar clause, a similarity search is run and those ids are fed to the underlying database query.

The number of candidates should be larger than the desired number of results when additional filter clauses are applied, ensuring that `limit` results are still returned after filtering. If the number of candidates is not specified, it defaults as follows:

- For a single query filter clause, the default is the query limit
- With multiple filtering clauses, the default is 10x the query limit
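
A sketch combining a similar clause with a score filter, assuming an existing embeddings database with content enabled; the candidate count and threshold are illustrative.

```python
# Request 50 candidates so 10 rows remain after the score filter
embeddings.search(
    "select id, text, score from txtai "
    "where similar('feel good story', 50) and score >= 0.15 "
    "limit 10"
)
```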

The index name is only applicable when [subindexes](../configuration/general/#indexes) are enabled. This specifies the index to use for the query.

Weights sets the hybrid score weights when an index has both a sparse and dense index.

### Dynamic columns

Content can be indexed in multiple ways when content storage is enabled. [Remember that input documents](../#index) take the form of `(id, data, tags)` tuples. If data is a string, then content is primarily filtered with similar clauses. If data is a dictionary, then all fields in the dictionary are indexed and searchable.
@@ -161,10 +167,28 @@ entry >= date('now', '-1 day')

This requires setting a [query translation model](../configuration/database#query). The default query translation model is [t5-small-txtsql](https://huggingface.co/NeuML/t5-small-txtsql) but this can easily be finetuned to handle different use cases.

## Hybrid search

When an embeddings database has both a sparse and dense index, both indexes will be queried and the results will be equally weighted unless otherwise specified.

```python
embeddings.search("query", weights=0.5)
embeddings.search("select id, text, score from txtai where similar('query', 0.5)")
```

## Subindexes

Subindexes can be queried as follows:

```python
embeddings.search("query", index="subindex1")
embeddings.search("select id, text, score from txtai where similar('query', 'subindex1')")
```

## Combined index architecture

txtai has multiple storage and indexing components. Content is stored in an underlying database along with an approximate nearest neighbor (ANN) index, keyword index and graph network. These components combine to deliver similarity search alongside traditional structured search.

The ANN index stores ids and vectors for each input element. When a natural language query is run, the query is translated into a vector and a similarity query finds the best matching ids. When a database is added into the mix, an additional step is executed. This step takes those ids and effectively inserts them as part of the underlying database query. The same steps apply with keyword indexes except a term frequency index is used to find the best matching ids.

Dynamic columns are supported via the underlying engine. For SQLite, data is stored as JSON and dynamic columns are converted into `json_extract` clauses. Client-server databases are supported via [SQLAlchemy](https://docs.sqlalchemy.org/en/20/dialects/) and dynamic columns are supported provided the underlying engine has [JSON](https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqlalchemy.types.JSON) support.
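
A sketch of a dynamic column in action; the field name and data are hypothetical.

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings(content=True)

# "flag" is a hypothetical dynamic field stored with the document
embeddings.index([(0, {"text": "Sample text", "flag": 1}, None)])

# On SQLite, the flag reference is translated to a json_extract clause
print(embeddings.search("select text, flag from txtai where flag = 1"))
```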
