Add documentation for k-NN Faiss SQfp16 #6249

Merged 21 commits on Mar 29, 2024

Changes from 2 commits
70 changes: 65 additions & 5 deletions _search-plugins/knn/knn-index.md
@@ -15,6 +15,16 @@ The k-NN plugin introduces a custom data type, the `knn_vector`, that allows use

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).

## SIMD optimization for Faiss

Starting with k-NN plugin version 2.13, [SIMD (Single Instruction, Multiple Data)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) is supported by default on Linux machines only for the Faiss engine if the underlying processor supports SIMD instructions (`AVX2` on the `x64` architecture and `NEON` on the `ARM64` architecture). SIMD support helps boost overall performance.
For the `x64` architecture, two versions of the Faiss library (`libopensearchknn_faiss.so` and `libopensearchknn_faiss_avx2.so`) are built and shipped with the artifact; the library with the `_avx2` suffix contains the AVX2 SIMD instructions. At runtime, the k-NN plugin detects whether the underlying system supports AVX2 and loads the corresponding library.

You can override this behavior and load the default Faiss library (`libopensearchknn_faiss.so`) even if the system supports AVX2 by setting the static setting `knn.faiss.avx2.disabled` to `true` in `opensearch.yml` (default is `false`).
{: .note}

For the `ARM64` architecture, only one Faiss library (`libopensearchknn_faiss.so`) is built and shipped. It contains the NEON SIMD instructions and, unlike AVX2, cannot be disabled.

## Method definitions

A method definition refers to the underlying configuration of the Approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model).
@@ -48,7 +58,7 @@ For nmslib, *ef_search* is set in the [index settings](#index-settings).
An index created in OpenSearch version 2.11 or earlier will still use the old `ef_construction` value (`512`).
{: .note}

### Supported Faiss methods

Method name | Requires training | Supported spaces | Description
:--- | :--- | :--- | :---
@@ -122,10 +132,10 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e
}
```

### Supported Faiss encoders

You can use encoders to reduce the memory footprint of a k-NN index at the expense of search accuracy. Faiss has
several encoder types, but the plugin currently only supports `flat`, `pq`, and `sq` encoding.

Member:

> Faiss has several encoder types, but the plugin currently only supports flat, pq, and sq encoding

k-NN plugin currently supports flat, pq, and sq encoders from the Faiss library?

Member Author: ack


The following example method definition specifies the `hnsw` method and a `pq` encoder:
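
A minimal sketch of such a definition, assuming the `pq` parameters `m` and `code_size` described in the PQ parameters section below, might look like the following:

```json
"method": {
  "name": "hnsw",
  "engine": "faiss",
  "space_type": "l2",
  "parameters": {
    "encoder": {
      "name": "pq",
      "parameters": {
        "code_size": 8,
        "m": 8
      }
    },
    "ef_construction": 256,
    "m": 8
  }
}
```

Because `pq` requires training, a definition like this is typically used when creating a model with the Train API rather than directly in an index mapping.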

@@ -153,6 +163,7 @@ Encoder name | Requires training | Description
:--- | :--- | :---
`flat` | false | Encode vectors as floating point arrays. This encoding does not reduce memory footprint.
`pq` | true | An abbreviation for _product quantization_, it is a lossy compression technique that uses clustering to encode a vector into a fixed size of bytes, with the goal of minimizing the drop in k-NN search accuracy. At a high level, vectors are broken up into `m` subvectors, and then each subvector is represented by a `code_size` code obtained from a code book produced during training. For more information about product quantization, see [this blog post](https://medium.com/dotstar/understanding-faiss-part-2-79d90b1e5388).
`sq` | false | An abbreviation for _scalar quantization_. Starting with k-NN plugin version 2.13, you can use the `sq` encoder (by default, [SQFP16]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization#faiss-scalar-quantization-fp16)) to quantize 32-bit floating-point vectors into 16-bit floats using the built-in Faiss ScalarQuantizer in order to reduce the memory footprint with a minimal loss of precision. In addition to optimizing memory use, the `sq` encoder improves overall performance through SIMD optimization (`AVX2` on the `x86` architecture and `NEON` on the `ARM` architecture).

#### Examples

@@ -204,13 +215,62 @@ The following example uses the `hnsw` method without specifying an encoder (by d
}
```

The following example uses the `hnsw` method with an `sq` encoder of type `fp16` and `clip` enabled:

```json
"method": {
"name":"hnsw",
"engine":"faiss",
"space_type": "l2",
"parameters":{
"encoder": {
"name": "sq",
"parameters": {
"type": "fp16",
"clip": true
}
},
"ef_construction": 256,
"m": 8
}
}
```

The following example uses the `ivf` method with an `sq` encoder of type `fp16`:

```json
"method": {
"name":"ivf",
"engine":"faiss",
"space_type": "l2",
"parameters":{
"encoder": {
"name": "sq",
"parameters": {
"type": "fp16",
"clip": false
}
},
"nprobes": 2
}
}
```


#### PQ parameters

Parameter name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`m` | false | 1 | false | Determines the number of subvectors into which to break the vector. Subvectors are encoded independently of each other. The vector dimension must be divisible by `m`. Maximum value is 1,024.
`code_size` | false | 8 | false | Determines the number of bits into which to encode a subvector. Maximum value is 8. For IVF, this value must be less than or equal to 8. For HNSW, this value can only be 8.

#### SQ parameters

Parameter name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`type` | false | fp16 | false | Determines the type of scalar quantization used to encode the 32-bit float vectors into the corresponding type. Default is `fp16`.

@natebower (Collaborator), Mar 21, 2024:

Suggested change:
> `type` | false | fp16 | false | Determines the type of scalar quantization to be used to encode the 32 bit float vectors into the corresponding type. By default, it is `fp16`.
> `type` | `false` | `fp16` | `false` | Determines the type of scalar quantization used to encode the 32-bit float vectors into the corresponding type. Default is `fp16`.

`clip` | false | false | false | When set to `true`, clips any vector elements that are outside of the supported range to bring them into the range. When set to `false`, the request is rejected with an exception if any vector element is out of range.

### Choosing the right method

There are a lot of options to choose from when building your `knn_vector` field. To determine the correct methods and parameters to choose, you should first understand what requirements you have for your workload and what trade-offs you are willing to make. Factors to consider are (1) query latency, (2) query quality, (3) memory limits, (4) indexing latency.
103 changes: 103 additions & 0 deletions _search-plugins/knn/knn-vector-quantization.md
@@ -0,0 +1,103 @@
---
layout: default
title: k-NN vector quantization
nav_order: 50
parent: k-NN search
grand_parent: Search methods
has_children: false
has_math: true
---

# k-NN vector quantization

By default, the OpenSearch k-NN plugin supports the indexing and querying of vectors of type `float`, where each vector dimension occupies 4 bytes of memory. For use cases that require ingestion at a large scale, this quickly becomes expensive because graphs (for the native engines `nmslib` and `faiss`) must be constructed, loaded, saved, and searched, which consumes even more memory. To reduce the memory footprint, you can use the vector quantization features supported by the k-NN plugin.

Collaborator: The second sentence here is muddled and needs some revision. Please tag me on the rewrite.


## Lucene byte vector

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to reduce the amount of memory needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
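
For reference, a minimal sketch of a mapping that uses a Lucene byte vector might look like the following. The index name `test-byte-index`, the field name `my_byte_vector`, and the dimension are illustrative, and the `data_type` mapping parameter is covered on the linked page:

```json
PUT /test-byte-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_byte_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "engine": "lucene"
        }
      }
    }
  }
}
```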

## Faiss scalar quantization fp16

Starting with k-NN plugin version 2.13, you can ingest `fp16` vectors with the `faiss` engine. When you provide 32-bit floating-point vectors, the Faiss engine quantizes them into FP16 vectors using scalar quantization (no quantization is required on your end), stores them, and decodes them back to FP32 for distance computation during search operations. With this feature, you can reduce the memory footprint by a factor of 2 and significantly reduce search latencies (with [SIMD optimization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#simd-optimization-for-faiss)), with a minimal loss of recall (depending on the distribution of the vectors).

Collaborator: From this point forward, it looks like we need to do a little more cleanup and revision.

To use this feature, set the `encoder` name to `sq`. To specify the type of scalar quantization, use the new optional `type` field in the encoder parameters. The data you index must be within the FP16 range of [-65504.0, 65504.0]. If the data lies outside of this range, an exception is thrown and the request is rejected.

There is also an optional encoder parameter, `clip` (`false` by default). If `clip` is set to `true` in the index mapping, any data that lies outside of the FP16 range is clipped to the FP16 minimum (`-65504.0`) or maximum (`65504.0`) value and ingested into the index without an exception. However, clipping the values may cause a drop in recall.

For example, when `clip` is set to `true`, `65510.82` is clipped and indexed as `65504.0`, and `-65504.1` is clipped and indexed as `-65504.0`.

Setting `clip` to `true` is recommended only when most of the vector elements are within the FP16 range and very few elements lie outside of the range.
{: .note}

* `type` - Set this to `fp16` to quantize the indexed vectors into FP16 using Faiss SQfp16. Default is `fp16`.
* `clip` - Set this to `true` to skip the FP16 validation check and clip out-of-range vector values to the FP16 minimum or maximum. If set to `false` and any vector element is out of range, the request is rejected with an exception. Default is `false`.

The following example method definition uses Faiss SQfp16 with `clip` set to `true`:
```json
"method": {
"name":"hnsw",
"engine":"faiss",
"space_type": "l2",
"parameters":{
"encoder":{
"name":"sq",
"parameters":{
"type": "fp16",
"clip": true
}
}
}
}

```
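
For illustration, a sketch of a complete index mapping that wraps this method definition might look like the following. The index name `test-index`, the field name `my_vector1`, and a dimension of 3 are assumed so that they line up with the ingestion and query examples below:

```json
PUT /test-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "encoder": {
              "name": "sq",
              "parameters": {
                "type": "fp16",
                "clip": true
              }
            }
          }
        }
      }
    }
  }
}
```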

During ingestion, if `clip` is set to `false`, make sure each dimension of the vector is in the supported range [-65504.0, 65504.0]:
```json
PUT test-index/_doc/1
{
  "my_vector1": [-65504.0, 65503.845, 55.82]
}
```

During querying, there is no range limitation for the query vector:
```json
GET test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [265436.876, -120906.256, 99.84],
        "k": 2
      }
    }
  }
}
```

### Memory estimation

Ideally, Faiss SQfp16 requires approximately 50% of the memory consumed by FP32 vectors.

#### HNSW memory estimation

The memory required for HNSW is estimated to be `1.1 * (2 * dimension + 8 * M)` bytes/vector.

As an example, assume you have a million vectors with a dimension of 256 and M of 16. The memory requirement can be estimated as follows:

```
1.1 * (2 * 256 + 8 * 16) * 1,000,000 ~= 0.656 GB
```

#### IVF memory estimation

The memory required for IVF is estimated to be `1.1 * (((2 * dimension) * num_vectors) + (4 * nlist * dimension))` bytes.

As an example, assume you have a million vectors with a dimension of 256 and `nlist` of 128. The memory requirement can be estimated as follows:

```
1.1 * (((2 * 256) * 1,000,000) + (4 * 128 * 256)) ~= 0.525 GB

```
