Add documentation for k-NN Faiss SQfp16 #6249

Merged 21 commits on Mar 29, 2024

Changes from 2 commits
70 changes: 65 additions & 5 deletions _search-plugins/knn/knn-index.md
@@ -15,6 +15,16 @@ The k-NN plugin introduces a custom data type, the `knn_vector`, that allows use

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).

## SIMD optimization for Faiss

Starting with k-NN plugin version 2.13, [SIMD (Single Instruction, Multiple Data)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) is supported by default on Linux machines only for the Faiss engine if the underlying processor supports SIMD instructions (`AVX2` on the `x64` architecture and `NEON` on the `ARM64` architecture). SIMD support helps boost overall performance.
For the `x64` architecture, two versions of the Faiss library (`libopensearchknn_faiss.so` and `libopensearchknn_faiss_avx2.so`) are built and shipped with the artifact; the library with the `_avx2` suffix contains the AVX2 SIMD instructions. At runtime, the k-NN plugin detects whether the underlying system supports AVX2 and loads the corresponding library.

You can override this behavior and load the default Faiss library (`libopensearchknn_faiss.so`) even if the system supports AVX2 by setting the static setting `knn.faiss.avx2.disabled` to `true` in `opensearch.yml` (default is `false`).
{: .note}

For the `ARM64` architecture, only one Faiss library (`libopensearchknn_faiss.so`) is built and shipped. It contains the NEON SIMD instructions and, unlike AVX2, cannot be disabled.

## Method definitions

A method definition refers to the underlying configuration of the Approximate k-NN algorithm you want to use. Method definitions are used to either create a `knn_vector` field (when the method does not require training) or [create a model during training]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model) that can then be used to [create a `knn_vector` field]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model).
@@ -48,7 +58,7 @@ For nmslib, *ef_search* is set in the [index settings](#index-settings).
An index created in OpenSearch version 2.11 or earlier will still use the old `ef_construction` value (`512`).
{: .note}

### Supported Faiss methods

Method name | Requires training | Supported spaces | Description
:--- | :--- | :--- | :---
@@ -122,10 +132,10 @@ An index created in OpenSearch version 2.11 or earlier will still use the old `e
}
```

### Supported Faiss encoders

You can use encoders to reduce the memory footprint of a k-NN index at the expense of search accuracy. Faiss has
several encoder types, but the plugin currently only supports `flat`, `pq`, and `sq` encoding.

Member:

> Faiss has several encoder types, but the plugin currently only supports flat, pq, and sq encoding

k-NN plugin currently supports flat, pq, and sq encoders from the Faiss library?

Member Author: ack


The following example method definition specifies the `hnsw` method and a `pq` encoder:
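
A minimal sketch of such a definition, assuming the `pq` parameters `m` and `code_size` described in the PQ parameters section below, might look like the following:

```json
"method": {
  "name": "hnsw",
  "engine": "faiss",
  "space_type": "l2",
  "parameters": {
    "encoder": {
      "name": "pq",
      "parameters": {
        "code_size": 8,
        "m": 8
      }
    },
    "ef_construction": 256,
    "m": 8
  }
}
```

Because `pq` requires training, a definition like this is typically used when creating a model with the Train API rather than directly in an index mapping.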

@@ -153,6 +163,7 @@ Encoder name | Requires training | Description
:--- | :--- | :---
`flat` | false | Encode vectors as floating point arrays. This encoding does not reduce memory footprint.
`pq` | true | An abbreviation for _product quantization_, it is a lossy compression technique that uses clustering to encode a vector into a fixed size of bytes, with the goal of minimizing the drop in k-NN search accuracy. At a high level, vectors are broken up into `m` subvectors, and then each subvector is represented by a `code_size` code obtained from a code book produced during training. For more information about product quantization, see [this blog post](https://medium.com/dotstar/understanding-faiss-part-2-79d90b1e5388).
`sq` | false | An abbreviation for _scalar quantization_. Starting with k-NN plugin version 2.13, you can use the `sq` encoder (by default, [SQFP16]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-vector-quantization#faiss-scalar-quantization-fp16)) to quantize 32-bit floating-point vectors into 16-bit floats using the built-in Faiss ScalarQuantizer in order to reduce the memory footprint with a minimal loss of precision. In addition to optimizing memory use, the `sq` encoder improves overall performance through SIMD optimization (`AVX2` on the `x86` architecture and `NEON` on the `ARM` architecture).

#### Examples

@@ -204,13 +215,62 @@ The following example uses the `hnsw` method without specifying an encoder (by d
}
```

The following example uses the `hnsw` method with an `sq` encoder of type `fp16` and `clip` enabled:

```json
"method": {
"name":"hnsw",
"engine":"faiss",
"space_type": "l2",
"parameters":{
"encoder": {
"name": "sq",
"parameters": {
"type": "fp16",
"clip": true
}
},
"ef_construction": 256,
"m": 8
}
}
```

The following example uses the `ivf` method with an `sq` encoder of type `fp16`:

```json
"method": {
"name":"ivf",
"engine":"faiss",
"space_type": "l2",
"parameters":{
"encoder": {
"name": "sq",
"parameters": {
"type": "fp16",
"clip": false
}
},
"nprobes": 2
}
}
```


#### PQ parameters

Parameter name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`m` | false | 1 | false | Determines the number of subvectors into which to break the vector. Subvectors are encoded independently of each other. The vector dimension must be divisible by `m`. Maximum value is 1,024.
`code_size` | false | 8 | false | Determines the number of bits into which to encode a subvector. Maximum value is 8. For IVF, this value must be less than or equal to 8. For HNSW, this value can only be 8.

#### SQ parameters

Parameter name | Required | Default | Updatable | Description
:--- | :--- | :--- | :--- | :---
`type` | false | fp16 | false | Determines the type of scalar quantization used to encode the 32-bit float vectors into the corresponding type. Default is `fp16`.

@natebower (Collaborator), Mar 21, 2024:

Suggested change:
> `type` | false | fp16 | false | Determines the type of scalar quantization to be used to encode the 32 bit float vectors into the corresponding type. By default, it is `fp16`.
> `type` | `false` | `fp16` | `false` | Determines the type of scalar quantization used to encode the 32-bit float vectors into the corresponding type. Default is `fp16`.

`clip` | false | false | false | When set to `true`, clips any vector elements that are outside of the supported range to bring them into the range. When set to `false`, the request is rejected with an exception if any vector element is out of range.

### Choosing the right method

There are a lot of options to choose from when building your `knn_vector` field. To determine the correct methods and parameters to choose, you should first understand what requirements you have for your workload and what trade-offs you are willing to make. Factors to consider are (1) query latency, (2) query quality, (3) memory limits, (4) indexing latency.
103 changes: 103 additions & 0 deletions _search-plugins/knn/knn-vector-quantization.md
@@ -0,0 +1,103 @@
---
layout: default
title: k-NN vector quantization
nav_order: 50
parent: k-NN search
grand_parent: Search methods
has_children: false
has_math: true
---

# k-NN vector quantization

By default, the OpenSearch k-NN plugin supports the indexing and querying of vectors of type `float`, where each vector dimension occupies 4 bytes of memory. For use cases that require ingestion at a large scale, this quickly becomes expensive because graphs (for the native engines `nmslib` and `faiss`) must be constructed, loaded, saved, and searched, which consumes even more memory. To reduce the memory footprint, you can use the vector quantization features supported by the k-NN plugin.

Collaborator: The second sentence here is muddled and needs some revision. Please tag me on the rewrite.


## Lucene byte vector

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine in order to reduce the amount of memory needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
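
For reference, a minimal sketch of a mapping that uses a Lucene byte vector might look like the following. The index name `test-byte-index`, the field name `my_byte_vector`, and the dimension are illustrative, and the `data_type` mapping parameter is covered on the linked page:

```json
PUT /test-byte-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_byte_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "engine": "lucene"
        }
      }
    }
  }
}
```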

## Faiss scalar quantization fp16

Starting with k-NN plugin version 2.13, you can ingest `fp16` vectors with the `faiss` engine. When you provide 32-bit floating-point vectors, the Faiss engine quantizes them into FP16 vectors using scalar quantization (no quantization is required on your end), stores them, and decodes them back to FP32 for distance computation during search operations. With this feature, you can reduce the memory footprint by a factor of 2 and significantly reduce search latencies (with [SIMD optimization]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#simd-optimization-for-faiss)), with a minimal loss of recall (depending on the distribution of the vectors).

Collaborator: From this point forward, it looks like we need to do a little more cleanup and revision.

To use this feature, set the `encoder` name to `sq`. To specify the type of scalar quantization, use the new optional `type` field in the encoder parameters. The data you index must be within the FP16 range of [-65504.0, 65504.0]. If the data lies outside of this range, an exception is thrown and the request is rejected.

There is also an optional encoder parameter, `clip` (`false` by default). If `clip` is set to `true` in the index mapping, any data that lies outside of the FP16 range is clipped to the FP16 minimum (`-65504.0`) or maximum (`65504.0`) value and ingested into the index without an exception. However, clipping the values may cause a drop in recall.

For example, when `clip` is set to `true`, `65510.82` is clipped and indexed as `65504.0`, and `-65504.1` is clipped and indexed as `-65504.0`.

Setting `clip` to `true` is recommended only when most of the vector elements are within the FP16 range and very few elements lie outside of the range.
{: .note}

* `type` - Set this to `fp16` to quantize the indexed vectors into FP16 using Faiss SQfp16. Default is `fp16`.
* `clip` - Set this to `true` to skip the FP16 validation check and clip out-of-range vector values to the FP16 minimum or maximum. If set to `false` and any vector element is out of range, the request is rejected with an exception. Default is `false`.

The following example method definition uses Faiss SQfp16 with `clip` set to `true`:
```json
"method": {
"name":"hnsw",
"engine":"faiss",
"space_type": "l2",
"parameters":{
"encoder":{
"name":"sq",
"parameters":{
"type": "fp16",
"clip": true
}
}
}
}

```
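
For illustration, a sketch of a complete index mapping that wraps this method definition might look like the following. The index name `test-index`, the field name `my_vector1`, and a dimension of 3 are assumed so that they line up with the ingestion and query examples below:

```json
PUT /test-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "encoder": {
              "name": "sq",
              "parameters": {
                "type": "fp16",
                "clip": true
              }
            }
          }
        }
      }
    }
  }
}
```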

During ingestion, if `clip` is set to `false`, make sure each dimension of the vector is in the supported range [-65504.0, 65504.0]:
```json
PUT test-index/_doc/1
{
  "my_vector1": [-65504.0, 65503.845, 55.82]
}
```

During querying, there is no range limitation for the query vector:
```json
GET test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [265436.876, -120906.256, 99.84],
        "k": 2
      }
    }
  }
}
```

### Memory estimation

Ideally, Faiss SQfp16 requires approximately 50% of the memory consumed by FP32 vectors.

#### HNSW memory estimation

The memory required for HNSW is estimated to be `1.1 * (2 * dimension + 8 * M)` bytes/vector.

As an example, assume you have a million vectors with a dimension of 256 and M of 16. The memory requirement can be estimated as follows:

```
1.1 * (2 * 256 + 8 * 16) * 1,000,000 ~= 0.656 GB
```

#### IVF memory estimation

The memory required for IVF is estimated to be `1.1 * (((2 * dimension) * num_vectors) + (4 * nlist * dimension))` bytes.

As an example, assume you have a million vectors with a dimension of 256 and `nlist` of 128. The memory requirement can be estimated as follows:

```
1.1 * (((2 * 256) * 1,000,000) + (4 * 128 * 256)) ~= 0.525 GB

```
