Exact k-NN with Scoring Script

The k-NN plugin implements the Elasticsearch score script plugin, which you can use to find the exact k-nearest neighbors to a given query point. With the k-NN score script, you can apply a filter to an index before executing the nearest neighbor search. This is useful for dynamic search cases where the index body may vary based on other conditions. Because this approach executes a brute force search, comparing the query point against every candidate document, it does not scale as well as the approximate approach. In some cases, it may be better to refactor your workflow or index structure to use the approximate approach instead.

Getting started with the score script

As with approximate nearest neighbor search, to use the score script on a body of vectors, you must first create an index with one or more knn_vector fields. If you intend to use only the score script approach (and not the approximate approach), you can set index.knn to false and leave index.knn.space_type unset, because you choose the space type at search time. See the spaces section for the spaces that the k-NN score script supports. Here is an example that creates an index with two knn_vector fields:

PUT my-knn-index-1
{
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      },
      "my_vector2": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}

Note -- For binary spaces, such as the Hamming bit space, the field's type needs to be either binary or long. Encode the binary data as a base64 string (for a binary field) or as a long (for a long field, if the data is 64 bits or less).
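As a rough illustration of the two encodings mentioned in the note, the following Python sketch turns the same 8 bytes into a base64 string and into a signed 64-bit integer. The sample bytes and the big-endian byte order are assumptions made for this example only; the note above does not specify a byte order.

```python
import base64
import struct

# Hypothetical 64-bit binary feature, written as 8 raw bytes.
raw = bytes([0b10101010, 0, 0, 0, 0, 0, 0, 0b00001111])

# Option 1: a base64 string, suitable for a field of type binary.
as_base64 = base64.b64encode(raw).decode("ascii")

# Option 2: a signed 64-bit integer, suitable for a field of type long.
# ">q" means big-endian signed 64-bit; the byte order is an assumption here.
(as_long,) = struct.unpack(">q", raw)

print(as_base64)
print(as_long)
```

Because the first bit of the sample is set, the long value comes out negative, which is expected for a signed long.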

If you only want to use the score script, you can omit "index.knn": true. The benefit of this approach is faster indexing speed and lower memory usage, but you lose the ability to perform standard k-NN queries on the index. {: .tip}

After you create the index, you can add some data to it:

POST _bulk
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{ "my_vector1": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{ "my_vector1": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{ "my_vector1": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
{ "my_vector1": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "5" } }
{ "my_vector1": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-knn-index-1", "_id": "6" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-knn-index-1", "_id": "7" } }
{ "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-knn-index-1", "_id": "8" } }
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }
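If you generate the bulk request from application code, a small helper along these lines can build the newline-delimited body. This is only a sketch: `bulk_body` is a hypothetical helper name, and sending the body to the cluster is left out.

```python
import json

def bulk_body(index, docs):
    """Return an NDJSON _bulk body: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline character.
    return "\n".join(lines) + "\n"

# Two of the documents from the request above.
docs = [
    ("1", {"my_vector1": [1.5, 2.5], "price": 12.2}),
    ("6", {"my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3}),
]
body = bulk_body("my-knn-index-1", docs)
print(body)
```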

Finally, you can execute an exact nearest neighbor search on the data using the knn script:

GET my-knn-index-1/_search
{
  "size": 4,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "my_vector2",
          "query_value": [2.0, 3.0, 5.0, 6.0],
          "space_type": "cosinesimil"
        }
      }
    }
  }
}

All parameters are required.

  • lang is the script type. This value is usually painless, but here you must specify knn.

  • source is the name of the script, knn_score.

    This script is part of the k-NN plugin and isn't available at the standard _scripts path. A GET request to _cluster/state/metadata doesn't return it, either.

  • field is the field that contains your vector data.

  • query_value is the point you want to find the nearest neighbors for. For the Euclidean and cosine similarity spaces, the value must be an array of floats that matches the dimension set in the field's mapping. For Hamming bit distance, this value can be either of type signed long or a base64-encoded string (for the long and binary field types, respectively).

  • space_type corresponds to the distance function. See the spaces section.

Note -- After ODFE 1.11, vector was replaced by query_value due to the addition of the bithamming space.
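To get a feel for what the cosinesimil search above computes, the following Python sketch scores the four my_vector2 documents against the same query_value, using the conversion from the spaces section (cosine similarity plus 1). It is an illustration of the arithmetic, not the plugin's implementation. Documents without a my_vector2 field are left out, because how the script treats them is not covered here.

```python
import math

def cosinesimil_score(a, b):
    """Cosine similarity shifted by 1, following the cosinesimil conversion."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 + dot / (norm_a * norm_b)

# my_vector2 values from the bulk request, keyed by _id.
docs = {
    "6": [1.5, 5.5, 4.5, 6.4],
    "7": [2.5, 3.5, 5.6, 6.7],
    "8": [4.5, 5.5, 6.7, 3.7],
    "9": [1.5, 5.5, 4.5, 6.4],
}
query_value = [2.0, 3.0, 5.0, 6.0]

# A higher score means a closer neighbor, so sort in descending order.
ranked = sorted(
    docs,
    key=lambda doc_id: cosinesimil_score(query_value, docs[doc_id]),
    reverse=True,
)
for doc_id in ranked:
    print(doc_id, round(cosinesimil_score(query_value, docs[doc_id]), 4))
```

In this sketch, document 7 ranks first and document 8 last; documents 6 and 9 hold identical vectors and therefore tie.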

The post filter example in the approximate approach shows a search that returns fewer than k results. If you want to avoid this situation, the score script method lets you essentially invert the order of events. In other words, you can first filter down the set of documents and then execute the k-nearest neighbor search over only the documents that remain.

This example shows a pre-filter approach to k-NN search with the score script. First, create the index:

PUT my-knn-index-2
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2
      },
      "color": {
        "type": "keyword"
      }
    }
  }
}

Then add some documents:

POST _bulk
{ "index": { "_index": "my-knn-index-2", "_id": "1" } }
{ "my_vector": [1, 1], "color" : "RED" }
{ "index": { "_index": "my-knn-index-2", "_id": "2" } }
{ "my_vector": [2, 2], "color" : "RED" }
{ "index": { "_index": "my-knn-index-2", "_id": "3" } }
{ "my_vector": [3, 3], "color" : "RED" }
{ "index": { "_index": "my-knn-index-2", "_id": "4" } }
{ "my_vector": [10, 10], "color" : "BLUE" }
{ "index": { "_index": "my-knn-index-2", "_id": "5" } }
{ "my_vector": [20, 20], "color" : "BLUE" }
{ "index": { "_index": "my-knn-index-2", "_id": "6" } }
{ "my_vector": [30, 30], "color" : "BLUE" }

Finally, use the script_score query to pre-filter your documents before identifying nearest neighbors:

GET my-knn-index-2/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "color": "BLUE"
            }
          }
        }
      },
      "script": {
        "lang": "knn",
        "source": "knn_score",
        "params": {
          "field": "my_vector",
          "query_value": [9.9, 9.9],
          "space_type": "l2"
        }
      }
    }
  }
}
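The difference between filtering before and after the neighbor search can be sketched in plain Python with the six documents above. This is an illustration only: both variants use an exact scan here, whereas the real post-filter case runs on top of the approximate search, and the helper names `l2_score` and `top_k` are hypothetical.

```python
def l2_score(a, b):
    """l2 conversion from the spaces section: 1 / (1 + sum of squared differences)."""
    dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 / (1 + dist)

# (_id, my_vector, color) for the documents indexed above.
docs = [
    ("1", [1, 1], "RED"),
    ("2", [2, 2], "RED"),
    ("3", [3, 3], "RED"),
    ("4", [10, 10], "BLUE"),
    ("5", [20, 20], "BLUE"),
    ("6", [30, 30], "BLUE"),
]
query_value = [9.9, 9.9]
k = 2

def top_k(candidates, k):
    """Return the k candidates with the highest l2 score."""
    ranked = sorted(candidates, key=lambda d: l2_score(query_value, d[1]), reverse=True)
    return ranked[:k]

# Pre-filter: keep only BLUE documents, then take the k nearest.
pre_filtered = [d[0] for d in top_k([d for d in docs if d[2] == "BLUE"], k)]

# Post-filter: take the k nearest overall, then drop non-BLUE documents.
post_filtered = [d[0] for d in top_k(docs, k) if d[2] == "BLUE"]

print(pre_filtered)   # ['4', '5']
print(post_filtered)  # ['4']
```

The pre-filter variant always returns k results when at least k documents match the filter, while the post-filter variant loses document 3 to the filter and ends up with a single hit.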

Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower distance equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. The following table includes the conversions to Elasticsearch scores:

| spaceType | Distance Function | Elasticsearch Score |
| :--- | :--- | :--- |
| l2 | \[ Distance(X, Y) = \sum_{i=1}^n (X_i - Y_i)^2 \] | 1 / (1 + Distance Function) |
| l1 | \[ Distance(X, Y) = \sum_{i=1}^n \lvert X_i - Y_i \rvert \] | 1 / (1 + Distance Function) |
| cosinesimil | \[ \frac{A \cdot B}{\lVert A \rVert \cdot \lVert B \rVert} = \frac{\sum_{i=1}^n A_i \cdot B_i}{\sqrt{\sum_{i=1}^n A_i^2} \cdot \sqrt{\sum_{i=1}^n B_i^2}} \] where \(\lVert A \rVert\) and \(\lVert B \rVert\) are the norms (magnitudes) of the vectors. | 1 + Distance Function |
| hammingbit | Distance = countSetBits(X \(\oplus\) Y) | 1 / (1 + Distance Function) |

Cosine similarity returns a number between -1 and 1, and because Elasticsearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score.
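The two spaces that the search examples in this document do not exercise, l1 and hammingbit, can be sketched in Python as follows. This mirrors the table's formulas rather than the plugin's code. Representing the binary values as Python integers and masking to 64 bits (so that negative signed longs are compared by bit pattern) are assumptions made for this example.

```python
def l1_score(x, y):
    """1 / (1 + sum of absolute differences)."""
    dist = sum(abs(a - b) for a, b in zip(x, y))
    return 1 / (1 + dist)

def hammingbit_score(x, y):
    """1 / (1 + number of differing bits) for two 64-bit values given as ints."""
    # Mask to 64 bits so negative (signed long) inputs are compared by bit pattern.
    differing_bits = bin((x ^ y) & 0xFFFFFFFFFFFFFFFF).count("1")
    return 1 / (1 + differing_bits)

print(l1_score([0, 0], [3, 4]))          # 1 / (1 + 7) = 0.125
print(hammingbit_score(0b1010, 0b0110))  # 2 differing bits: 1 / 3
print(hammingbit_score(-1, 0))           # all 64 bits differ: 1 / 65
```

In both cases identical inputs give a distance of 0 and therefore the maximum score of 1.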