
Repeated identical dense_vector kNN searches return inconsistent results #119180

Open

Zona-hu opened this issue Dec 20, 2024 · 2 comments

Zona-hu commented Dec 20, 2024

Elasticsearch Version

8.17.0

Installed Plugins

No response

Java Version

openjdk version "23" 2024-09-17
OpenJDK Runtime Environment (build 23+37-2369)
OpenJDK 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)

OS Version

Linux debian-002 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

Problem Description

On an index with no replicas and no ongoing writes, repeating the same kNN vector search sometimes returns different results.
The issue is reproducible on versions 8.13.4, 8.15.1, and 8.17.0, but not on 8.7.0, which suggests the regression was introduced after 8.7.0.

Steps to Reproduce

Here are the steps to reproduce the issue:

  1. Create index
curl --location --request PUT 'http://elasticsearch:9200/vector_test' \
--header 'Content-Type: application/json' \
--data '{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": true,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,
                    "ef_construction": 100
                }
            }
        }
    },
    "settings": {
        "index": {
            "routing": {
                "allocation": {
                    "include": {
                        "_tier_preference": "data_content"
                    }
                }
            },
            "refresh_interval": "30s",
            "number_of_shards": "1",
            "number_of_replicas": "0"
        }
    }
}'
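Once the index is created, the mapping can be checked to confirm the dense_vector field and HNSW index options were applied (a quick sanity check, not part of the original report):

curl --location --request GET 'http://elasticsearch:9200/vector_test/_mapping?pretty'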
  2. Index 10,000 random vectors, then force a _refresh.
# -*- coding:utf-8 -*-

import json
import time

import numpy as np
import requests

REFRESH_URL = 'http://elasticsearch:9200/vector_test/_refresh'
BULK_URL = 'http://elasticsearch:9200/vector_test/_bulk'

request = requests.session()


def float32_uniform(min_value, max_value):
    """Return a single random float drawn uniformly from [min_value, max_value)."""
    return float(np.random.uniform(min_value, max_value))


def write():
    """Bulk index 10,000 random 1024-dimensional vectors in batches of 1,000, then refresh."""
    tmp_str = ''
    count = 0
    for doc_id in range(10000):
        # Random 1024-dimensional vector with values between -1 and 1.
        vector = [float32_uniform(-1, 1) for _ in range(1024)]
        data = {'vector': vector}
        tmp_str += '{"index":{"_id":"' + str(doc_id) + '"}}\n' + json.dumps(data) + '\n'
        count += 1
        if count == 1000:
            res = request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str)
            print(res.text)
            tmp_str = ''
            count = 0
            time.sleep(0.2)
    if count != 0 and tmp_str != '':
        print(request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str).json())
    # Refresh so that all documents are searchable before the test starts.
    request.post(REFRESH_URL)
    print("write success.")


if __name__ == '__main__':
    write()
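After the bulk load and refresh complete, the document count can be confirmed before running the search test (an optional check, not part of the original script):

curl --location --request GET 'http://elasticsearch:9200/vector_test/_count?pretty'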

  3. Run the reproduction test. The experiment is repeated 100 times: each iteration constructs a random query vector and sends the same search request 100 times.
# -*- coding:utf-8 -*-

import json

import numpy as np
import requests

SEARCH_URL = 'http://elasticsearch:9200/vector_test/_search'

request = requests.session()


def float32_uniform(min_value, max_value):
    """Return a single random float drawn uniformly from [min_value, max_value)."""
    return float(np.random.uniform(min_value, max_value))


def request_test(loop_count, k, num_candidates):
    """Send the same kNN search loop_count times and count how often each distinct hit list occurs."""
    vector = [float32_uniform(-1, 1) for _ in range(1024)]
    body = {"from": 0, "size": 10,
            "knn": {"field": "vector", "query_vector": vector, "k": k, "num_candidates": num_candidates},
            "_source": False}
    result_dict = {}
    for _ in range(loop_count):
        response = request.post(url=SEARCH_URL, json=body).json()
        hits = response['hits']['hits']
        hits_str = json.dumps(hits, ensure_ascii=False)
        result_dict[hits_str] = result_dict.get(hits_str, 0) + 1

    # Treat the most frequent hit list as the baseline; every other response is an inconsistency.
    base_count = max(result_dict.values())
    error_count = loop_count - base_count
    print('{}/{}'.format(base_count, error_count))
    return base_count, error_count


if __name__ == '__main__':
    success = total = 0
    for _ in range(100):
        base, error_count = request_test(loop_count=100, k=10, num_candidates=20)
        success += base
        total += base + error_count
    print('{}/{}'.format(success, total))
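Each per-iteration line prints the count of the most common result set followed by the count of requests that returned something different, and the final line prints the aggregate across all 100 iterations, so a fully consistent run would end with 10000/10000.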

Below are the test results from version 8.17.0, which show the consistency problem; versions 8.13.4 and 8.15.1 behave the same way.
[Screenshot: 8.17.0 test output showing inconsistent results across repeated identical requests]

The following are the test results from version 8.7.0, where the results are 100% consistent.
[Screenshot: 8.7.0 test output showing fully consistent results]

Logs (if relevant)

No response

Zona-hu added the >bug and needs:triage labels Dec 20, 2024

Zona-hu (Author) commented Dec 20, 2024

If the index is force merged into a single segment with _forcemerge, the results become stable again.
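Not part of the original comment, but for convenience, the per-shard segment count can be inspected and the force merge applied with the standard APIs:

curl --location --request GET 'http://elasticsearch:9200/_cat/segments/vector_test?v'

curl --location --request POST 'http://elasticsearch:9200/vector_test/_forcemerge?max_num_segments=1'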

Zona-hu closed this as completed Dec 20, 2024
Zona-hu reopened this Dec 20, 2024
gbanasiak added the :Search Relevance/Vectors label Dec 24, 2024
elasticsearchmachine added the Team:Search Relevance label Dec 24, 2024

elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine removed the needs:triage label Dec 24, 2024