
Recurring searches with the same request for dense_vector exhibit consistency issues in the results. #119180

Open
Zona-hu opened this issue Dec 20, 2024 · 11 comments
Labels: >bug, priority:normal, :Search Relevance/Vectors, Team:Search Relevance

Comments

Zona-hu commented Dec 20, 2024

Elasticsearch Version

8.17.0

Installed Plugins

No response

Java Version

openjdk version "23" 2024-09-17
OpenJDK Runtime Environment (build 23+37-2369)
OpenJDK 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)

OS Version

Linux debian-002 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

Problem Description

In an index with no replicas and no ongoing writes, repeating the same vector request can yield inconsistent results.
The issue is reproducible on 8.13.4, 8.15.1, and 8.17.0, but not on 8.7.0.

Steps to Reproduce

Here are the steps to reproduce the issue:

  1. Create index
curl --location --request PUT 'http://elasticsearch:9200/vector_test' \
--header 'Content-Type: application/json' \
--data '{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": true,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,
                    "ef_construction": 100
                }
            }
        }
    },
    "settings": {
        "index": {
            "routing": {
                "allocation": {
                    "include": {
                        "_tier_preference": "data_content"
                    }
                }
            },
            "refresh_interval": "30s",
            "number_of_shards": "1",
            "number_of_replicas": "0"
        }
    }
}'
  2. Write 10,000 random vectors, then force a _refresh.
# -*- coding:utf-8 -*-

import json
import time

import numpy as np
import requests

REFRESH_URL = 'http://elasticsearch:9200/vector_test/_refresh'
BULK_URL = 'http://elasticsearch:9200/vector_test/_bulk'

request = requests.session()

# Generate a random vector with 1024 dimensions, where each value is a floating-point number between -1 and 1
def float32_uniform(min_value, max_value):
    random_float = np.random.uniform(min_value, max_value)
    return float(random_float)


def write():
    tmp_str = ''
    count = 0
    for doc_id in range(10000):
        vector = [float32_uniform(-1, 1) for _ in range(1024)]
        data = {'vector': vector}
        tmp_str += '{"index":{"_id":"' + str(doc_id) + '"}}\n' + json.dumps(data) + '\n'
        count += 1
        # Flush in batches of exactly 1,000 documents. (Incrementing the
        # counter before the check avoids the off-by-one that made the
        # first batch contain 1,001 documents.)
        if count == 1000:
            res = request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str)
            print(res.text)
            tmp_str = ''
            count = 0
            time.sleep(0.2)
    if tmp_str:
        print(request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str).json())
    request.post(REFRESH_URL)
    print("write success.")


if __name__ == '__main__':
    write()

  3. Run the reproduction test. The experiment is repeated 100 times; each iteration constructs a random vector and issues the same request 100 times.
# -*- coding:utf-8 -*-

import json

import numpy as np
import requests

SEARCH_URL = 'http://elasticsearch:9200/vector_test/_search'

request = requests.session()


def float32_uniform(min_value, max_value):
    random_float = np.random.uniform(min_value, max_value)
    return float(random_float)


def request_test(loop_count, k, num_candidates):
    vector = [float32_uniform(-1, 1) for _ in range(1024)]
    body = {"from": 0, "size": 10,
            "knn": {"field": "vector", "query_vector": vector, "k": k, "num_candidates": num_candidates},
            "_source": False}
    # Count how often each distinct result set is returned.
    result_dict = {}
    for _ in range(loop_count):
        response = request.post(url=SEARCH_URL, json=body).json()
        hits_str = json.dumps(response['hits']['hits'], ensure_ascii=False)
        result_dict[hits_str] = result_dict.get(hits_str, 0) + 1

    # The most frequent result set is the baseline; every other response
    # counts as an inconsistency.
    base_count = max(result_dict.values())
    error_count = loop_count - base_count
    print('{}/{}'.format(base_count, error_count))
    return base_count, error_count


if __name__ == '__main__':
    success = total = 0
    for i in range(100):
        base, error_count = request_test(loop_count=100, k=10, num_candidates=20)
        success += base
        total += base + error_count
    print('{}/{}'.format(success, total))

Below are the test results from version 8.17.0, which show consistency issues; versions 8.13.4 and 8.15.1 have the same problem.
(screenshot of test output)

The following are the test results from version 8.7.0, where I measured consistency at 100%.
(screenshot of test output)

Logs (if relevant)

No response

Zona-hu added the >bug and needs:triage labels Dec 20, 2024
Zona-hu (Author) commented Dec 20, 2024

If the index is force-merged into a single segment with _forcemerge, the results become stable again.
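For reference, against the vector_test index from the reproduction steps above, the force-merge would look like this (a sketch assuming the same live cluster host used earlier; max_num_segments=1 is what collapses the index into one segment):

```shell
# Merge the index down to one segment. With a single segment there is no
# cross-segment candidate collection, so repeated identical queries return
# identical hits. The merge itself can be expensive on large indices.
curl --location --request POST \
  'http://elasticsearch:9200/vector_test/_forcemerge?max_num_segments=1'
```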

Zona-hu closed this as completed Dec 20, 2024
Zona-hu reopened this Dec 20, 2024
gbanasiak added the :Search Relevance/Vectors label Dec 24, 2024
elasticsearchmachine added the Team:Search Relevance label Dec 24, 2024
elasticsearchmachine (Collaborator) commented
Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine removed the needs:triage label Dec 24, 2024
Zona-hu (Author) commented Jan 2, 2025

@gbanasiak @elasticsearchmachine
Could you please confirm if this issue has been reproduced and verified?

tteofili (Contributor) commented Jan 2, 2025

I tested this on main over 3 different runs of the attached scripts.

I got:
10000/10000
9977/10000
9977/10000
so it seems like this behavior is still present.

benwtrent added the priority:normal label Jan 9, 2025
tteofili (Contributor) commented
After a few tests across different versions, it seems this non-determinism was introduced in 8.13 as part of apache/lucene#12962.
One way to mitigate the issue is to increase num_candidates; e.g., with the current scripts, setting it to 500 makes results consistent in 9/10 runs on 8.13 as well.
(Thanks @benwtrent for helping me out here.)
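As a sketch of this mitigation, the only change needed in the reproduction script is the num_candidates value in the search body (build_knn_body is a helper name introduced here for illustration, not part of the original scripts):

```python
import json


def build_knn_body(vector, k=10, num_candidates=500, size=10):
    """Search body from the reproduction script, with a larger
    num_candidates to widen the HNSW candidate pool per query."""
    return {
        "from": 0,
        "size": size,
        "knn": {
            "field": "vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": False,
    }


# The reproduction would then call request_test(loop_count=100, k=10,
# num_candidates=500) instead of num_candidates=20.
body = build_knn_body([0.0] * 1024)
print(json.dumps(body)[:40])
```

The trade-off is latency: a larger candidate pool means more vectors visited per segment, so the mitigation costs query time in exchange for stability.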

Zona-hu (Author) commented Jan 21, 2025

after a few tests across different versions, it seems this non-determinism was introduced in 8.13 as part of apache/lucene#12962 . one way to mitigate this issue is to increase num_candidates , e.g., with the current scripts, setting it to 500 makes results consistent on 9/10 runs on 8.13 too. (thanks @benwtrent for helping me out here).

Yes, I have also verified that increasing num_candidates alleviates the problem. However, our workload involves billions of documents, 1024-dimensional vectors, 2 TB of storage, and continuous data ingestion, with over a thousand segments. For knn + filter searches, increasing num_candidates does help, but the consistency problem still exists.

To secure my year-end bonus, I had to add caching at the business level to keep search results consistent over short periods. I believe a search engine should guarantee the consistency of results first, and optimize performance and recall on that basis.
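For reference, a knn + filter request of the kind described above can be sketched like this (the category field and its value are hypothetical; the vector_test index earlier in this thread only maps vector):

```python
def build_filtered_knn_body(vector, k=10, num_candidates=500):
    # The filter clause inside knn restricts which documents are eligible
    # during graph traversal; num_candidates still controls how many
    # candidates are collected per query. "category" is a hypothetical
    # keyword field, not part of the vector_test mapping above.
    return {
        "size": 10,
        "knn": {
            "field": "vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,
            "filter": {"term": {"category": "books"}},
        },
        "_source": False,
    }


body = build_filtered_knn_body([0.0] * 1024)
```

With restrictive filters, fewer eligible candidates survive traversal, which is one reason a larger num_candidates only partially alleviates the inconsistency in filtered searches.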

tteofili (Contributor) commented
I believe that a search engine should ensure the consistency of results, and on this basis, further optimize performance and recall rate.

This is a sensitive subject, and different folks might have slightly different opinions here. However, apache/lucene#12962 introduces non-determinism that depends on the concurrent execution of search threads, which is probably not possible to constrain. The other option we have is to see whether it's possible to enable/disable that behavior.

benwtrent (Member) commented
Agreed, we should make this better, either by fixing the information sharing or by better handling concurrency as a whole.

We should also open a bug in Apache Lucene that more precisely describes the issue and we can work on it.

tveasey (Contributor) commented Jan 24, 2025

There are ways to share information that lead to deterministic results, at the expense of some synchronisation overhead. I'd written up some notes on this previously.

One thing I always come back to on this sort of "problem" is: does determinism actually matter?

From a philosophical standpoint, there are many somewhat random processes behind the exact order in which vectors match a query (starting with random weight initialisation, followed by stochastic gradient descent during model training). From this perspective, any of these result sets is probably similarly valid. It also isn't clear there are real use cases for repeatedly running a query and comparing results.

I guess the counterargument is that it might make debugging systems which include this component harder. However, this behaviour has been present for almost a year and no one had noticed or cared to report it until now.

peter-strsr commented
@tveasey I see your point, but there are real counter-arguments for why deterministic behavior is important.

  1. I know of people whose integration tests fail because of this non-deterministic behavior, which is frustrating for the development teams.

  2. I know of use cases where customers use knn for e-commerce search and run concurrent requests for aggregations created according to some business logic. These aggregations are expected to match the results displayed to the user.
    For some use cases you can use post-filters and do everything in one request, but sometimes it even requires multiple steps. One example is a percentiles aggregation followed by a bucket aggregation to display price buckets to the customer, depending on the distribution of prices in the result set.

  3. End users might be confused why they see different results when executing the same query multiple times. It's even more confusing when a user sees 5 results, filters on a facet that advertises 3 matches, and ends up with 4 results.

tveasey (Contributor) commented Jan 27, 2025

Thanks for the extra context @peter-strsr. I wanted to play devil's advocate about how important this is, to make sure the effort (and the potential performance hit) is justified. These do seem like valid considerations, although part of it could be addressed by user education (assuming actual result quality from run to run isn't worse). In any case, I can see the case for an option, or even for switching to a deterministic approach if it can be made almost as fast.
