
Recurring searches with the same request for dense_vector exhibit consistency issues in the results. #119180

Open
Zona-hu opened this issue Dec 20, 2024 · 11 comments
Labels: >bug, priority:normal, :Search Relevance/Vectors, Team:Search Relevance

Comments

Zona-hu commented Dec 20, 2024

Elasticsearch Version

8.17.0

Installed Plugins

No response

Java Version

openjdk version "23" 2024-09-17
OpenJDK Runtime Environment (build 23+37-2369)
OpenJDK 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)

OS Version

Linux debian-002 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

Problem Description

In an index with no replicas and no ongoing writes, repeating the same vector request can yield inconsistent results.
The issue is reproducible on 8.13.4, 8.15.1, and 8.17.0, but not on 8.7.0.

Steps to Reproduce

Here are the steps to reproduce the issue:

  1. Create index
curl --location --request PUT 'http://elasticsearch:9200/vector_test' \
--header 'Content-Type: application/json' \
--data '{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": true,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,
                    "ef_construction": 100
                }
            }
        }
    },
    "settings": {
        "index": {
            "routing": {
                "allocation": {
                    "include": {
                        "_tier_preference": "data_content"
                    }
                }
            },
            "refresh_interval": "30s",
            "number_of_shards": "1",
            "number_of_replicas": "0"
        }
    }
}'
  2. Write 10,000 random vectors, then force a _refresh.
# -*- coding:utf-8 -*-

import json
import time

import numpy as np
import requests

REFRESH_URL = 'http://elasticsearch:9200/vector_test/_refresh'
BULK_URL = 'http://elasticsearch:9200/vector_test/_bulk'

request = requests.session()

# Generate a random vector with 1024 dimensions, where each value is a floating-point number between -1 and 1
def float32_uniform(min_value, max_value):
    random_float = np.random.uniform(min_value, max_value)
    return float(random_float)


def write():
    tmp_str = ''
    count = 0
    for doc_id in range(10000):
        vector = [float32_uniform(-1, 1) for _ in range(1024)]
        data = {'vector': vector}
        tmp_str += '{"index":{"_id":"' + str(doc_id) + '"}}\n' + json.dumps(data) + '\n'
        count += 1
        # Flush in batches of exactly 1,000 documents. (Incrementing the
        # counter before the check avoids the off-by-one that made the
        # first batch contain 1,001 documents.)
        if count == 1000:
            res = request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str)
            print(res.text)
            tmp_str = ''
            count = 0
            time.sleep(0.2)
    if tmp_str:
        print(request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str).json())
    request.post(REFRESH_URL)
    print("write success.")


if __name__ == '__main__':
    write()

  3. Run the reproduction test. The experiment is repeated 100 times; each iteration constructs a random vector and issues the same request 100 times.
# -*- coding:utf-8 -*-

import json

import numpy as np
import requests

SEARCH_URL = 'http://elasticsearch:9200/vector_test/_search'

request = requests.session()


def float32_uniform(min_value, max_value):
    random_float = np.random.uniform(min_value, max_value)
    return float(random_float)


def request_test(loop_count, k, num_candidates):
    vector = [float32_uniform(-1, 1) for _ in range(1024)]
    body = {"from": 0, "size": 10,
            "knn": {"field": "vector", "query_vector": vector, "k": k, "num_candidates": num_candidates},
            "_source": False}
    # Count how often each distinct result set is returned.
    result_dict = {}
    for _ in range(loop_count):
        response = request.post(url=SEARCH_URL, json=body).json()
        hits_str = json.dumps(response['hits']['hits'], ensure_ascii=False)
        result_dict[hits_str] = result_dict.get(hits_str, 0) + 1

    # The most frequent result set is the baseline; every other response
    # counts as an inconsistency.
    base_count = max(result_dict.values())
    error_count = loop_count - base_count
    print('{}/{}'.format(base_count, error_count))
    return base_count, error_count


if __name__ == '__main__':
    success = total = 0
    for i in range(100):
        base, error_count = request_test(loop_count=100, k=10, num_candidates=20)
        success += base
        total += base + error_count
    print('{}/{}'.format(success, total))

Below are the test results from version 8.17.0, which show consistency issues; versions 8.13.4 and 8.15.1 have the same problem.
(screenshot of test output)

The following are the test results from version 8.7.0, where I measured consistency at 100%.
(screenshot of test output)

Logs (if relevant)

No response

Zona-hu added the >bug and needs:triage labels Dec 20, 2024
Zona-hu (Author) commented Dec 20, 2024

If the index is force-merged into a single segment with _forcemerge, the results become stable again.
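For reference, against the vector_test index from the reproduction steps above, the force-merge would look like this (a sketch assuming the same live cluster host used earlier; max_num_segments=1 is what collapses the index into one segment):

```shell
# Merge the index down to one segment. With a single segment there is no
# cross-segment candidate collection, so repeated identical queries return
# identical hits. The merge itself can be expensive on large indices.
curl --location --request POST \
  'http://elasticsearch:9200/vector_test/_forcemerge?max_num_segments=1'
```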

Zona-hu closed this as completed Dec 20, 2024
Zona-hu reopened this Dec 20, 2024
gbanasiak added the :Search Relevance/Vectors label Dec 24, 2024
elasticsearchmachine added the Team:Search Relevance label Dec 24, 2024
elasticsearchmachine (Collaborator) commented
Pinging @elastic/es-search-relevance (Team:Search Relevance)

elasticsearchmachine removed the needs:triage label Dec 24, 2024
Zona-hu (Author) commented Jan 2, 2025

@gbanasiak @elasticsearchmachine
Could you please confirm if this issue has been reproduced and verified?

tteofili (Contributor) commented Jan 2, 2025

I tested this on main over 3 different runs of the attached scripts.

I got:
10000/10000
9977/10000
9977/10000
so it seems like this behavior is still present.

benwtrent added the priority:normal label Jan 9, 2025
tteofili (Contributor) commented
After a few tests across different versions, it seems this non-determinism was introduced in 8.13 as part of apache/lucene#12962.
One way to mitigate the issue is to increase num_candidates; e.g., with the current scripts, setting it to 500 makes results consistent in 9/10 runs on 8.13 as well.
(Thanks @benwtrent for helping me out here.)
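As a sketch of this mitigation, the only change needed in the reproduction script is the num_candidates value in the search body (build_knn_body is a helper name introduced here for illustration, not part of the original scripts):

```python
import json


def build_knn_body(vector, k=10, num_candidates=500, size=10):
    """Search body from the reproduction script, with a larger
    num_candidates to widen the HNSW candidate pool per query."""
    return {
        "from": 0,
        "size": size,
        "knn": {
            "field": "vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": False,
    }


# The reproduction would then call request_test(loop_count=100, k=10,
# num_candidates=500) instead of num_candidates=20.
body = build_knn_body([0.0] * 1024)
print(json.dumps(body)[:40])
```

The trade-off is latency: a larger candidate pool means more vectors visited per segment, so the mitigation costs query time in exchange for stability.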

Zona-hu (Author) commented Jan 21, 2025

after a few tests across different versions, it seems this non-determinism was introduced in 8.13 as part of apache/lucene#12962 . one way to mitigate this issue is to increase num_candidates , e.g., with the current scripts, setting it to 500 makes results consistent on 9/10 runs on 8.13 too. (thanks @benwtrent for helping me out here).

Yes, I have also verified that increasing num_candidates alleviates the problem. However, our workload involves billions of documents, 1024-dimensional vectors, 2 TB of storage, and continuous data ingestion, with over a thousand segments. For knn + filter searches, increasing num_candidates does help, but the consistency problem still exists.

To secure my year-end bonus, I had to add caching at the business level to keep search results consistent over short periods. I believe a search engine should guarantee the consistency of results first, and optimize performance and recall on that basis.
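For reference, a knn + filter request of the kind described above can be sketched like this (the category field and its value are hypothetical; the vector_test index earlier in this thread only maps vector):

```python
def build_filtered_knn_body(vector, k=10, num_candidates=500):
    # The filter clause inside knn restricts which documents are eligible
    # during graph traversal; num_candidates still controls how many
    # candidates are collected per query. "category" is a hypothetical
    # keyword field, not part of the vector_test mapping above.
    return {
        "size": 10,
        "knn": {
            "field": "vector",
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,
            "filter": {"term": {"category": "books"}},
        },
        "_source": False,
    }


body = build_filtered_knn_body([0.0] * 1024)
```

With restrictive filters, fewer eligible candidates survive traversal, which is one reason a larger num_candidates only partially alleviates the inconsistency in filtered searches.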

tteofili (Contributor) commented
I believe that a search engine should ensure the consistency of results, and on this basis, further optimize performance and recall rate.

This is a sensitive subject, and different folks might have slightly different opinions here. However, apache/lucene#12962 introduces non-determinism that depends on the concurrent execution of search threads, which is probably not possible to constrain. The other option we have is to see whether it's possible to enable/disable that behavior.

benwtrent (Member) commented
Agreed, we should make this better, either by fixing the information sharing or by better handling concurrency as a whole.

We should also open a bug in Apache Lucene that more precisely describes the issue and we can work on it.

tveasey (Contributor) commented Jan 24, 2025

There are ways to share information that lead to deterministic results, at the expense of some synchronisation overhead. I'd written up some notes on this previously.

One thing I always come back to on this sort of "problem" is: does determinism actually matter?

From a philosophical standpoint, there are many somewhat random processes behind the exact order in which vectors match a query (starting with random weight initialisation, followed by stochastic gradient descent during model training). From this perspective, any of these result sets is probably similarly valid. It also isn't clear there are real use cases for repeatedly running a query and comparing results.

I guess the counterargument is that it might make debugging systems which include this component harder. However, this behaviour has been present for almost a year and no one had noticed or cared to report it until now.

peter-strsr commented
@tveasey I see your point, but there are real counter-arguments for why deterministic behavior is important.

  1. I know of people whose integration tests fail because of this non-deterministic behavior, which is frustrating for the development teams.

  2. I know of use cases where customers use knn for e-commerce search and run concurrent requests for aggregations created according to some business logic. These aggregations are expected to match the results displayed to the user.
    For some use cases you can use post-filters and do everything in one request, but sometimes it even requires multiple steps. One example is a percentiles aggregation followed by a bucket aggregation to display price buckets to the customer, depending on the distribution of prices in the result set.

  3. End users might be confused why they see different results when executing the same query multiple times. It's even more confusing when a user sees 5 results, filters on a facet that advertises 3 matches, and ends up with 4 results.

tveasey (Contributor) commented Jan 27, 2025

Thanks for the extra context @peter-strsr. I wanted to play devil's advocate about how important this is, to make sure the effort (and the potential performance hit) is justified. These do seem like valid considerations, although part of it could be addressed by user education (assuming actual result quality from run to run isn't worse). In any case, I can see the case for an option, or even for switching to a deterministic approach if it can be made almost as fast.
