Recurring searches with the same request for dense_vector exhibit consistency issues in the results. #119180
Comments
If the index is forcibly merged into a single segment with forcemerge, the results become stable again. |
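For reference, the force merge mentioned above goes through the `_forcemerge` endpoint with `max_num_segments=1`. The following is only a sketch that builds the request without sending it; the base URL, the index name `my-vectors`, and the helper `build_forcemerge_request` are assumptions for illustration, not part of the original report.

```python
import urllib.parse
import urllib.request


def build_forcemerge_request(base_url: str, index: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that merges `index` into one segment.

    `base_url` and `index` are placeholders; adjust them for your cluster.
    """
    path = f"/{urllib.parse.quote(index, safe='')}/_forcemerge"
    query = urllib.parse.urlencode({"max_num_segments": 1})
    return urllib.request.Request(f"{base_url.rstrip('/')}{path}?{query}", method="POST")


# Hypothetical local cluster and index name.
req = build_forcemerge_request("http://localhost:9200", "my-vectors")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`, plus whatever authentication the cluster requires. A force merge on a large index is expensive, so this is a diagnostic step, not something to run routinely on an index that is still being written to.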
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
@gbanasiak @elasticsearchmachine |
I tested this on I got: |
After a few tests across different versions, it seems this non-determinism was introduced in 8.13 as part of apache/lucene#12962. |
Yes, I have also tested that increasing num_candidates can alleviate the problem. However, our business involves data at the billion level, with 1024-dimensional vectors, 2T of storage, and continuous data ingestion. There are over a thousand segments. For knn + filter search, increasing num_candidates does indeed play a certain role in alleviating the issue, but the consistency problem still exists. In order to secure my year-end bonus, I had to add caching at the business level to ensure the consistency of search results in a short period of time. I believe that a search engine should ensure the consistency of results, and on this basis, further optimize performance and recall rate. |
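The business-level caching described above could look roughly like the sketch below. It is not the commenter's actual implementation: the class `TtlSearchCache`, the `search_fn` callable, and the 60-second default are all hypothetical. The idea is to key on a canonical form of the request body and return the stored result until the entry expires.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


class TtlSearchCache:
    """Pin the result of a search request for `ttl_seconds` (hypothetical helper)."""

    def __init__(
        self,
        search_fn: Callable[[dict], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._search_fn = search_fn  # e.g. a wrapper around the real search call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def _key(request_body: dict) -> str:
        # Sort keys so that logically identical requests share one cache entry.
        canonical = json.dumps(request_body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def search(self, request_body: dict) -> Any:
        key = self._key(request_body)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: repeat the earlier answer
        result = self._search_fn(request_body)
        self._entries[key] = (now, result)
        return result
```

This only hides the variation within the TTL window; it does not make the underlying search deterministic, and expired entries are overwritten on the next lookup but never evicted, so a real deployment would need a size bound.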
This is a sensitive subject, and different folks might have slightly different opinions here. However, apache/lucene#12962 introduces some non-determinism that depends on the concurrent execution of search threads, which is probably not possible to constrain. The other option we have is to see whether it's possible to enable/disable that. |
Agreed, we should make this better. Either by fixing the information sharing or better handling concurrency as a whole. We should also open a bug in Apache Lucene that more precisely describes the issue and we can work on it. |
There are ways to share information which lead to deterministic results at the expense of some synchronisation overhead. I'd written up some notes on this previously. One thing I always come back to on this sort of "problem" is: does determinism actually matter? From a philosophical standpoint there are many somewhat random processes which lead to the exact order in which vectors match a query (starting with random weight initialisation followed by stochastic gradient descent in the model training). From this perspective any of these result sets are probably similarly valid. It also isn't clear there are really use cases for repeatedly running a query and comparing results. I guess the counterargument is that it might make debugging systems which include this component harder. However, this behaviour has been present for almost a year and no one has actually noticed or cared to report it. |
@tveasey I see your point, but there are real counter arguments why deterministic behavior is important.
|
Thanks for the extra context @peter-strsr. I wanted to play devil's advocate regarding how important this is, to make sure we need to spend the effort (and potentially take a hit on performance). These do seem like valid considerations, although I would say part of it could be addressed by user education (assuming actual quality from run to run isn't worse). In any case I can see a case for an option, or even switching to a deterministic approach if it can be made almost as fast. |
Elasticsearch Version
8.17.0
Installed Plugins
No response
Java Version
openjdk version "23" 2024-09-17 OpenJDK Runtime Environment (build 23+37-2369) OpenJDK 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)
OS Version
Linux debian-002 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
Problem Description
In an index without replicas, with no data being written, some vector requests, when repeated, yield inconsistent results.
This issue is reproducible in versions 8.13.4, 8.15.1, and 8.17.0, but cannot be reproduced in version 8.7.0, which suggests the bug is not present in 8.7.0.
Steps to Reproduce
Here are the steps to reproduce the issue:
Below are the test results from version 8.17.0, which show consistency issues; versions 8.13.4 and 8.15.1 also have the same problem.
The following are the test results from version 8.7.0, and I have assessed the consistency to be 100%.
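The original test script and its output are not shown here, so the following is only one possible way to quantify "consistency" across repeated runs, under the assumption that each run returns an ordered list of document IDs. The function `consistency_report` and the `run_query` callable are hypothetical names; the sketch compares every repeat against the first run.

```python
from typing import Callable, Dict, List, Sequence


def consistency_report(run_query: Callable[[], Sequence[str]], repeats: int = 20) -> Dict[str, float]:
    """Run the same query `repeats` times and compare each run with the first one."""
    if repeats < 2:
        raise ValueError("need at least two runs to compare")
    baseline: List[str] = list(run_query())
    identical = 0
    overlaps: List[float] = []
    for _ in range(repeats - 1):
        ids = list(run_query())
        if ids == baseline:
            identical += 1  # same IDs in the same order
        union = set(baseline) | set(ids)
        # Jaccard overlap ignores ordering; two empty result sets count as identical.
        overlaps.append(len(set(baseline) & set(ids)) / len(union) if union else 1.0)
    return {
        "exact_match_rate": identical / (repeats - 1),
        "mean_jaccard": sum(overlaps) / len(overlaps),
    }
```

Under this measure, "100% consistency" would correspond to an `exact_match_rate` of 1.0. Here `run_query` would wrap the actual kNN request and return the hit IDs.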
Logs (if relevant)
No response