-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not compute hit counts by default #33028
Comments
Pinging @elastic/es-search-aggs |
Another question that popped up is whether we should return "precise" lower bounds or just indicate the lower lower bound. If |
I don't like changing the meaning of |
I'm also not a fan of the options where the meaning of It's also a bit more graceful upgrade path. UIs might be showing a bad value ( |
Also prefer an additional property, On the request side can we create a new property e.g |
We had a long discussion about this issue, here are some arguments that have been made:
Even though we couldn't make a decision, we narrowed down to two main paths: A. Keep same format
One modified version of this plan that has been discussed is to not return This path has the downside of making the transition easier but the response is less explicit: you need to know the threshold that was set in the request in order to know how to interpret B. Make
This path requires more changes on client side and makes it harder to perform generic processing of search responses due to the fact that the format of |
Just want to confirm my understanding of the latest plan: Kibana will not have to do anything for 7.0 and everything will continue to work the same except we may see deprecation warnings. If that changes (e.g. the decision to switch the default in 7.0 changes), or is incorrect, please let me know! |
Sorry I thought that I commented in this issue but for some reasons my comment is not there anymore. |
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested. It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the `track_total_hits` search option. A boolean value (`true`, `false`) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough hits have been collected. In order to ensure that the result is correctly interpreted this commit also adds a new section in the search response that indicates the number of tracked hits and whether the value is a lower bound (`gte`) or the exact count (`eq`): ``` GET /_search { "track_total_hits": 100, "query": { "term": { "title": "fast" } } } ``` ... will return: ``` { "_shards": ... "hits" : { "total" : -1, "tracked_total": { "value": 100, "relation": "gte" }, "max_score" : 0.42, "hits" : [] } } ``` Relates elastic#33028
I opened #35291 to discuss another alternative that doesn't require to change the
I prefer this option because it doesn't require a breaking change in the search response format so users that want exact count (or no count at all) are not impacted and we don't need to handle backward compatibility. |
If the current thinking is to disable Raising this only to make sure we don't phrasing this as backwards compatible. I can get behind both options (an additional field or total as an object) as long as we feel its the cleanest option. I personally much prefer the object approach since we can just have one location for the count and an additional field to indicate how to interpret this value. |
I was just talking about the format of the response no matter if you disable hit counts or not. But I agree that disabling total hit count by default is a breaking change ;).
Personally I prefer option A where we keep the same format. The only scenario where an object is needed is when a user explicitly sets the |
I prefer the object approach because its explicit. Typed languages will get a heads up on the change in defaults. Adittionally you can write pagination over We need a tiebreaker on this one I reckon :) |
We had a discussion internally and we've decided to make
This is inlined with option B above. The revised plan to make this change in 7.0 is the following:
We'll use API versioning to opt out from the new behaviors (format change and default value to track total hits). Official clients in 6x and 7 will send their version in the request automatically which will allow clients in 6x to search in a cluster with mixed versions (6x and 7). |
Great to see that we're changing to the object!
Can you provide more info how that will specifically work, @jimczi ? Does it only mean that a 6.x client will send a different header than a 7.x client, and it's the reponsibility of the user to use two separate clients? How would a 6.x client "know" which header to send otherwise? |
A 6.x client should send an header indicating its version (6) so that nodes in the cluster can adapt their defaults and responses to the version. Clients are forward compatible so during a rolling upgrade to 7 users can still use a 6x client and with this mechanism they will not see the breaking change until they upgrade their client to 7x. There are still open questions regarding this versioning but they should be tracked on the main issue. |
This commit changes the format of the `hits.total` in the search response to be an object with a `value` and a `relation`. The `value` indicates the number of hits that match the query and the `relation` indicates whether the number is accurate (in which case the relation is equals to `eq`) or a lower bound of the total (in which case it is equals to `gte`). This change also adds a parameter called `rest_total_hits_as_int` that can be used in the search APIs to opt out from this change (retrieve the total hits as a number in the rest response). Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain `hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a follow up (to allow numbers to be passed to `track_total_hits`). Relates elastic#33028
The support for rest_total_hits_as_int has already been merged to 6x in elastic#35848 so this change adds this new option to master. The plan was to add this new option as part of elastic#35848 but we've decided to wait a few days before merging this breaking change so this commit just handles the new option as a noop exactly like 6x for now. This will allow users to migrate to this parameter before elastic#35848 is merged. Relates elastic#33028
The support for rest_total_hits_as_int has already been merged to 6x in #35848 so this change adds this new option to master. The plan was to add this new option as part of #35848 but we've decided to wait a few days before merging this breaking change so this commit just handles the new option as a noop exactly like 6x for now. This will allow users to migrate to this parameter before #35848 is merged. Relates #33028
This commit changes the format of the `hits.total` in the search response to be an object with a `value` and a `relation`. The `value` indicates the number of hits that match the query and the `relation` indicates whether the number is accurate (in which case the relation is equals to `eq`) or a lower bound of the total (in which case it is equals to `gte`). This change also adds a parameter called `rest_total_hits_as_int` that can be used in the search APIs to opt out from this change (retrieve the total hits as a number in the rest response). Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain `hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a follow up (to allow numbers to be passed to `track_total_hits`). Relates #33028
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested. It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected. Relates elastic#33028
In Lucene 8 searches can skip non-competitive hits if the total hit count is not requested. It is also possible to track the number of hits up to a certain threshold. This is a trade off to speed up searches while still being able to know a lower bound of the total hit count. This change adds the ability to set this threshold directly in the track_total_hits search option. A boolean value (true, false) indicates whether the total hit count should be tracked in the response. When set as an integer this option allows to compute a lower bound of the total hits while preserving the ability to skip non-competitive hits when enough matches have been collected. Relates #33028
This commit changes the default for the `track_total_hits` option of the search request to `10,000`. This means that by default search requests will accurately track the total hit count up to `10,000` documents, requests that match more than this value will set the `"total.relation"` to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response. Scroll queries are not impacted, they will continue to count the total hits accurately. The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request. I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate. Closes elastic#33028
This commit changes the default for the `track_total_hits` option of the search request to `10,000`. This means that by default search requests will accurately track the total hit count up to `10,000` documents, requests that match more than this value will set the `"total.relation"` to `"gte"` (e.g. greater than or equals) and the `"total.value"` to `10,000` in the search response. Scroll queries are not impacted, they will continue to count the total hits accurately. The default is set back to `true` (accurate hit count) if `rest_total_hits_as_int` is set in the search request. I choose `10,000` as the default because that's also the number we use to limit pagination. This means that users will be able to know how far they can jump (up to 10,000) even if the total number of hits is not accurate. Closes #33028
whatever happened here. I like
But it does not look like this was ever implemented for the java es client? I am trying with the high level rest client in |
This was not implemented (we target the next major version for this) so the error is expected. Instead we introduced a new parameter called |
@jimczi thanks for the quick reply |
In 7.x, will adding the |
Context
Lucene 8 introduces optimizations that allow to compute top hits more efficiently by skipping documents that do not produce competitive scores. We would like to enable this behavior by default so that users can opt in if they need accurate total hit counts, which are costly, rather than the other way around.
Not returning a hit count at all is problematic for traditional search UIs with pagination: say that you want to display up to 5 pages with 10 hits per page, you need to know whether the hit count is between 0 and 10, between 11 and 20, between 21 and 30, between 31 and 40, or greater than 40 in order to know how many pages need to be displayed. In order to address this issue, Lucene takes a configurable threshold: if the hit count is less than this threshold, then you will get an accurate hit count, otherwise you will get a lower bound of the hit count.
We don't want to discuss backward compatibility for now, let's focus what we want to have eventually, and only then discuss how we get there to make the change easy to digest for users. That's fine if we need multiple steps and the whole change is only available in 8 rather than 7.
Response format
We need a way to tell users whether the hit count is accurate or a lower bound. Multiple ideas have been mentioned:
+
as a way to say that the hit count is a lower bound, eg.{ "hits": { "total": "1234+" } }
when the hit count is a lower bound and{ "hits": { "total": 1234 } }
like today otherwise{ "hits": { "total": 1234, "total_hits_relation": "gte" } }
hits.total
an object with a value and a relation, eg.{ "hits": { "total": { "value": 1234, "relation": "gte" } } }
hits.total
an object that has two possible keys but only one is ever set, eg.{ "hits": { "total": { "gte": 1234 } } }
or{ "hits": { "total": { "eq": 1234 } } }
hits.total
at all and return a different field if ie.hits.min
or some better name.When we discussed these options, we felt like 1 would make parsing more complicated, and with 2 it would be too easy to miss the fact that you need to look up another field in order to know how to interpret the hit count.
Implementation options
Option 1: Make
track_total_hits
take a numberWe already have a
track_total_hits
switch which we added for index sorting, but currently take a boolean. It would be easy to make it take a number instead that would be the minimum number of hits to count accurately.We could ease transition to the new response format by using the current format when
track_total_hits
is unset or set to a boolean, and the new format whentrack_total_hits
is set to a number, and then deprecate support for booleans.Users who want accurate hit counts could set a very high value for this parameter, we could potentially allow using a special value like
-1
as a way to mean "be accurate".Option 2: Hardcoded number of hits to count
Always count accurately up to
index.max_result_window
hits. If users need accurate hits, they will need to use an aggregation (we need to add such an aggregation that counts docs).Ping @elastic/es-clients to get opinions about the above thoughts, especially the response format.
The text was updated successfully, but these errors were encountered: