Add the ability to set the number of hits to track accurately
In Lucene 8, searches can skip non-competitive hits if the total hit count is not requested.
It is also possible to track the number of hits up to a certain threshold. This is a trade-off that speeds up searches
while still providing a lower bound on the total hit count. This change adds the ability to set this threshold directly in the `track_total_hits` search option. A boolean value (`true`, `false`) indicates whether the total hit count should be tracked in the response. When set to an integer, this option computes a lower bound of the total hits while preserving the ability to skip non-competitive hits once enough hits have been collected.
To ensure that the result is interpreted correctly, this commit also adds a new section in the search response
that indicates the number of tracked hits and whether the value is a lower bound (`gte`) or the exact count (`eq`):
```
GET /_search
{
    "track_total_hits": 100,
    "query": {
        "term": {
            "title": "fast"
        }
    }
}
```
... will return:
```
{
  "_shards": ...
   "hits" : {
      "total" : -1,
      "tracked_total": {
        "value": 100,
        "relation": "gte"
      },
      "max_score" : 0.42,
      "hits" : []
  }
}
```
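
A client reading this response has to combine `hits.total` with the optional `tracked_total` section. The following is a minimal sketch of that interpretation, assuming the fields have already been parsed from the JSON. The class `TotalHitsDescription` and its `describe` method are hypothetical names used for illustration only; they are not part of Elasticsearch or of this commit.

```java
/** Hypothetical client-side helper; not Elasticsearch code. */
final class TotalHitsDescription {

    private TotalHitsDescription() {}

    /**
     * Turns the hit-count fields of a search response into a readable phrase.
     *
     * @param total        value of hits.total (-1 when the total is unknown)
     * @param trackedValue value of hits.tracked_total.value, or null if the section is absent
     * @param relation     value of hits.tracked_total.relation ("eq" or "gte"), or null if absent
     */
    static String describe(long total, Long trackedValue, String relation) {
        if (trackedValue != null) {
            // The tracked_total section is present: the relation says how to read the value.
            if ("gte".equals(relation)) {
                return "at least " + trackedValue;
            }
            if ("eq".equals(relation)) {
                return "exactly " + trackedValue;
            }
            throw new IllegalArgumentException("unknown relation: " + relation);
        }
        // No tracked_total section: fall back to hits.total, where -1 means "not tracked".
        return total >= 0 ? "exactly " + total : "unknown";
    }
}
```

For the response above, `TotalHitsDescription.describe(-1, 100L, "gte")` yields `at least 100`.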

Relates elastic#33028
jimczi committed Nov 6, 2018
1 parent cac67f8 commit de77e61
Showing 25 changed files with 515 additions and 109 deletions.
2 changes: 1 addition & 1 deletion docs/reference/query-dsl/feature-query.asciidoc
@@ -11,7 +11,7 @@ of the query.
Compared to using <<query-dsl-function-score-query,`function_score`>> or other
ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
-<<search-uri-request,`track_total_hits`>> is set to `false`. Speedups may be
+<<search-request-track-total-hits,`track_total_hits`>> is set to `false`. Speedups may be
spectacular.

Here is an example that indexes various features:
127 changes: 127 additions & 0 deletions docs/reference/search/request/track-total-hits.asciidoc
@@ -0,0 +1,127 @@
[[search-request-track-total-hits]]
=== Track total hits

The `track_total_hits` parameter allows you to configure the number of hits to
count accurately.
When set to `true` the search response will contain the total number of hits
that match the query:

[source,js]
--------------------------------------------------
GET /_search
{
    "track_total_hits": true,
    "query" : {
        "match_all" : {}
    }
}
--------------------------------------------------
// CONSOLE

\... returns:

[source,js]
--------------------------------------------------
{
    "_shards": ...
    "hits" : {
        "total" : 2048, <1>
        "max_score" : 1.0,
        "hits" : []
    }
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"total": 2048/"total": $body.hits.total/]

<1> The total number of hits that match the query.

If you don't need to track the total number of hits, you can set this option
to `false`. In that case the total number of hits is unknown and the search
can efficiently skip non-competitive hits if the query is sorted by relevancy:

[source,js]
--------------------------------------------------
GET /_search
{
    "track_total_hits": false,
    "query": {
        "term": {
            "title": "fast"
        }
    }
}
--------------------------------------------------
// CONSOLE

\... returns:

[source,js]
--------------------------------------------------
{
    "_shards": ...
    "hits" : {
        "total" : -1, <1>
        "max_score" : 0.42,
        "hits" : []
    }
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"max_score": 0\.42/"max_score": $body.hits.max_score/]

<1> The total number of hits is unknown.

The total hit count can't be computed accurately without visiting all matches,
which is costly for queries that match lots of documents. Given that it is
often enough to have a lower bound of the number of hits, such as
"there are more than 1000 hits", it is also possible to set `track_total_hits`
to an integer that represents the number of hits to count accurately. When this
option is set to a number, the search response will contain a new section called
`tracked_total` that contains the number of tracked hits (`tracked_total.value`)
and a relation (`tracked_total.relation`) that indicates if the `value` is
accurate (`eq`) or a lower bound of the total hit count (`gte`):

[source,js]
--------------------------------------------------
GET /_search
{
    "track_total_hits": 100,
    "query": {
        "term": {
            "title": "fast"
        }
    }
}
--------------------------------------------------
// CONSOLE

\... returns:

[source,js]
--------------------------------------------------
{
    "_shards": ...
    "hits" : {
        "total" : -1, <1>
        "tracked_total": { <2>
            "value": 100,
            "relation": "gte"
        },
        "max_score" : 0.42,
        "hits" : []
    }
}
--------------------------------------------------
// TESTRESPONSE[s/"_shards": \.\.\./"_shards": "$body._shards",/]
// TESTRESPONSE[s/"max_score": 0\.42/"max_score": $body.hits.max_score/]
// TESTRESPONSE[s/"value": 100/"value": $body.hits.tracked_total.value/]
// TESTRESPONSE[s/"relation": "gte"/"relation": "$body.hits.tracked_total.relation"/]

<1> The total number of hits is unknown.
<2> There are at least (`gte`) 100 documents that match the query.

Search can also skip non-competitive hits if the query is sorted by
relevancy but the optimization kicks in only after collecting at least
$`track_total_hits` documents. This is a good trade-off to speed up searches
if you don't need the accurate number of hits after a certain threshold.
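
The threshold semantics described above can be sketched as a counter that stops counting once the threshold is reached. This is an illustration under stated assumptions, not the Lucene or Elasticsearch collector: `BoundedHitCounter` and `TrackedTotal` are hypothetical names, and the matches are modelled as a plain iterator. The sketch reports `eq` when the number of matches is exactly equal to the threshold; how the real implementation treats that boundary case is not stated here.

```java
import java.util.Iterator;

/** Hypothetical illustration of counting hits up to a threshold; not Elasticsearch code. */
final class BoundedHitCounter {

    /** Mirrors the shape of the tracked_total section: a value and a relation. */
    static final class TrackedTotal {
        final long value;
        final String relation; // "eq" for an exact count, "gte" for a lower bound

        TrackedTotal(long value, String relation) {
            this.value = value;
            this.relation = relation;
        }
    }

    private BoundedHitCounter() {}

    /**
     * Counts matches accurately up to the threshold and stops visiting them afterwards.
     *
     * @param matches   the matching documents, in any order
     * @param threshold the number of hits to count accurately (assumed to be >= 0)
     */
    static TrackedTotal count(Iterator<?> matches, int threshold) {
        long counted = 0;
        while (matches.hasNext()) {
            if (counted == threshold) {
                // At least one more match exists past the threshold, so the count
                // is only a lower bound and the remaining matches can be skipped.
                return new TrackedTotal(counted, "gte");
            }
            matches.next();
            counted++;
        }
        // Every match was visited before exceeding the threshold: the count is exact.
        return new TrackedTotal(counted, "eq");
    }
}
```

With four matching documents, a threshold of 10 gives a value of 4 with relation `eq`, while a threshold of 3 gives a value of 3 with relation `gte`.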
5 changes: 4 additions & 1 deletion docs/reference/search/uri-request.asciidoc
Original file line number Diff line number Diff line change
@@ -100,8 +100,11 @@ scores and return them as part of each hit.

|`track_total_hits` |Set to `false` in order to disable the tracking
of the total number of hits that match the query.
-(see <<index-modules-index-sorting,_Index Sorting_>> for more details).
Defaults to true.
+It also accepts an integer which in this case represents the number of hits
+to count accurately.
+(see the <<search-request-track-total-hits, request body>> documentation
+for more details).

|`timeout` |A search timeout, bounding the search request to be executed
within the specified time value and bail with the hits accumulated up to
@@ -0,0 +1,90 @@
---
"Track total hits":

  - skip:
      version: " - 6.99.99"
      reason: track_total_hits was introduced in 7.0.0

  - do:
      search:
        index: test_1
        track_total_hits: false

  - match: { hits.total: -1 }
  - is_false: "hits.tracked_total"

  - do:
      search:
        index: test_1
        track_total_hits: true

  - match: { hits.total: 0 }
  - is_false: "hits.tracked_total"

  - do:
      search:
        index: test_1
        track_total_hits: 10

  - match: { hits.total: -1 }
  - match: { hits.tracked_total.value: 0 }
  - match: { hits.tracked_total.relation: "eq" }

  - do:
      index:
        index: test_1
        id: 1
        body: {}

  - do:
      index:
        index: test_1
        id: 2
        body: {}

  - do:
      index:
        index: test_1
        id: 3
        body: {}

  - do:
      index:
        index: test_1
        id: 4
        body: {}

  - do:
      indices.refresh: {}

  - do:
      search:
        index: test_1

  - match: { hits.total: 4 }

  - do:
      search:
        index: test_1
        track_total_hits: false

  - match: { hits.total: -1 }
  - is_false: "hits.tracked_total"

  - do:
      search:
        index: test_1
        track_total_hits: 10

  - match: { hits.total: -1 }
  - match: { hits.tracked_total.value: 4 }
  - match: { hits.tracked_total.relation: "eq" }

  - do:
      search:
        index: test_1
        track_total_hits: 3

  - match: { hits.total: -1 }
  - match: { hits.tracked_total.value: 3 }
  - match: { hits.tracked_total.relation: "gte" }
@@ -34,6 +34,7 @@
import org.elasticsearch.search.SearchShardTarget;
import org.elasticsearch.search.internal.AliasFilter;
import org.elasticsearch.search.internal.InternalSearchResponse;
+import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.search.internal.ShardSearchTransportRequest;
import org.elasticsearch.transport.Transport;

@@ -113,8 +114,10 @@ public final void start() {
        if (getNumShards() == 0) {
            //no search shards to search on, bail with empty response
            //(it happens with search across _all with no indices around and consistent with broadcast operations)
-            listener.onResponse(new SearchResponse(InternalSearchResponse.empty(), null, 0, 0, 0, buildTookInMillis(),
-                ShardSearchFailure.EMPTY_ARRAY, clusters));
+            int trackTotalHitsThreshold = request.source() != null ?
+                request.source().trackTotalHitsThreshold() : SearchContext.DEFAULT_TRACK_TOTAL_HITS;
+            listener.onResponse(new SearchResponse(InternalSearchResponse.empty(trackTotalHitsThreshold), null, 0, 0, 0,
+                buildTookInMillis(), ShardSearchFailure.EMPTY_ARRAY, clusters));
            return;
        }
        executePhase(this);