ES caps "total hits" at 10K? #166

Closed
philbudne opened this issue Nov 20, 2023 · 2 comments

@philbudne
Contributor

It seems "total hits" on a search is capped at 10K.

There appears to have been quite a cluster-truck around this topic, including at least one breaking change in the ES API.

https://opster.com/guides/elasticsearch/search-apis/elasticsearch-count-query/
says:

Note that if your index contains more than 10000 documents and you need an exact count, you need to include "track_total_hits": true as shown below (note that depending on your index size, this can be costly)

Current API documentation on track_total_hits (it's an int; no, it's a bool, NO! IT'S BOTH!!):
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#track-total-hits
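To make the bool-or-int point concrete, here is a minimal sketch of a search request body that sets `track_total_hits` explicitly. The helper name `build_search_body` and the `query_string` query are assumptions for illustration, not code from any of our repos; only the `track_total_hits` field itself comes from the ES docs linked above.

```python
import json


def build_search_body(query_string, track_total_hits=True):
    """Build a search request body with explicit total-hit tracking.

    Hypothetical helper. track_total_hits may be:
      True  -> count all matching documents exactly (can be costly)
      False -> do not track the total at all
      int   -> count accurately up to that many hits
    """
    if not isinstance(track_total_hits, (bool, int)):
        raise TypeError("track_total_hits must be a bool or an int")
    return {
        "query": {"query_string": {"query": query_string}},
        "track_total_hits": track_total_hits,
    }


# Exact count requested:
print(json.dumps(build_search_body("climate"), indent=2))
# Accurate only up to 50,000 hits:
print(json.dumps(build_search_body("climate", 50000), indent=2))
```

Leaving the field out gives the default behaviour, which is where the 10K cap comes from.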

A number of related GitHub issues:

ES: "Do not compute hit counts by default #33028" elastic/elasticsearch#33028
is a VOLUMINOUS thread, starting with:

Lucene 8 introduces optimizations that allow to compute top hits more efficiently by skipping documents that do not produce competitive scores. We would like to enable this behavior by default so that users can opt in if they need accurate total hit counts, which are costly, rather than the other way around.

Kibana: "ES will eventually disable hit counts - affects APM UI #25862" elastic/kibana#25862

And ES PRs:

"Add rest_total_hits_as_int in the search APIs #35848" (7.x branch?) elastic/elasticsearch#35848
"Make hits.total an object in the search response #35849" (6.x branch?) elastic/elasticsearch#35849
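Those two PRs mean a client can see `hits.total` in either of two shapes: a plain integer (old format, or when `rest_total_hits_as_int=true` is passed) or an object with `value` and `relation`. A rough sketch of handling both follows; the function name `total_hits` is made up here, and treating the integer form as exact is an assumption.

```python
def total_hits(response):
    """Return (count, is_exact) from a search response dict.

    Hypothetical helper that accepts both shapes of hits.total.
    """
    total = response["hits"]["total"]
    if isinstance(total, int):
        # Integer form: pre-7.0 responses, or rest_total_hits_as_int=true.
        # Assumption: the integer form carries an exact count.
        return total, True
    # Object form: relation is "eq" for an exact count and "gte" when
    # the value is only a lower bound (e.g. the 10K cap was hit).
    return total["value"], total["relation"] == "eq"


capped = {"hits": {"total": {"value": 10000, "relation": "gte"}}}
print(total_hits(capped))
```

A `relation` of `gte` is the signal that the reported number is a floor, not the real total.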

P.S.
The IA web_collection_search Dockerfile https://github.com/internetarchive/web_collection_search/blob/main/Dockerfile is wired to use >=7.0,<8.0 client code; I wonder if it's related?

@rahulbot
Contributor

rahulbot commented Nov 20, 2023

Ugh. That's ridiculous (from a database user perspective). This is critical data we need to show users on every search result. Could we:
a. sum the attention-over-time data and use that to provide an estimated total? (Django front-end server could do this, or mediacloud-news-search library used by mc-providers)
b. turn on the expensive solution in staging and measure true impact on our data?
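Option (a) could look roughly like the sketch below. It assumes the attention-over-time data arrives as date-histogram-style buckets, each with a `doc_count`; the bucket layout and the function name are assumptions, not the actual shape returned by our API. Aggregation counts are computed over all matching documents, so they should not be subject to the 10K cap on `hits.total`.

```python
def estimate_total_from_buckets(buckets):
    """Sum per-period document counts into one overall total.

    Hypothetical helper; assumes each bucket is a dict with an integer
    "doc_count", as in a date_histogram aggregation result.
    """
    return sum(bucket["doc_count"] for bucket in buckets)


# Made-up sample data, for illustration only.
sample_buckets = [
    {"key_as_string": "2023-11-18", "doc_count": 7000},
    {"key_as_string": "2023-11-19", "doc_count": 6500},
    {"key_as_string": "2023-11-20", "doc_count": 8200},
]
print(estimate_total_from_buckets(sample_buckets))
```

The sum only covers the date range the histogram was built over, so it matches the search total only when both use the same query and range.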

@philbudne
Contributor Author

Since this seems to be an issue (or at least most easily solved) in news-search-api, I'm moving discussion to mediacloud/news-search-api#26
