Add top_hits aggregation #6124
This looks good to me. Let's add some documentation and tests and I think that will be it.
It's great to have this top_hits aggregation feature, but how can I use it? Should I wait until elasticsearch:master accepts the merge request and then update, or clone your branch, compile it, and import it in Maven? Thanks in advance for your response.
Besides parsing the JSON API response, are there any Java or Scala APIs to retrieve or iterate over bucket aggregation results? For example, under "key": "osx", how do I get "title": "How to Install Google Chrome from the command line", "title": "All Mac OS X apps crash as opened" and "title": "Create a shortcut for application on Google Chrome for MacOSX"? Thanks!
@yao23 the feature, when it gets in, will be in the 1.3 release (we still have a 1.2 release that will hopefully happen soonish). I would not build this now; wait till it gets into master + the 1.x branch, and then, if you are eager to try it out, you can build the 1.x branch once it's in. Regarding the API, there is a full Java client API as part of Elasticsearch; how to access aggregations using it is best asked on the mailing list.
@kimchy I appreciate your immediate response. I will try to build it, use the Java APIs to access the content of the aggregation buckets, and post the results here after experimenting. For others' reference, here is a link about the Java APIs: http://stackoverflow.com/questions/21018493/how-to-access-aggregations-result-with-elasticsearch-java-api-in-searchresponse
@jpountz I added tests and documentation.
* {ref}/search-request-source-filtering.html[Source filtering]
* {ref}/search-request-script-fields.html[Script fields]
* {ref}/search-request-fielddata-fields.html[Fielddata fields]
* {ref}/search-request-version.html[Include versions]
nice :)
@martijnvg this looks great. I left some minor comments about the documentation but other than that I'm good with pushing this change!
/**
 *
 */
@ElasticsearchIntegrationTest.SuiteScopeTest() |
Why is this suite scoped? I don't see where this test modifies the cluster, nor does it need any specific node-level settings.
All agg tests are suite scoped, so that is why I made this test suite scoped as well.
Fail if sub aggs are specified
Updated docs
thanks @martijnvg LGTM
LGTM
…cument being aggregated per bucket. Closes #6124
Is there a master-snapshot version available through Maven? I could start my development with it until 1.3.0 is officially released. Also, what would be a likely release date for 1.3.0?
You can always compile from source. See https://github.com/elasticsearch/elasticsearch/blob/master/README.textile
Shortly before 1.4.0 ;) It'll be released when it is ready.
Thanks for the top_hits feature in 1.3
You should not try to get all documents; this would blow up CPU and memory on your cluster.
My use case is like this. I store product data:
{
  "_id": "product_id",
  "group": "A",
  "page_views": 1000,
  "field_x": "123abc",
  "field_y": "1010zzz"
}
I want to run a terms bucket on "group" and get ALL the products present under each bucket, in descending order of their "page_views". I would use this result for further calculation. Query:
GET /my_idx/my_type/_search
{
  "size": 0,
  "aggs": {
    "product_group": {
      "terms": {
        "field": "group",
        "size": 0
      },
      "aggs": {
        "top_products": {
          "top_hits": {
            "sort": [
              {
                "page_views": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "include": [
                "page_views", "field_x", "field_y"
              ]
            },
            "size": 1000000 //this size is not known
          }
        }
      }
    }
  }
}
Please let me know if there is an alternate way to accomplish this.
The only reasonable way to do it would be to first start a request to compute the top groups and then one request per group (with a filter) using scroll for pagination.
@jpountz thank you. Will try your approach.
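The two-step approach suggested above could be sketched roughly as follows. This is only an illustration, not code from the PR: the function names, the `max_groups` and `page_size` defaults, and the two injected callables `start_scroll` and `next_scroll` (standing in for whatever HTTP or client calls actually issue the initial scrolled search and the follow-up scroll requests) are all hypothetical. The `filtered` query form is assumed to be the 1.x-era syntax.

```python
def top_groups_request(group_field="group", max_groups=10):
    """Step 1: ask only for the group keys (no hits) via a terms aggregation."""
    return {
        "size": 0,
        "aggs": {
            "product_group": {
                "terms": {"field": group_field, "size": max_groups}
            }
        },
    }


def group_page_request(group_value, group_field="group",
                       sort_field="page_views", page_size=500):
    """Step 2: one request per group, restricted by a term filter and sorted."""
    return {
        "size": page_size,
        # Assumption: 1.x-style filtered query wrapping a term filter.
        "query": {"filtered": {"filter": {"term": {group_field: group_value}}}},
        "sort": [{sort_field: {"order": "desc"}}],
    }


def iter_scroll_hits(start_scroll, next_scroll, request):
    """Yield every hit of a scrolled search.

    start_scroll(request) -> first response (hypothetical callable)
    next_scroll(scroll_id) -> next response (hypothetical callable)
    Scrolling stops at the first response that contains no hits.
    """
    response = start_scroll(request)
    while response["hits"]["hits"]:
        for hit in response["hits"]["hits"]:
            yield hit
        response = next_scroll(response["_scroll_id"])
```

Each group is then drained page by page, so memory use on the client is bounded by `page_size` rather than by the size of the largest group.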
The top_hits aggregator keeps track of the most relevant document being aggregated. This aggregator should be used as a sub aggregator of a bucket-based aggregator, so that the top documents per bucket are computed.

This aggregator is very versatile and makes grouping / field collapsing possible. One can group by a field (using a terms aggregator as parent) or by time (using a histogram aggregator as parent); in either case the parent bucket aggregator determines how to group. How accurate the top hits are depends on the parent aggregator. For example, when using the terms aggregator together with the top_hits aggregator, some documents may not end up in the response because the shard_size on the terms aggregator is less than the field's cardinality.

The top_hits aggregator should have the following options:
* size - The number of hits to collect.
* sort - Defines how the top hits should be sorted.

The prototype currently attached to this PR integrates nicely with the fetch phase, which allows all fetch-like features to be implemented easily. It also executes as if the search_type were set to query_and_fetch, so aggregations don't need to execute extra round trips.

Example usage of the current prototype:
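A request of this shape might look like the following minimal sketch, written as a Python function that builds the JSON body. The parent terms field `tags`, the aggregation names `top_tags` and `top_tag_hits`, and the descending sort order are assumptions made for illustration; only the `top_hits` options (`sort`, `_source`, `size`) follow the syntax that appears elsewhere in this thread.

```python
import json


def top_hits_request(group_field="tags", sort_field="last_activity_date",
                     size=3, include=("title",)):
    """Build a search body with top_hits nested under a terms aggregation."""
    return {
        "aggs": {
            "top_tags": {  # hypothetical aggregation name
                "terms": {"field": group_field},  # hypothetical grouping field
                "aggs": {
                    "top_tag_hits": {  # hypothetical aggregation name
                        "top_hits": {
                            # Assumption: newest activity first.
                            "sort": [{sort_field: {"order": "desc"}}],
                            "_source": {"include": list(include)},
                            "size": size,
                        }
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a search request.
    print(json.dumps(top_hits_request(), indent=2))
```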
In this example the hits are sorted by the field last_activity_date and only the top 3 hits are returned. Also, per hit only the title field is included.

Response: