Adds a new auto-interval date histogram #28993
Conversation
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents in buckets at the second interval; when it has created more than the target number of buckets it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It will keep merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support min_doc_count, offset and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

Closes #9572
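For illustration, a request could look something like the sketch below; the `logs` index, `@timestamp` field and `visits_over_time` aggregation name are made-up placeholders, and the target bucket count is assumed to be passed via a `buckets` parameter:

```
POST /logs/_search?size=0
{
  "aggs": {
    "visits_over_time": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 10
      }
    }
  }
}
```

Note that the caller never specifies an interval; the aggregation picks whichever interval keeps the number of returned buckets at or below the target.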
Pinging @elastic/es-search-aggs
This gives us more options at reduce time in terms of how we do the final merge of the buckets to produce the final result.
This reverts commit 993c782.
@jpountz there are tests that still need to be added to this PR, and @pcsanwald is looking into benchmarking the aggregation at the moment to make sure the time performance is ok, but I think it's worth you taking a look now so that if there are any major things to change we can fix them now.
I had a quick look and I think the general approach is good. It would help to better document how the target number of buckets is interpreted and how far the actual number of buckets may deviate from it.
Added some notes in the documentation about the intervals that can be returned. Also added a test case that utilises the merging of consecutive buckets.
This reverts commit d88d764.
* Adds a new auto-interval date histogram

  This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents in buckets at the second interval; when it has created more than the target number of buckets it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It will keep merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time. This aggregation intentionally does not support min_doc_count, offset and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary. Closes elastic#9572
* Adds documentation
* Added sub aggregator test
* Fixes failing docs test
* Brings branch up to date with master changes
* trying to get tests to pass again
* Fixes multiBucketConsumer accounting
* Collects more buckets than needed on shards

  This gives us more options at reduce time in terms of how we do the final merge of the buckets to produce the final result
* Revert "Collects more buckets than needed on shards"

  This reverts commit 993c782.
* Adds ability to merge within a rounding
* Fixes non-timezone doc test failure
* Fix time zone tests
* iterates on tests
* Adds test case and documentation changes

  Added some notes in the documentation about the intervals that can be returned. Also added a test case that utilises the merging of consecutive buckets
* Fixes performance bug

  The bug meant that getAppropriateRounding took a huge amount of time if the range of the data was large but also sparsely populated. In these situations the rounding would be very low, so iterating through the rounding values from the min key to the max key took a long time (~120 seconds in one test). The solution is to add a rough estimate first which chooses the rounding based just on the long values of the min and max keys alone, but selects the rounding one lower than the one it thinks is appropriate so that the accurate method can choose the final rounding taking into account the fact that intervals are not always fixed length. The commit also adds more tests
* Changes to only do complex reduction on final reduce
* merge latest with master
* correct tests and add a new test case for 10k buckets
* refactor to perform bucket number check in innerBuild
* correctly derive bucket setting, update tests to increase bucket threshold
* fix checkstyle
* address code review comments
* add documentation for default buckets
* fix typo
* es/6.x:
  * Use correct formatting for links (#29460)
  * Revert "Adds a new auto-interval date histogram (#28993)"
  * Revert "fix typo"
  * fix typo
  * Adds a new auto-interval date histogram (#28993)
  * [Rollup] Replace RollupIT with a ESRestTestCase version (#31977)
  * [Rollup] Fix duplicate field names in test (#32075)
  * [Tests] Fix failure due to changes exception message (#32036)
  * [Test] Mute MlJobIT#testDeleteJobAfterMissingAliases
  * Replace Ingest ScriptContext with Custom Interface (#32003) (#32060)
  * Cleanup Duplication in `PainlessScriptEngine` (#31991) (#32061)
  * HLRC: Add xpack usage api (#31975)
  * Clean Up Snapshot Create Rest API (#31779)
Apologies for the necropost, but I'm increasingly unsure that I've understood the implications of "Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m", mostly because I can't see where that soft limit value is documented. Also, the agg docs don't make it clear what happens if the requested number of buckets can't be satisfied even at the largest interval.
@mrec The idea of the soft limits is that we have controls that aim to stop a single user performing a request that would destabilise or be otherwise detrimental to the cluster, or would adversely affect other users trying to perform requests on the cluster. Most of these soft limits end up effectively limiting the amount of memory a user's request can use in various ways, but some do other things.

The idea with "Fail the request at parsing time if the number of buckets requested is higher than soft_limit_value / m" is that if, at the point we parse the request, we can already determine that we are going to need to produce more buckets than this particular soft limit allows, there is no point in us executing the request, since during execution it will definitely trip the soft limit and fail. So instead we fail the request at parsing time. That's not to say that the soft limit won't be tripped if the request passes this parse-time check. If the request cannot be satisfied even with 100 year intervals then the request will fail on the soft limit, as will any other aggregation request. I'd like to note two things here though:
@colings86 I think you're confirming my worries - a maximum range of 33,000 years is more than generous, but I think a (default) maximum of around 333 buckets may be tighter than some visualisation use cases need.
I don't agree here. Given that the final interval is out of the user's hands, I don't think this aggregation is useful for analytics jobs which are consuming a lot of buckets (since they would almost certainly want to determine the interval of the buckets themselves) but rather for visualising data to the user (e.g. on a bar chart). Given this, I find it hard to believe that a visualisation could provide the information in a way that a user could consume if it's rendering 100s of buckets. Is your use case intending to present the output of this aggregation directly to a user in a visualisation?
We do document the soft limit here. Note that the documentation says it's disabled by default at the moment, but it will be set to a default of 10,000 in 7.0. The optimisation we added in this aggregation doesn't change the behaviour or lower the limit; it just catches the violation of the soft limit earlier, before we start executing the request. As I mentioned before, we feel that the soft limits are important tools for cluster admins to have confidence that a single request is not going to adversely affect the cluster or other users' requests. The limits can always be adjusted if the admin determines it's appropriate to increase them, but then the decision can be made with the knowledge that the cluster is safe to handle these more expensive requests.
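For anyone landing here later: the bucket soft limit is a dynamic cluster setting, so an admin who accepts the trade-off can raise it along these lines (the setting name `search.max_buckets` and the value 20000 are my assumptions, not something stated in this thread):

```
PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000
  }
}
```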
Sorry, I think that came across more argumentative than intended. Our main expected use cases are around visualizations for screening, yes, and examples I've seen of "the sort of thing we want" include pixel-per-bar charts around 400px wide. Whether that resolution is actually needed is open to debate; I'll talk to our PM. I do think it'd be worth calling out the 333 limit explicitly in the agg docs - the connection between that and the documented 10k limit is not obvious, and most users won't be reading this PR thread.
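To spell out that connection (a sketch based only on the numbers in this thread: the 10,000 soft limit, a largest interval of 100 years, and an inferred divisor m of roughly 30 in the parse-time check quoted above):

$$
\frac{10{,}000}{m} \approx \frac{10{,}000}{30} \approx 333 \ \text{buckets}, \qquad 333 \times 100\ \text{years} \approx 33{,}000\ \text{years}
$$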
No offence taken at all 😄
You have convinced me on this. I think we should point this out in the documentation. Thanks for persevering with me.
…arch#28993 (#3521) (cherry picked from commit b4e8bd3)
This change adds a new type of histogram aggregation called `auto_date_histogram` where you can specify the target number of buckets you require and it will find an appropriate interval for the returned buckets. The aggregation works by first collecting documents in buckets at the second interval; when it has created more than the target number of buckets it merges these buckets into minute-interval buckets and continues collecting until it reaches the target number of buckets again. It will keep merging buckets whenever it exceeds the target until either collection is finished or the highest interval (currently years) is reached. A similar process happens at reduce time.

This aggregation intentionally does not support min_doc_count, offset and extended_bounds to keep the already complex logic from becoming more complex. The aggregation accepts sub-aggregations but will always operate in `breadth_first` mode, deferring the computation of sub-aggregations until the final buckets from the shard are known. min_doc_count is effectively hard-coded to zero, meaning that we will insert empty buckets where necessary.

This is also the first aggregation to merge buckets at collection time, but having this aggregation will open the door to other such aggregations that merge buckets at collection time.

Things still to do:

* Collect `10 * target` buckets on the shards and then use the extra buckets to create buckets representing multiples of the interval in the cases where we have too many buckets for one interval but, if we increase the interval, we get only 1 (or a few) bucket(s). E.g. if the target is 10 buckets and we end up with 30 x minute-level buckets, increasing the rounding would give 1 x hour bucket, so instead can we merge every 3 minute-level buckets to get 10 x 3-minute buckets? This will probably be moved out into a separate issue to be tackled when this first pass is merged.
* An `auto_histogram` aggregation that works on numeric (rather than date) fields - this should be moved into a separate issue.

Closes #9572
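For completeness, a sketch of the kind of response this produces. The bucket values are invented and the exact field names are my assumption; the point from the description above is that the interval the aggregation settled on is reported back alongside the buckets, including empty ones, since min_doc_count is effectively zero.

```
{
  "aggregations": {
    "visits_over_time": {
      "buckets": [
        { "key_as_string": "2018-01-01T00:00:00.000Z", "key": 1514764800000, "doc_count": 5 },
        { "key_as_string": "2018-02-01T00:00:00.000Z", "key": 1517443200000, "doc_count": 0 },
        { "key_as_string": "2018-03-01T00:00:00.000Z", "key": 1519862400000, "doc_count": 2 }
      ],
      "interval": "1M"
    }
  }
}
```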