Add a new auto_histogram aggregation for numeric fields #31828
Comments
Pinging @elastic/es-search-aggs |
Here's a suggestion, what do you think? It's easier for the user if buckets are round numbers, eg:
is easier to parse than:
So I propose:
Examples:
|
@melissachang what you are describing about avoiding "non-round" intervals is actually what we intend to do and what is already done for the date version of this aggregation in #28993. The difference is that the intervals are always selected to be "round". This means that an application can, for example, provide the maximum number of bars that it has space to render on a bar chart, and the aggregation will return buckets with an interval that is easy for humans to parse but that does not exceed the number of buckets the application is able to render. |
I see, thanks. The documentation for #28993 says "The buckets field is optional, and will default to 10 buckets if not specified.". Does "rounding" occur in this case? For me, ideal is:
|
Yes, rounding also occurs for the default value, so you may get fewer than 10 buckets returned by default, but never more than 10. |
Thanks. Is there an ETA on when this will be implemented? I would like to start using it. :) |
We don't have an ETA for this, sorry, but it's great that you are excited about this feature. You can track this issue for progress on the feature; once a PR is available, it will show the intended version(s) (note that the intended versions may change until the PR is merged). |
Thanks. If I could test the future PR before it's merged, that would be great. I'd like to see how it looks on my data. |
Presumably this would also follow One (very) edge case worth considering is what happens when Finally, could you confirm that your rounding logic will always round to 1x, 2x or 5x a power of ten? I'm building a manual equivalent of this feature at the moment, using a separate calibration request, and am trying to design it in such a way that we could swap it out for |
Hi @mrec, thanks for your comments! We'll follow the same policy of erroring if number of buckets required exceeds the soft limit, but unfortunately, the logic is slightly more complex than throwing an error if buckets exceed 10k; the logic is here. The reason for checking and throwing this error is that our rounding could otherwise trip the soft limit for max number of buckets. So, the logic is actually to error if Currently, we don't error if On rounding logic: We round in milliseconds and the current finest grained rounding interval is |
Hi @pcsanwald,
That's fine; I understand the rationale and was already allowing for rounding in my calibrated implementation.
Erm... I think you're getting mixed up between |
@mrec my mistake, sorry. Your suggestion seems sensible and I can't currently think of a reason we wouldn't round to those intervals, at this point. Since I'm planning to start on this work soon, I'll keep the issue updated and I'd be happy to discuss any cases that might cause us not to round in this manner. |
For numeric auto_histogram I was thinking of rounding to 1x, 2x, 5x powers of ten, but this isn't set in stone at this stage. We need to begin implementing and then we can assess whether those multiples are appropriate and give enough options within a power of ten. Hopefully what we decide and what you decide will end up the same and you can swap your implementation out transparently, but we can't guarantee it at this stage. |
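To make the 1x, 2x, 5x idea discussed above concrete, here is a minimal Python sketch. It is not the Elasticsearch implementation (which, as noted, was not settled at this point); the function name `round_interval` and its parameters are hypothetical. It picks the smallest interval of the form 1, 2 or 5 times a power of ten such that the data range fits in at most `max_buckets` buckets aligned to multiples of that interval.

```python
import math


def round_interval(lo, hi, max_buckets, steps=(1, 2, 5)):
    """Return the smallest interval step * 10**k (step in `steps`) such that
    buckets aligned to multiples of the interval cover [lo, hi] using at most
    `max_buckets` buckets. Hypothetical helper, for illustration only."""
    if max_buckets < 1:
        raise ValueError("max_buckets must be at least 1")
    if hi <= lo:
        # Degenerate range: every value lands in a single bucket anyway.
        return 1.0

    # Any interval smaller than range / max_buckets cannot fit, so start
    # the search at the power of ten at or just below that lower bound.
    exp = math.floor(math.log10((hi - lo) / max_buckets))
    while True:
        for step in steps:
            interval = step * 10.0 ** exp
            first = math.floor(lo / interval)  # index of bucket holding lo
            last = math.floor(hi / interval)   # index of bucket holding hi
            if last - first + 1 <= max_buckets:
                return interval
        exp += 1


def bucket_count(lo, hi, interval):
    """Number of interval-aligned buckets needed to cover [lo, hi]."""
    return math.floor(hi / interval) - math.floor(lo / interval) + 1
```

For values spanning 3 to 1234 with a target of 10 buckets, this sketch settles on an interval of 200, which needs only 7 buckets. This mirrors the behaviour described earlier in the thread: rounding can return fewer buckets than requested, but never more.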
@colings86 - understood. I'm actively expecting ES upgrades to change our bucketing, for the better; at the moment we're stuck way back on 2.3.2 where |
Sure, I understand that aim and I think it's a good one. We'll be sure to update this issue when something is settled on. 😄 |
Thinking about this a bit more since yesterday. @pcsanwald : seeing what Semi-offtopic: extreme outliers aren't the only cause of the "won't fit" problem, but in our usage of |
It seems like this ticket would focus on picking buckets automatically but making them of equal width? If so, it would not satisfy the main use cases of elastic/kibana#3905 and elastic/kibana#3757, which were asking for quantile / percentile based ranges on the X-axis, right? Said differently, the ranges should not be equal (e.g. 1-9, 10-499, 500-10000) but should instead each contain an equivalent count of entries. This is really powerful, as it lets you graph some metric for your bottom 25% of users vs your top 25% of users. In the typical performance histogram case, there is a huge long tail of results, and simple auto-bucketing would just cram 99% of results into a single bucket. As I understand @jpountz, the only way to do this today would be the two-query approach? |
Correct. There is a related open PR (from a community member) to add variable-width histograms: #42035. But I don't think that would satisfy the requirement if you need exact control over the bucket widths. E.g. #42035 will still choose its own bucket widths as it clusters/merges buckets together. On the other hand, if you are going after the CDF in particular, I'd probably just use the percentiles aggregation and specify a bunch of percentiles to retrieve (0-100 in increments of 1 or 0.1 or whatever). Calculating those is essentially free since they are all being generated from the same data sketch. You could do the same with a percentile_rank to get the actual values on the x-axis, although it'd require a Edit: Oh, never mind, you want buckets so that you can run sub-aggs, right? Yeah, that would require two passes if you need to specify histogram widths, as for percentiles/quantiles. Theoretically we could do it in a single-pass approximate manner (by interrogating the sketch at runtime as it accumulates data)... but I think the accuracy of that would be awful. |
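The two-pass approach mentioned above can be sketched roughly as follows, in Python. This only builds the two request bodies as plain dicts; it does not call a client or parse a response. The helper names, the aggregation names (`edges`, `quantile_buckets`) and the example field `price` are assumptions made for illustration. The first pass asks the `percentiles` aggregation for evenly spaced percentiles; the second pass feeds the returned values into a `range` aggregation, under which sub-aggregations can be nested.

```python
def percentiles_request(field, n_buckets):
    """Pass 1: ask for n_buckets + 1 evenly spaced percentiles (0..100),
    which will serve as the bucket edges. Hypothetical helper."""
    percents = [100.0 * i / n_buckets for i in range(n_buckets + 1)]
    return {
        "size": 0,
        "aggs": {
            "edges": {"percentiles": {"field": field, "percents": percents}}
        },
    }


def range_request(field, edges, sub_aggs=None):
    """Pass 2: turn the percentile values from pass 1 into explicit ranges.
    `edges` is the list of values read from the pass-1 response."""
    ordered = sorted(set(edges))  # drop duplicate edges from skewed data
    ranges = [{"from": a, "to": b} for a, b in zip(ordered, ordered[1:])]
    if ranges:
        # `to` is exclusive in a range aggregation, so leave the last
        # bucket open-ended to keep the maximum value inside it.
        del ranges[-1]["to"]
    agg = {"range": {"field": field, "ranges": ranges}}
    if sub_aggs:
        agg["aggs"] = sub_aggs  # e.g. an avg metric per quantile bucket
    return {"size": 0, "aggs": {"quantile_buckets": agg}}


# Example with made-up edge values standing in for a pass-1 response.
first = percentiles_request("price", 4)
second = range_request("price", [1, 10, 500, 10000])
```

Because the edges come from an approximate sketch, the resulting buckets will only hold roughly equal counts, and the data may change between the two requests.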
any progress? |
Any news on this one? As already mentioned, it would be really handy for commerce applications (price ranges). |
Isn't this available now as Variable width histogram aggregation? |
Per #28993, we also want to add support for an auto_histogram on numeric fields. We agreed that it should be a separate issue so that we would not block #28993 on implementation.