-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
time_zone
option makes date histograms much slower
#28727
Comments
Joda does really quite a lot of work in trying to find the previous time zone transition, which I'd guess to be the expensive bit (and it appears on the stack trace in the investigation you were working on):
Although I'm sure it'd be possible to put a cache around Joda, given that we're working on #27330 I think it would be a good idea to postpone any in-depth work on this until we can see what the effects of that would be. |
This is good to know, thanks for sharing. I suspect that the |
Sounds like you're volunteering! |
Did something break/changed between 5.X and 6.X ? I run three clusters in 5.1, 5.6.8 and 6.1 and DateHistogram aggregations are basically unusable for me with the 6.X cluster. When I put the same data in 5.X clusters, I don't have any performance issues. |
I don't think this is the right place to comment about this. If you can make a bash script that reproduces the issue against a clean cluster I'd file it as a separate issue. If you can't I'd take it to http://discuss.elastic.co/ . |
Since this pops up rather often, I wanted to add another possible performance improvement suggestion. Before starting aggregating the date histogram, we could actually check if the query had an overall date range filter applied. We could than check on the start and end date in that date range. If both of these dates are within the same daylight saving time period, we could actually use the offset this time zone had during that period as an absolute fixed time zone (e.g. if I am just doing a date histogram with the timezone This would of course not solve the performance issue, when doing a date histogram over a period of time, that contained a DST switch, but already would improve for a lot of users, usually looking at smaller date ranges. See also elastic/kibana#18853 for a detailed meta issue on the Kibana side. |
I agree this is a good idea. Unfortunately today queries and aggregations are kept completely unaware of each other so this would be hard to implement without adding unwanted dependencies. Something less efficient than your proposal but that should cover a number of cases already would be to look at the min/max values that exist within the current shard and apply the optimization that you describe if all times within this interval have the same offset. With eg. daily indices, this optimization would still apply in most cases. |
Date histograms on non-fixed timezones such as `Europe/Paris` proved much slower than histograms on fixed timezones in elastic#28727. This change mitigates the issue by using a fixed time zone instead when shard data doesn't cross a transition so that all timestamps share the same fixed offset. This should be a common case with daily indices. NOTE: Rewriting the aggregation doesn't work since the timezone is then also used on the coordinating node to create empty buckets, which might be out of the range of data that exists on the shard. NOTE: In order to be able to get a shard context in the tests, I reused code from the base query test case by creating a new parent test case for both queries and aggregations: `AbstractBuilderTestCase`. Mitigates elastic#28727
Date histograms on non-fixed timezones such as `Europe/Paris` proved much slower than histograms on fixed timezones in #28727. This change mitigates the issue by using a fixed time zone instead when shard data doesn't cross a transition so that all timestamps share the same fixed offset. This should be a common case with daily indices. NOTE: Rewriting the aggregation doesn't work since the timezone is then also used on the coordinating node to create empty buckets, which might be out of the range of data that exists on the shard. NOTE: In order to be able to get a shard context in the tests, I reused code from the base query test case by creating a new parent test case for both queries and aggregations: `AbstractBuilderTestCase`. Mitigates #28727
Date histograms on non-fixed timezones such as `Europe/Paris` proved much slower than histograms on fixed timezones in #28727. This change mitigates the issue by using a fixed time zone instead when shard data doesn't cross a transition so that all timestamps share the same fixed offset. This should be a common case with daily indices. NOTE: Rewriting the aggregation doesn't work since the timezone is then also used on the coordinating node to create empty buckets, which might be out of the range of data that exists on the shard. NOTE: In order to be able to get a shard context in the tests, I reused code from the base query test case by creating a new parent test case for both queries and aggregations: `AbstractBuilderTestCase`. Mitigates #28727
…30534) Date histograms on non-fixed timezones such as `Europe/Paris` proved much slower than histograms on fixed timezones in elastic#28727. This change mitigates the issue by using a fixed time zone instead when shard data doesn't cross a transition so that all timestamps share the same fixed offset. This should be a common case with daily indices. NOTE: Rewriting the aggregation doesn't work since the timezone is then also used on the coordinating node to create empty buckets, which might be out of the range of data that exists on the shard. NOTE: In order to be able to get a shard context in the tests, I reused code from the base query test case by creating a new parent test case for both queries and aggregations: `AbstractBuilderTestCase`. Mitigates elastic#28727
It is partial, but I think it's good enough to close this issue. Thanks for the ping. |
I was looking at a slow query where removing the
timezone
option made the query 4x faster: 8s on average without thetime_zone
parameter and 32s on average with atime_zone
.This query filters one week of data in February with
Europe/Berlin
as a timezone (so all documents are on the same side of the daylight saving time boundary) and there are more than 1B matches.Can we speed this up?
For the record this is less of an issue for timezones that do not implement daylight savings, so users might want to consider switching to
Etc/GMT-1
instead ofEurope/Berlin
if that works for them.cc @elastic/es-search-aggs
The text was updated successfully, but these errors were encountered: