-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize date_histograms across daylight savings time #55559
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
Thanks for the reviews @danielmitterdorfer and @not-napoleon! With @DaveCTurner's faith in the tests I'm going to consider this one approved. The tests for this stuff really are quite broad. I will make the final changes to this to prepare it for merging and inclusion into 7.9.0. |
Branch cut day is always rough. |
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
…astic#55559) Rounding dates on a shard that contains a daylight savings time transition is currently something like 1400% slower than when a shard contains dates only on one side of the DST transition. And it makes a ton of short lived garbage. This replaces that implementation with one that benchmarks to having around 30% overhead instead of the 1400%. And it doesn't generate any garbage per search hit. Some background: There are two ways to round in ES: * Round to the nearest time unit (Day/Hour/Week/Month/etc) * Round to the nearest time *interval* (3 days/2 weeks/etc) I'm only optimizing the first one in this change and plan to do the second in a follow up. It turns out that rounding to the nearest unit really *is* two problems: when the unit rounds to midnight (day/week/month/year) and when it doesn't (hour/minute/second). Rounding to midnight is consistently about 25% faster and rounding to individual hour or minutes. This optimization relies on being able to *usually* figure out what the minimum and maximum dates are on the shard. This is similar to an existing optimization where we rewrite time zones that aren't fixed (think America/New_York and its daylight savings time transitions) into fixed time zones so long as there isn't a daylight savings time transition on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement time interval rounding the time zone rewriting optimization *should* no longer be needed. This optimization doesn't come into play for `composite` or `auto_date_histogram` aggs because neither have been migrated to the new `DATE` `ValuesSourceType` which is where that range lookup happens. When they are they will be able to pick up the optimization without much work. I expect this to be substantial for `auto_date_histogram` but less so for `composite` because it deals with fewer values. Note: My 30% overhead figure comes from small numbers of daylight savings time transitions. That overhead gets higher when there are more transitions in logarithmic fashion. When there are two thousand years worth of transitions my algorithm ends up being 250% slower than rounding without a time zone, but java time is 47000% slower at that point, allocating memory as fast as it possibly can.
nik9000
added a commit
that referenced
this pull request
May 7, 2020
…5559) (#56334) Rounding dates on a shard that contains a daylight savings time transition is currently something like 1400% slower than when a shard contains dates only on one side of the DST transition. And it makes a ton of short lived garbage. This replaces that implementation with one that benchmarks to having around 30% overhead instead of the 1400%. And it doesn't generate any garbage per search hit. Some background: There are two ways to round in ES: * Round to the nearest time unit (Day/Hour/Week/Month/etc) * Round to the nearest time *interval* (3 days/2 weeks/etc) I'm only optimizing the first one in this change and plan to do the second in a follow up. It turns out that rounding to the nearest unit really *is* two problems: when the unit rounds to midnight (day/week/month/year) and when it doesn't (hour/minute/second). Rounding to midnight is consistently about 25% faster and rounding to individual hour or minutes. This optimization relies on being able to *usually* figure out what the minimum and maximum dates are on the shard. This is similar to an existing optimization where we rewrite time zones that aren't fixed (think America/New_York and its daylight savings time transitions) into fixed time zones so long as there isn't a daylight savings time transition on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement time interval rounding the time zone rewriting optimization *should* no longer be needed. This optimization doesn't come into play for `composite` or `auto_date_histogram` aggs because neither have been migrated to the new `DATE` `ValuesSourceType` which is where that range lookup happens. When they are they will be able to pick up the optimization without much work. I expect this to be substantial for `auto_date_histogram` but less so for `composite` because it deals with fewer values. Note: My 30% overhead figure comes from small numbers of daylight savings time transitions. That overhead gets higher when there are more transitions in logarithmic fashion. When there are two thousand years worth of transitions my algorithm ends up being 250% slower than rounding without a time zone, but java time is 47000% slower at that point, allocating memory as fast as it possibly can.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In elastic#55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
This wires `auto_date_histogram` into the rounding optimization that I built in elastic#55559. This is should significantly speed up any `auto_date_histogram`s with `time_zone`s on them.
nik9000
added a commit
that referenced
this pull request
May 7, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In #55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 7, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In elastic#55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
that referenced
this pull request
May 8, 2020
When an index spans a daylight savings time transition we can't use our optimization that rewrites the requested time zone to a fixed time zone and instead we used to fall back to a java.util.time based rounding implementation. In #55559 we optimized "time unit" rounding. This optimizes "time interval" rounding. The java.util.time based implementation is about 1650% slower than the rounding implementation for a fixed time zone. This replaces it with a similar optimization that is only about 30% slower than the fixed time zone. The java.util.time implementation allocates a ton of short lived objects but the optimized implementation doesn't. So it *might* end up being faster than the microbenchmarks imply.
nik9000
added a commit
that referenced
this pull request
May 9, 2020
This wires `auto_date_histogram` into the rounding optimization that I built in #55559. This is should significantly speed up any `auto_date_histogram`s with `time_zone`s on them.
nik9000
added a commit
to nik9000/elasticsearch
that referenced
this pull request
May 9, 2020
This wires `auto_date_histogram` into the rounding optimization that I built in elastic#55559. This is should significantly speed up any `auto_date_histogram`s with `time_zone`s on them.
nik9000
added a commit
that referenced
this pull request
May 9, 2020
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
:Analytics/Aggregations
Aggregations
>enhancement
release highlight
Team:Analytics
Meta label for analytical engine team (ESQL/Aggs/Geo)
v7.9.0
v8.0.0-alpha1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Rounding dates on a shard that contains a daylight savings time transition
is currently something like 1400% slower than when a shard contains dates
only on one side of the DST transition. And it makes a ton of short lived
garbage. This replaces that implementation with one that benchmarks to
having around 30% overhead instead of the 1400%. And it doesn't generate
any garbage per search hit.
Some background:
There are two ways to round in ES:
I'm only optimizing the first one in this change and plan to do the second
in a follow up. It turns out that rounding to the nearest unit really is
two problems: when the unit rounds to midnight (day/week/month/year) and
when it doesn't (hour/minute/second). Rounding to midnight is consistently
about 25% faster and rounding to individual hour or minutes.
This optimization relies on being able to usually figure out what the
minimum and maximum dates are on the shard. This is similar to an existing
optimization where we rewrite time zones that aren't fixed
(think America/New_York and its daylight savings time transitions) into
fixed time zones so long as there isn't a daylight savings time transition
on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement
time interval rounding the time zone rewriting optimization should no
longer be needed.
This optimization doesn't come into play for
composite
orauto_date_histogram
aggs because neither have been migrated to the newDATE
ValuesSourceType
which is where that range lookup happens. Whenthey are they will be able to pick up the optimization without much work.
I expect this to be substantial for
auto_date_histogram
but less so forcomposite
because it deals with fewer values.Note: My 30% overhead figure comes from small numbers of daylight savings
time transitions. That overhead gets higher when there are more
transitions in logarithmic fashion. When there are two thousand years
worth of transitions my algorithm ends up being 250% slower than rounding
without a time zone, but java time is 47000% slower at that point,
allocating memory as fast as it possibly can.