-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg #29594
Conversation
This pipeline aggregation gives the user the ability to script functions that "move" across a window of data, instead of single data points. It is the scripted version of MovingAvg pipeline agg. Through custom script contexts, we expose a number of convenience methods: - windowMax - windowMin - windowSum - simpleMovAvg - linearMovAvg - ewmaMovAvg - holtMovAvg - holtWintersMovAvg The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
No need to c/p the code, we can basically share the movavg models with an easy path towards deprecation in the future.
Pinging @elastic/es-search-aggs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@polyfractal I took a first pass on this and left some comments but I like the approach
moving_fn: | ||
buckets_path: "the_avg" | ||
window: 3 | ||
script: "MovingFunctions.windowMax(values)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should test that the other functions are exposed correctly too?
@@ -79,7 +79,7 @@ public static GapPolicy parse(String text, XContentLocation tokenLocation) { | |||
for (GapPolicy policy : values()) { | |||
validNames.add(policy.getName()); | |||
} | |||
throw new ParsingException(tokenLocation, "Invalid gap policy: [" + text + "], accepted values: " + validNames); | |||
throw new IllegalArgumentException("Invalid gap policy: [" + text + "], accepted values: " + validNames); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we removing the token location here, it seems like useful information to return to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was incompatible with the new parser style (I didnt think you could get token info with the new way of writing parsers), but while writing this comment I noticed how BucketSort did it:
PARSER.declareField(BucketSortPipelineAggregationBuilder::gapPolicy, p -> {
if (p.currentToken() == XContentParser.Token.VALUE_STRING) {
return GapPolicy.parse(p.text().toLowerCase(Locale.ROOT), p.getTokenLocation());
}
throw new IllegalArgumentException("Unsupported token [" + p.currentToken() + "]");
}
which I somehow missed before. I'll put it back to the way it was and using the above snippet :)
/** | ||
* Find the maximum value in a window of values | ||
*/ | ||
public static double windowMax(Collection<Double> values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the JavaDocs can we qualify the behaviour if there are no values for all these functions? e.g. for this one it will return Double.NaN
*/ | ||
public static double linearMovAvg(Collection<Double> values) { | ||
double avg = 0; | ||
long totalWeight = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it correct that the total weight starts at 1 here? This means that if there are no values the function will return 0
which seems misleading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. OTOH, that's how the linear avg works today so it'd be "breaking" the current behavior (these functions were basically lifted directly from existing moving model code)
I think we should probably change it anyway, but wanted to raise that point.
avg = (v * alpha) + (avg * (1 - alpha)); | ||
} | ||
} | ||
return avg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function will also return 0
if there are no values, is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto above
Review comments addressed, and added documentation + deprecation notices + more tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, mostly about documentation
etc. | ||
|
||
This is conceptually very similar to the <<search-aggregations-pipeline-movavg-aggregation, Moving Average>> pipeline aggregation, except | ||
it provides more functionality and better flexibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more functionality and better flexibility
-> I think we can just say more functionality
here rather than repeating?
|`buckets_path` |Path to the metric of interest (see <<buckets-path-syntax, `buckets_path` Syntax>> for more details |Required | | ||
|`window` |The size of window to "slide" across the histogram. |Optional |`5` | ||
|`script` |The script that should be executed on each window of data |Required | | ||
|`params` |User-defined extra parameters to be injected into the script |Optional | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed as the script
field itself can handle params?
// TEST[setup:sales] | ||
|
||
<1> A `date_histogram` named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals | ||
<2> A `sum` metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should say this could be any numeric metric as I don't think you could point a
moving_fn` at, for example, a geo_centroid agg?
<3> Finally, we specify a `moving_fn` aggregation which uses "the_sum" metric as its input. | ||
|
||
Moving averages are built by first specifying a `histogram` or `date_histogram` over a field. You can then optionally | ||
add normal metrics, such as a `sum`, inside of that histogram. Finally, the `moving_fn` is embedded inside the histogram. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
normal -> numeric ?
- `linearMovAvg()` | ||
- `ewmaMovAvg()` | ||
- `holtMovAvg()` | ||
- `holtWintersMovAvg()` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be useful to add windowStdDev()
to this list too as I have seen some requests for moving standard deviations too. However I imagine given that it would be used together with moving average, it would need the simple, linear, ewma, holt and holtWinters forms to make sense with moving average, so maybe its too complex to add it in this PR?
@@ -84,7 +84,11 @@ public MovAvgModel clone() { | |||
|
|||
@Override | |||
public double next(Collection<Double> values) { | |||
return MovingFunctions.simpleMovAvg(values); | |||
double avg = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this sum
as its not actually the avg
? 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
durp. :)
@@ -86,7 +86,16 @@ public MovAvgModel clone() { | |||
|
|||
@Override | |||
public double next(Collection<Double> values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment here explaining why this is duplicating functionality provided by MovingFunctions
? I think someone could try to deduplicate the code without this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, actually, this should be pulled back out. This was added when I thought it was a breaking change but it's not actually (since we don't allow null/NaN as a value because of gap_policy
) in regular mov avg.
Thanks for noticing, I'll revert this to pointing at the static MovingFunctions version.
*/ | ||
public static double windowSum(Collection<Double> values) { | ||
return values.stream().mapToDouble(Double::doubleValue).sum(); | ||
if (values.size() == 0) { | ||
return 0.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs change above it says:
If the window is empty, or all values are
null
/NaN
,0.0
is returned as the result.
This seems to not be the case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
? If the size is zero, it returns 0.0
. And in the mapToDouble()
, we only use the value if it's non-null and non-NaN (otherwise we return 0.0 for that position).
There's a test to verify this: https://github.com/elastic/elasticsearch/pull/29594/files/83778999a9aec1e0170efba0d4857100a927a770..c93d065b6ffd4e37089b4e251ad784380b75a942#diff-a35897b4e9c0a693a4515cab553a6427R73
private final String bucketsPathString; | ||
private String format = null; | ||
private GapPolicy gapPolicy = GapPolicy.SKIP; | ||
private int window; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is optional we should set the default here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc error. Window isn't optional, I'll make the fix in the documentation.
}; | ||
|
||
|
||
public MovFnPipelineAggregationBuilder(String name, String bucketsPath, Script script, int window) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the window
is optional we should not provide it in the constructor here but rather with a setter method.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script part looks pretty good. I left a couple suggestions.
* This class provides a custom script context for the Moving Function pipeline aggregation, | ||
* so that we can expose a number of pre-baked moving functions like min, max, movavg, etc | ||
*/ | ||
public abstract class MovFnScript { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the name not be so cryptic? MovingFunctionScript
?
* If all values are missing/null/NaN, the return value will be NaN. | ||
* The average is based on the count of non-null, non-NaN values. | ||
*/ | ||
public static double simpleMovAvg(Collection<Double> values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just call this unweightedAvg
? Is the MovAvg
suffix really necessary given this is within MovingFunctions
? Ditto for the methods below.
* Find the maximum value in a window of values. | ||
* If all values are missing/null/NaN, the return value will be NaN | ||
*/ | ||
public static double windowMax(Collection<Double> values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of using DoubleStream as the type here? I think it could save boxing in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was ++ on this and got it all coded up, but realized this makes certain things impossible because streams can only be used once.
The standard deviation for example just accepts an avg and the values:
MovingFunctions.windowStdDev(values, MovingFunctions.simpleMovAvg(values))
Which won't work because the stream will get used up in the movavg, and throw an exception when we try to iterate on it again :(
More cleanup:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new names! I left a couple more suggestions.
} | ||
|
||
public static final String[] PARAMETERS = new String[] {"params", "values"}; | ||
public static final ScriptContext<Factory> CONTEXT = new ScriptContext<>("movfn", Factory.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be "moving-function" to parallel the class name? While this is unseen for users most of the time, it can be used when putting a stored script to compile against that context.
* Find the maximum value in a window of values. | ||
* If all values are missing/null/NaN, the return value will be NaN | ||
*/ | ||
public static double max(Collection<Double> values) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 concerns still:
- Painless doesn't know anything about generics. So to a script, this means passing in any Collection (that is all painless will check for). What about
double[]
? - Collection is unordered in java, but the order matters (I think?) for some of these calculations (not max, but others). Using
double[]
would solve this, but if we must stick with boxed types, at least useList<Double>
?
Note that doc['fieldname']
returns an object that implements List, but we could make this easier by adding an asArray()
method to DoubleScriptDocValues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think double[]
should be doable. 👍
I didn't do that originally to avoid unboxing the internal EvictingQueue
into an array since it wouldn't be necessary most of the time. And I avoided specifying a Queue
as the type since the user might want to use these functions outside of the values we provide (or in addition to).
But I didn't think about the unordered nature of Collection
, that's definitely an issue. Lemme poke at it tomorrow with double[]
, I think that'll work fine.
Sidebar
Alternatively, I could replace the evicting queue with a primitive-based ring buffer. I don't think that's any better though. We'd still have to arrayCopy the relevant slice into the script since there are null/NaNs to deal with, and we probably shouldn't make the method signature some weird ring buffer class :)
Note that doc['fieldname'] returns an object that implements List, but we could make this easier by adding an asArray() method to DoubleScriptDocValues?
Ah interesting. I'm not sure it's needed... do we think users will plop a field's value into the functions instead of the values? I was thinking users might tweak the values before a function, or write their own based on the values... but not use an entire field in place of the provided values. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forget what I said about ScriptDocValues. I was thinking this was like scripted metric agg, which uses a SearchScript for the map script, but you did this correctly (no getDoc()). So no need to do anything with the doc values accessors.
@rjernst swapped over to |
These will be bumped to `6.3.99` after the feature has been backported to 6.4
@colings86 @rjernst did either of you want another review pass on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM too
Test was previously muted which hid the failure. The holt model in the test needed updating like the real model (was using incorrect starting count)
…ngs-to-true * elastic/master: (34 commits) Test: increase search logging for LicensingTests Adjust serialization version in IndicesOptions [TEST] Fix compilation Remove version argument in RangeFieldType (elastic#30411) Remove unused DirectoryUtils class. (elastic#30582) Mitigate date histogram slowdowns with non-fixed timezones. (elastic#30534) Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg (elastic#29594) Removes AwaitsFix on IndicesOptionsTests Template upgrades should happen in a system context (elastic#30621) Fix bug in BucketMetrics path traversal (elastic#30632) Fixes IndiceOptionsTests to serialise correctly (elastic#30644) S3 repo plugin populate SettingsFilter (elastic#30652) mute IndicesOptionsTests.testSerialization Rest High Level client: Add List Tasks (elastic#29546) Refactors ClientHelper to combine header logic (elastic#30620) [ML] Wait for ML indices in rolling upgrade tests (elastic#30615) Watcher: Ensure secrets integration tests also run triggered watch (elastic#30478) Move allocation awareness attributes to list setting (elastic#30626) [Docs] Update code snippet in has-child-query.asciidoc (elastic#30510) Replace custom reloadable Key/TrustManager (elastic#30509) ...
* es/master: (74 commits) Preserve REST client auth despite 401 response (#30558) [test] packaging: add windows boxes (#30402) Make xpack modules instead of a meta plugin (#30589) Mute ShrinkIndexIT [ML] DeleteExpiredDataAction should use client with origin (#30646) Reindex: Fixed typo in assertion failure message (#30619) [DOCS] Fixes list of unconverted snippets in build.gradle [DOCS] Reorganizes RBAC documentation SQL: Remove dependency for server's version from JDBC driver (#30631) Test: increase search logging for LicensingTests Adjust serialization version in IndicesOptions [TEST] Fix compilation Remove version argument in RangeFieldType (#30411) Remove unused DirectoryUtils class. (#30582) Mitigate date histogram slowdowns with non-fixed timezones. (#30534) Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg (#29594) Removes AwaitsFix on IndicesOptionsTests Template upgrades should happen in a system context (#30621) Fix bug in BucketMetrics path traversal (#30632) Fixes IndiceOptionsTests to serialise correctly (#30644) ...
* es/ccr: (75 commits) Preserve REST client auth despite 401 response (elastic#30558) [test] packaging: add windows boxes (elastic#30402) Make xpack modules instead of a meta plugin (elastic#30589) Mute ShrinkIndexIT [ML] DeleteExpiredDataAction should use client with origin (elastic#30646) Reindex: Fixed typo in assertion failure message (elastic#30619) [DOCS] Fixes list of unconverted snippets in build.gradle [DOCS] Reorganizes RBAC documentation SQL: Remove dependency for server's version from JDBC driver (elastic#30631) Test: increase search logging for LicensingTests Adjust serialization version in IndicesOptions [TEST] Fix compilation Remove version argument in RangeFieldType (elastic#30411) Remove unused DirectoryUtils class. (elastic#30582) Mitigate date histogram slowdowns with non-fixed timezones. (elastic#30534) Add a MovingFunction pipeline aggregation, deprecate MovingAvg agg (elastic#29594) Removes AwaitsFix on IndicesOptionsTests Template upgrades should happen in a system context (elastic#30621) Fix bug in BucketMetrics path traversal (elastic#30632) Fixes IndiceOptionsTests to serialise correctly (elastic#30644) ...
…lastic#29594) This pipeline aggregation gives the user the ability to script functions that "move" across a window of data, instead of single data points. It is the scripted version of MovingAvg pipeline agg. Through custom script contexts, we expose a number of convenience methods: - MovingFunctions.max() - MovingFunctions.min() - MovingFunctions.sum() - MovingFunctions.unweightedAvg() - MovingFunctions.linearWeightedAvg() - MovingFunctions.ewma() - MovingFunctions.holt() - MovingFunctions.holtWinters() - MovingFunctions.stdDev() The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
See elastic/elasticsearch#29594 (cherry picked from commit 6c4f69f)
Documents the removal of the `moving_avg` aggregation in the 8.0 breaking changes. Relates to #29594.
This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points. Through custom script contexts, we also expose a number of convenience methods:
windowMax()
windowMin()
windowSum()
simpleMovAvg()
linearMovAvg()
ewmaMovAvg()
holtMovAvg()
holtWintersMovAvg()
Since we can easily expose moving average functionality through custom script contexts, I would like to deprecate the
moving_avg
agg at the same time if there isn't objection. Thoughts?At the moment, this agg doesn't support "predictions" like the old agg... but I think that is a good thing. The "prediction" functionality is just super basic extrapolation, is fairly confusing for users and of questionable value. It also means we can eventually remove the simulated annealing optimizer, since that was mainly there to help minimize the parameters for holt-winters prediction.
I gutted the math portions of the moving avg models and consolidated them in this PR. There's still a lot of boilerplate cruft from the old models left, but that's just the way it is. In the process I de-generified the models to make it work with MovFn, and generics weren't needed anyway.
This is still a WIP, needs docs, etc. @rjernst would you mind taking a look at the scripting context stuff to make sure it looks ok? I'll track down a search/aggs team member for the rest of the review when it's ready.
This is a reboot of #25137