-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix rate agg with custom _doc_count
#79346
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics) |
So, all aggregators can use the doc_count
@elasticmachine run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments about the code, but I wonder if it'd be cleaner to go about this by linking into the value source resolution mechanism that RateAggregationBuilder#resolveConfig
uses - right now it creates a dummy "unmapped Numeric" values source which resolves missing values to 1
. I wonder if it'd be cleaner to make a DOC_COUNT
values source type and plug it in here. If that worked you wouldn't need any changes to AggregatorBase
or even the numeric aggregator. It'd all be plumbing the doc count provider through the values source infrastructure that all other aggs use.
server/src/main/java/org/elasticsearch/search/aggregations/AggregatorBase.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/aggregations/AggregatorBase.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
compensations.set(bucket, kahanSummation.delta()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things jump out at me here:
- When we're summing doc count or value count we might be better off storing all this in
long
instead ofdouble
. Then we wouldn't need Kahan at all. - This does kahan stuff if there aren't any values for the field. I'll be most of the time folks use this on pretty dense fields. But this isn't a change I'd want to make in our (hopefully) conservative 7.16 release. Again, it probably doesn't have a performance impact in most cases, but 7.16 will live a long long time so the odds are better than normal that it'll come up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good observations.
1. When we're summing doc count or value count we might be better off storing all this in long instead of double. Then we wouldn't need Kahan at all.
This can be a good optimization mostly because we save memory by using a LongArray
instead of two DoubleArray
s. I was just thinking that this could be a separate PR because it requires more subtle handling in methods such as metric(long)
and buildAggregation(long)
.
2. This does kahan stuff if there aren't any values for the field. I'll be most of the time folks use this on pretty dense fields. But this isn't a change I'd want to make in our (hopefully) conservative 7.16 release. Again, it probably doesn't have a performance impact in most cases, but 7.16 will live a long long time so the odds are better than normal that it'll come up.
All "expensive" Kahan computations happen at kahan.add()
method. kahan.reset()
, kahan.value()
and kahan.delta()
are plain setters and getters. The only possible overhead I can see here are the BigArrays
operations. So, I moved those in the conditionals so that they are performed only if values exist.
It felt too risky to move DocCountProvider to BaseAggregator At least for the 7.16 release. NumericRateAggregator has its own DocCountProvider This is a more conservative change.
Do not perform any Kahan computations when no fields exist
_doc_count
@@ -32,24 +36,35 @@ public NumericRateAggregator( | |||
Map<String, Object> metadata | |||
) throws IOException { | |||
super(name, valuesSourceConfig, rateUnit, rateMode, context, parent, metadata); | |||
docCountProvider = new DocCountProvider(); | |||
} | |||
|
|||
@Override | |||
public LeafBucketCollector getLeafCollector(LeafReaderContext ctx, final LeafBucketCollector sub) throws IOException { | |||
final CompensatedSum kahanSummation = new CompensatedSum(0, 0); | |||
final SortedNumericDoubleValues values = ((ValuesSource.Numeric) valuesSource).doubleValues(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we're in computeRateOnDocs
mode having a valuesSource
at all is a little confusing. I think we're kind of stuck with it in the short term because building the doc_count values source is a bit of a big thing, but maybe it's worth a comment or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could actually return a different LefBucketCollector
if you are in computeRateOnDocs
mode. I'm not sure if that'd be clearer. It's more copy and paste or an extra subclass. So maybe not. Probably not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this a good fix for backporting to 7.16. I like the idea of making a doc_count
value source type but that's a bigger project I think. It can wait.
@elasticmachine run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments.
@@ -32,24 +36,35 @@ public NumericRateAggregator( | |||
Map<String, Object> metadata | |||
) throws IOException { | |||
super(name, valuesSourceConfig, rateUnit, rateMode, context, parent, metadata); | |||
docCountProvider = new DocCountProvider(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this if computeRateOnDocs if false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't need the DocCountProvider
object when computeRateOnDocs == false
.
I could replace it with something like docCountProvider = computeRateOnDocs ? new DocCountProvider() : null;
. Only that DocCountProvider
has no state and consumes very little memory. So, I only instatiate it like this for simplicity.
I followed @nik9000 's advice and created a separate LeafBucketCollectorBase
instance for the computeRateOnDocs == true
case. So another approach would be to instantiate the DocCountProvider
as a local variable in the getLeafCollector()
method. However, this would create a new instance for every shard, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the simplest way to approach this would be to move the check for computeRateOnDocs into here and then assign new DocCountProvider();
or null
to docCountProvider, if docCountProvider is not null - we use it if it is null, we go value count route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good approach I had initially thought as well. If I implemented the following lines in the NumericRateAggregator() ctor, I could totally get rid of the computeRateOnDocs
member var:
docCountProvider = (valuesSourceConfig.fieldContext() == null
&& valuesSourceConfig.script() == null
&& valuesSourceConfig.scriptValueType() == null) ? new DocCountProvider() : null;
I just felt that having an explicit variable computeRateOnDocs
would make it look very obvious and simple. While, docCountProvider == null
could happen for possibly other reasons in the future. So, I preferred explicit vs implicit.
I don't have strong feeling about it though. If you think this is the better/simpler approach, I can change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with either. I just don't understand the purpose of valuesSourceConfig.scriptValueType() == null
check. Could you add a test for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not needed. Checking for valuesSourceConfig.fieldContext() == null && valuesSourceConfig.script() == null
is enough similar to RateAggregationBuilder#resolveConfig
Line 193 in 20c9f75
if (field() == null && script() == null) { |
I should remove it.
Now we create two separate LeafBucketCollectorsBase objects. It may look more code, but I think this is cleaner.
* upstream/master: Validate tsdb's routing_path (elastic#79384) Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436) API for adding and removing indices from a data stream (elastic#79279) Exposing the ability to log deprecated settings at non-critical level (elastic#79107) Convert operator privilege license object to LicensedFeature (elastic#79407) Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456) Create cache files with CREATE_NEW & SPARSE options (elastic#79371) Revert "[ML] Use a new annotations index for future annotations (elastic#79151)" [ML] Use a new annotations index for future annotations (elastic#79151) [ML] Removing legacy code from ML/transform auditor (elastic#79434) Fix rate agg with custom `_doc_count` (elastic#79346) Optimize SLM Policy Queries (elastic#79341) Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841) Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227) Do not release snapshot file download permit during recovery retries (elastic#79409) Preserve request headers in a mixed version cluster (elastic#79412) Adjust versions after elastic#79044 backport to 7.x (elastic#79424) Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429) Fail on SSPL licensed x-pack sources (elastic#79348) # Conflicts: # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
When running a
rate
aggregation without setting thefield
parameter, the result is computed based on the bucketdoc_count
.This PR adds support for a custom
_doc_count
field.Closes #77734