Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rate agg with custom _doc_count #79346

Merged
merged 13 commits into from
Oct 19, 2021
Merged

Conversation

csoulios
Copy link
Contributor

@csoulios csoulios commented Oct 18, 2021

When running a rate aggregation without setting the field parameter, the result is computed based on the bucket doc_count.

This PR adds support for a custom _doc_count field.

Closes #77734

@csoulios csoulios requested a review from nik9000 October 18, 2021 09:59
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 18, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@csoulios csoulios marked this pull request as draft October 18, 2021 09:59
So, all aggregators can use the doc_count
@csoulios csoulios marked this pull request as ready for review October 18, 2021 12:50
@nik9000 nik9000 requested a review from imotov October 18, 2021 13:32
@csoulios
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments about the code, but I wonder if it'd be cleaner to go about this by linking into the value source resolution mechanism that RateAggregationBuilder#resolveConfig uses - right now it creates a dummy "unmapped Numeric" values source which resolves missing values to 1. I wonder if it'd be cleaner to make a DOC_COUNT values source type and plug it in here. If that worked you wouldn't need any changes to AggregatorBase or even the numeric aggregator. It'd all be plumbing the doc count provider through the values source infrastructure that all other aggs use.

}

compensations.set(bucket, kahanSummation.delta());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things jump out at me here:

  1. When we're summing doc count or value count we might be better off storing all this in long instead of double. Then we wouldn't need Kahan at all.
  2. This does kahan stuff if there aren't any values for the field. I'll be most of the time folks use this on pretty dense fields. But this isn't a change I'd want to make in our (hopefully) conservative 7.16 release. Again, it probably doesn't have a performance impact in most cases, but 7.16 will live a long long time so the odds are better than normal that it'll come up.

Copy link
Contributor Author

@csoulios csoulios Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good observations.

1. When we're summing doc count or value count we might be better off storing all this in long instead of double. Then we wouldn't need Kahan at all.

This can be a good optimization mostly because we save memory by using a LongArray instead of two DoubleArrays. I was just thinking that this could be a separate PR because it requires more subtle handling in methods such as metric(long) and buildAggregation(long).

2. This does kahan stuff if there aren't any values for the field. I'll be most of the time folks use this on pretty dense fields. But this isn't a change I'd want to make in our (hopefully) conservative 7.16 release. Again, it probably doesn't have a performance impact in most cases, but 7.16 will live a long long time so the odds are better than normal that it'll come up.

All "expensive" Kahan computations happen at kahan.add() method. kahan.reset(), kahan.value() and kahan.delta() are plain setters and getters. The only possible overhead I can see here are the BigArrays operations. So, I moved those in the conditionals so that they are performed only if values exist.

It felt too risky to move DocCountProvider to BaseAggregator
At least for the 7.16 release.

NumericRateAggregator has its own DocCountProvider
This is a more conservative change.
Do not perform any Kahan computations when no fields exist
@csoulios csoulios changed the title Fix rate agg with custom _doc_count Fix rate agg with custom _doc_count Oct 18, 2021
@@ -32,24 +36,35 @@ public NumericRateAggregator(
Map<String, Object> metadata
) throws IOException {
super(name, valuesSourceConfig, rateUnit, rateMode, context, parent, metadata);
docCountProvider = new DocCountProvider();
}

@Override
public LeafBucketCollector getLeafCollector(LeafReaderContext ctx, final LeafBucketCollector sub) throws IOException {
final CompensatedSum kahanSummation = new CompensatedSum(0, 0);
final SortedNumericDoubleValues values = ((ValuesSource.Numeric) valuesSource).doubleValues(ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we're in computeRateOnDocs mode having a valuesSource at all is a little confusing. I think we're kind of stuck with it in the short term because building the doc_count values source is a bit of a big thing, but maybe it's worth a comment or something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could actually return a different LefBucketCollector if you are in computeRateOnDocs mode. I'm not sure if that'd be clearer. It's more copy and paste or an extra subclass. So maybe not. Probably not.

@csoulios csoulios requested a review from nik9000 October 18, 2021 16:43
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this a good fix for backporting to 7.16. I like the idea of making a doc_count value source type but that's a bigger project I think. It can wait.

@csoulios
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/bwc

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments.

@@ -32,24 +36,35 @@ public NumericRateAggregator(
Map<String, Object> metadata
) throws IOException {
super(name, valuesSourceConfig, rateUnit, rateMode, context, parent, metadata);
docCountProvider = new DocCountProvider();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this if computeRateOnDocs if false?

Copy link
Contributor Author

@csoulios csoulios Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't need the DocCountProvider object when computeRateOnDocs == false.

I could replace it with something like docCountProvider = computeRateOnDocs ? new DocCountProvider() : null;. Only that DocCountProvider has no state and consumes very little memory. So, I only instatiate it like this for simplicity.

I followed @nik9000 's advice and created a separate LeafBucketCollectorBase instance for the computeRateOnDocs == true case. So another approach would be to instantiate the DocCountProvider as a local variable in the getLeafCollector() method. However, this would create a new instance for every shard, I guess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the simplest way to approach this would be to move the check for computeRateOnDocs into here and then assign new DocCountProvider(); or null to docCountProvider, if docCountProvider is not null - we use it if it is null, we go value count route.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good approach I had initially thought as well. If I implemented the following lines in the NumericRateAggregator() ctor, I could totally get rid of the computeRateOnDocs member var:

        docCountProvider = (valuesSourceConfig.fieldContext() == null
            && valuesSourceConfig.script() == null
            && valuesSourceConfig.scriptValueType() == null) ? new DocCountProvider() : null;

I just felt that having an explicit variable computeRateOnDocs would make it look very obvious and simple. While, docCountProvider == null could happen for possibly other reasons in the future. So, I preferred explicit vs implicit.

I don't have strong feeling about it though. If you think this is the better/simpler approach, I can change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with either. I just don't understand the purpose of valuesSourceConfig.scriptValueType() == null check. Could you add a test for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not needed. Checking for valuesSourceConfig.fieldContext() == null && valuesSourceConfig.script() == null is enough similar to RateAggregationBuilder#resolveConfig

I should remove it.

Now we create two separate LeafBucketCollectorsBase objects.
It may look more code, but I think this is cleaner.
@csoulios csoulios requested a review from imotov October 18, 2021 18:58
@csoulios csoulios merged commit de93d95 into elastic:master Oct 19, 2021
@csoulios csoulios deleted the fix-rate-agg branch October 19, 2021 10:25
csoulios added a commit that referenced this pull request Oct 19, 2021
Backports #79346 to 7.x

    When running a rate aggregation without setting the field parameter, the result is computed based on the bucket doc_count.

    This PR adds support for a custom _doc_count field.

    Closes #77734
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 19, 2021
* upstream/master:
  Validate tsdb's routing_path (elastic#79384)
  Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436)
  API for adding and removing indices from a data stream (elastic#79279)
  Exposing the ability to log deprecated settings at non-critical level (elastic#79107)
  Convert operator privilege license object to LicensedFeature (elastic#79407)
  Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456)
  Create cache files with CREATE_NEW & SPARSE options (elastic#79371)
  Revert "[ML] Use a new annotations index for future annotations (elastic#79151)"
  [ML] Use a new annotations index for future annotations (elastic#79151)
  [ML] Removing legacy code from ML/transform auditor (elastic#79434)
  Fix rate agg with custom `_doc_count` (elastic#79346)
  Optimize SLM Policy Queries (elastic#79341)
  Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841)
  Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227)
  Do not release snapshot file download permit during recovery retries (elastic#79409)
  Preserve request headers in a mixed version cluster (elastic#79412)
  Adjust versions after elastic#79044 backport to 7.x (elastic#79424)
  Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429)
  Fail on SSPL licensed x-pack sources (elastic#79348)

# Conflicts:
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
csoulios added a commit that referenced this pull request Oct 20, 2021
Now that the rate agg fix (#79346) was backported to v7.16 (#79449), we change the minimum version
for the test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect rate agg with custom _doc_count
5 participants