Fix rate agg with custom `_doc_count` #79346

csoulios · 2021-10-18T09:59:11Z

When running a rate aggregation without setting the field parameter, the result is computed based on the bucket doc_count.

This PR adds support for a custom _doc_count field.

Closes #77734
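
To make the description concrete, here is a minimal sketch of a doc-count-based rate. It assumes that a document without a `_doc_count` value counts as 1 and that the rate is the bucket total scaled from the bucket duration to the rate unit. The class `DocCountRateSketch` and its methods are hypothetical illustrations, not code from this PR.

```java
import java.util.Arrays;
import java.util.List;

final class DocCountRateSketch {
    /** Sums per-document counts; a null entry stands for a document without `_doc_count` (assumed to count as 1). */
    static long bucketDocCount(List<Integer> docCounts) {
        long total = 0;
        for (Integer count : docCounts) {
            total += (count == null) ? 1 : count;
        }
        return total;
    }

    /** Scales a bucket total to the requested rate unit; both durations are in milliseconds. */
    static double rate(long total, long bucketMillis, long unitMillis) {
        return total * ((double) unitMillis / bucketMillis);
    }

    public static void main(String[] args) {
        // Three documents: one plain document and two pre-aggregated ones.
        List<Integer> bucket = Arrays.asList(null, 5, 10);
        long total = bucketDocCount(bucket);
        System.out.println("doc count: " + total);
        // Per-second rate over a one-minute bucket.
        System.out.println("rate: " + rate(total, 60_000, 1_000));
    }
}
```

Counting each document as 1 regardless of `_doc_count` would give 3 for this bucket instead of 16, which is the kind of mismatch the description refers to.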

elasticmachine · 2021-10-18T09:59:16Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

So, all aggregators can use the doc_count

csoulios · 2021-10-18T13:59:16Z

@elasticmachine run elasticsearch-ci/bwc

nik9000

I left some comments about the code, but I wonder if it'd be cleaner to go about this by linking into the value source resolution mechanism that RateAggregationBuilder#resolveConfig uses - right now it creates a dummy "unmapped Numeric" values source which resolves missing values to 1. I wonder if it'd be cleaner to make a DOC_COUNT values source type and plug it in here. If that worked you wouldn't need any changes to AggregatorBase or even the numeric aggregator. It'd all be plumbing the doc count provider through the values source infrastructure that all other aggs use.

server/src/main/java/org/elasticsearch/search/aggregations/AggregatorBase.java

nik9000 · 2021-10-18T13:47:43Z

...in/analytics/src/main/java/org/elasticsearch/xpack/analytics/rate/NumericRateAggregator.java

                }
+
+                compensations.set(bucket, kahanSummation.delta());


Two things jump out at me here:

When we're summing doc count or value count we might be better off storing all this in long instead of double. Then we wouldn't need Kahan at all.

This does kahan stuff if there aren't any values for the field. I'll bet most of the time folks use this on pretty dense fields. But this isn't a change I'd want to make in our (hopefully) conservative 7.16 release. Again, it probably doesn't have a performance impact in most cases, but 7.16 will live a long long time so the odds are better than normal that it'll come up.
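
The first point can be illustrated with a small sketch: integer counts accumulated in a `long` stay exact, while a `double` accumulator silently drops small increments once the total passes 2^53. The class and method names below are hypothetical and are not part of the aggregator.

```java
final class CountAccumulatorSketch {
    /** Adds a count to a double accumulator, as a double-based sum would. */
    static double addToDouble(double sum, long count) {
        return sum + count;
    }

    /** Adds a count to a long accumulator; throws on overflow instead of losing precision. */
    static long addToLong(long sum, long count) {
        return Math.addExact(sum, count);
    }

    public static void main(String[] args) {
        long big = 1L << 53; // first magnitude at which a double cannot represent every integer
        // The increment is lost in double arithmetic.
        System.out.println(addToDouble((double) big, 1) == (double) big);
        // The increment is preserved in long arithmetic.
        System.out.println(addToLong(big, 1) == big + 1);
    }
}
```

Since the `long` path has no rounding error to begin with, there is nothing for a compensation term to correct.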

Very good observations.

1. When we're summing doc count or value count we might be better off storing all this in long instead of double. Then we wouldn't need Kahan at all.

This can be a good optimization mostly because we save memory by using a LongArray instead of two DoubleArrays. I was just thinking that this could be a separate PR because it requires more subtle handling in methods such as metric(long) and buildAggregation(long).

2. This does kahan stuff if there aren't any values for the field. I'll bet most of the time folks use this on pretty dense fields. But this isn't a change I'd want to make in our (hopefully) conservative 7.16 release. Again, it probably doesn't have a performance impact in most cases, but 7.16 will live a long long time so the odds are better than normal that it'll come up.

All "expensive" Kahan computations happen in the kahan.add() method. kahan.reset(), kahan.value() and kahan.delta() are plain setters and getters. The only possible overhead I can see here is the BigArrays operations. So, I moved those inside the conditionals so that they are performed only if values exist.
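
As a rough illustration of this reply, the sketch below uses a textbook Kahan accumulator in which only `add()` does arithmetic, and guards the per-bucket reads and writes so they run only when a document has values. `KahanSketch` is a hypothetical stand-in, not Elasticsearch's `CompensatedSum`, and plain `double[]` arrays stand in for the BigArrays-backed storage.

```java
final class KahanSketch {
    private double value; // running sum
    private double delta; // running compensation for lost low-order bits

    void reset(double value, double delta) { this.value = value; this.delta = delta; }
    double value() { return value; }
    double delta() { return delta; }

    /** The only method that does real work: one textbook Kahan summation step. */
    void add(double v) {
        double corrected = v - delta;
        double updated = value + corrected;
        delta = (updated - value) - corrected;
        value = updated;
    }

    /** Collects one document; returns false and leaves the arrays untouched when it has no values. */
    static boolean collect(double[] docValues, int bucket, double[] sums, double[] compensations) {
        if (docValues.length == 0) {
            return false; // no array reads or writes for empty documents
        }
        KahanSketch kahan = new KahanSketch();
        kahan.reset(sums[bucket], compensations[bucket]);
        for (double v : docValues) {
            kahan.add(v);
        }
        sums[bucket] = kahan.value();
        compensations[bucket] = kahan.delta();
        return true;
    }

    public static void main(String[] args) {
        double[] sums = new double[1];
        double[] compensations = new double[1];
        collect(new double[0], 0, sums, compensations);               // skipped entirely
        collect(new double[] {1.0, 2.0, 3.0}, 0, sums, compensations);
        System.out.println(sums[0]);
    }
}
```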

It felt too risky to move DocCountProvider to AggregatorBase, at least for the 7.16 release. NumericRateAggregator has its own DocCountProvider. This is a more conservative change.

Do not perform any Kahan computations when no fields exist

nik9000 · 2021-10-18T16:24:20Z

...in/analytics/src/main/java/org/elasticsearch/xpack/analytics/rate/NumericRateAggregator.java

@@ -32,24 +36,35 @@ public NumericRateAggregator(
        Map<String, Object> metadata
    ) throws IOException {
        super(name, valuesSourceConfig, rateUnit, rateMode, context, parent, metadata);
+        docCountProvider = new DocCountProvider();
    }

    @Override
    public LeafBucketCollector getLeafCollector(LeafReaderContext ctx, final LeafBucketCollector sub) throws IOException {
        final CompensatedSum kahanSummation = new CompensatedSum(0, 0);
        final SortedNumericDoubleValues values = ((ValuesSource.Numeric) valuesSource).doubleValues(ctx);


When we're in computeRateOnDocs mode having a valuesSource at all is a little confusing. I think we're kind of stuck with it in the short term because building the doc_count values source is a bit of a big thing, but maybe it's worth a comment or something.

You could actually return a different LeafBucketCollector if you are in computeRateOnDocs mode. I'm not sure if that'd be clearer. It's more copy and paste or an extra subclass. So maybe not. Probably not.
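
The suggestion above could look roughly like the following sketch, which picks one of two collectors once per leaf instead of branching per document. The `LeafCollector` interface and the lookup functions are hypothetical simplifications, not the Elasticsearch `LeafBucketCollector` API.

```java
import java.util.function.IntFunction;
import java.util.function.IntToLongFunction;

final class LeafCollectorChoiceSketch {
    /** Hypothetical, simplified collector: adds one document's contribution to a bucket. */
    interface LeafCollector {
        void collect(int doc, int bucket);
    }

    static LeafCollector leafCollector(
        boolean computeRateOnDocs,
        double[] sums,
        IntToLongFunction docCounts,        // per-document doc count lookup
        IntFunction<double[]> fieldValues   // per-document field values lookup
    ) {
        if (computeRateOnDocs) {
            // Doc-count path: the field values are never consulted.
            return (doc, bucket) -> sums[bucket] += docCounts.applyAsLong(doc);
        }
        // Value path: sum every value of the field for the document.
        return (doc, bucket) -> {
            for (double v : fieldValues.apply(doc)) {
                sums[bucket] += v;
            }
        };
    }

    public static void main(String[] args) {
        double[] sums = new double[2];
        leafCollector(true, sums, doc -> 3L, doc -> new double[] {10.0}).collect(0, 0);
        leafCollector(false, sums, doc -> 3L, doc -> new double[] {10.0, 2.5}).collect(0, 1);
        System.out.println(sums[0] + " " + sums[1]);
    }
}
```

The cost is some duplicated structure between the two branches, which is the trade-off the comment weighs.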

nik9000

I think this is a good fix for backporting to 7.16. I like the idea of making a doc_count value source type but that's a bigger project I think. It can wait.

csoulios · 2021-10-18T17:55:27Z

@elasticmachine run elasticsearch-ci/bwc

imotov

Added some comments.

imotov · 2021-10-18T18:17:48Z

...in/analytics/src/main/java/org/elasticsearch/xpack/analytics/rate/NumericRateAggregator.java

@@ -32,24 +36,35 @@ public NumericRateAggregator(
        Map<String, Object> metadata
    ) throws IOException {
        super(name, valuesSourceConfig, rateUnit, rateMode, context, parent, metadata);
+        docCountProvider = new DocCountProvider();


Do we need this if computeRateOnDocs is false?

No, we don't need the DocCountProvider object when computeRateOnDocs == false.

I could replace it with something like docCountProvider = computeRateOnDocs ? new DocCountProvider() : null;. However, DocCountProvider has no state and consumes very little memory, so I only instantiate it like this for simplicity.

I followed @nik9000 's advice and created a separate LeafBucketCollectorBase instance for the computeRateOnDocs == true case. So another approach would be to instantiate the DocCountProvider as a local variable in the getLeafCollector() method. However, this would create a new instance for every shard, I guess.

I think the simplest way to approach this would be to move the check for computeRateOnDocs into here and then assign either new DocCountProvider() or null to docCountProvider. If docCountProvider is not null, we use it; if it is null, we go the value count route.

That's a good approach; I had initially thought of it as well. If I implemented the following line in the NumericRateAggregator() ctor, I could totally get rid of the computeRateOnDocs member var:

docCountProvider = (valuesSourceConfig.fieldContext() == null && valuesSourceConfig.script() == null && valuesSourceConfig.scriptValueType() == null) ? new DocCountProvider() : null;

I just felt that having an explicit variable computeRateOnDocs would make it look very obvious and simple, whereas docCountProvider == null could possibly happen for other reasons in the future. So, I preferred explicit over implicit.

I don't have strong feelings about it though. If you think this is the better/simpler approach, I can change it.

I am ok with either. I just don't understand the purpose of valuesSourceConfig.scriptValueType() == null check. Could you add a test for it?

This is probably not needed. Checking for valuesSourceConfig.fieldContext() == null && valuesSourceConfig.script() == null is enough, similar to RateAggregationBuilder#resolveConfig:

elasticsearch/x-pack/plugin/analytics/src/main/java/org/elasticsearch/xpack/analytics/rate/RateAggregationBuilder.java

Line 193 in 20c9f75

if (field() == null && script() == null) {

I should remove it.
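
The two options discussed in this thread can be compared in a small sketch. It assumes the mode is decided only by whether a field or a script is configured, as in the `resolveConfig` line quoted above; the class, the `String` parameters and the placeholder provider object are hypothetical simplifications.

```java
final class RateModeSketch {
    /** Explicit variant: a named flag states the mode directly. */
    static boolean computeRateOnDocs(String field, String script) {
        return field == null && script == null;
    }

    /** Implicit variant: the mode is inferred from whether a provider object exists. */
    static Object docCountProviderOrNull(String field, String script) {
        // A plain Object stands in for the provider; null means "use the value count route".
        return (field == null && script == null) ? new Object() : null;
    }

    public static void main(String[] args) {
        System.out.println(computeRateOnDocs(null, null));                // doc-count mode
        System.out.println(computeRateOnDocs("price", null));             // field configured
        System.out.println(docCountProviderOrNull(null, "doc['x'].value") == null);
    }
}
```

Both variants encode the same decision; the explicit flag keeps the meaning of a null provider from being overloaded later.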

Now we create two separate LeafBucketCollectorBase objects. It may look like more code, but I think this is cleaner.

Backports #79346 to 7.x. When running a rate aggregation without setting the field parameter, the result is computed based on the bucket doc_count. This PR adds support for a custom _doc_count field. Closes #77734

Now that the rate agg fix (#79346) was backported to v7.16 (#79449), we change the minimum version for the test

csoulios added 4 commits October 18, 2021 12:51

Fix typo in docs

72b4c3b

Fix typo in docs

3b4064d

Added tests to reproduce bug

23b9c78

Compute rate based on doc counts

4f04d77

csoulios added >bug :Analytics/Aggregations Aggregations v8.0.0 v7.16.0 labels Oct 18, 2021

csoulios requested a review from nik9000 October 18, 2021 09:59

elasticmachine added the Team:Analytics label Oct 18, 2021

csoulios marked this pull request as draft October 18, 2021 09:59

Pulled DocCountProvider up to AggregatorBase

7c3a73d

So, all aggregators can use the doc_count

csoulios marked this pull request as ready for review October 18, 2021 12:50

nik9000 requested a review from imotov October 18, 2021 13:32

nik9000 reviewed Oct 18, 2021

csoulios added 3 commits October 18, 2021 18:29

Moved DocCountProvider back to BucketAggregator

07ddb9d

It felt too risky to move DocCountProvider to AggregatorBase, at least for the 7.16 release. NumericRateAggregator has its own DocCountProvider. This is a more conservative change.

Merge branch 'master' into fix-rate-agg

206f647

Remove unneeded Kahan computations

8cbcf50

Do not perform any Kahan computations when no fields exist

csoulios changed the title ~~Fix rate agg with custom _doc_count~~ Fix rate agg with custom _doc_count Oct 18, 2021

nik9000 reviewed Oct 18, 2021

csoulios requested a review from nik9000 October 18, 2021 16:43

nik9000 approved these changes Oct 18, 2021

imotov reviewed Oct 18, 2021

csoulios added 2 commits October 18, 2021 21:47

Separated the code paths for doc_count and values

e95753a

Now we create two separate LeafBucketCollectorBase objects. It may look like more code, but I think this is cleaner.

Merge branch 'master' into fix-rate-agg

47af043

csoulios requested a review from imotov October 18, 2021 18:58

Do not check valuesSourceConfig.scriptValueType

a77f6af

csoulios added 2 commits October 19, 2021 10:52

Merge branch 'master' into fix-rate-agg

2d0ea4e

Minor change

b941998

csoulios merged commit de93d95 into elastic:master Oct 19, 2021

csoulios deleted the fix-rate-agg branch October 19, 2021 10:25

csoulios added the backport pending label Oct 19, 2021

csoulios mentioned this pull request Oct 19, 2021

[7.x] Fix rate agg with custom _doc_count #79449

Merged

csoulios removed the backport pending label Oct 19, 2021

csoulios mentioned this pull request Oct 19, 2021

Enable rate agg test for 7.16 #79471

Merged

csoulios added a commit that referenced this pull request Oct 20, 2021

Enable rate agg test for 7.16 (#79471)

7642e78

Now that the rate agg fix (#79346) was backported to v7.16 (#79449), we change the minimum version for the test

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021
