Consolidate BoundedAggregateStream #6932

Merged: 12 commits into apache:main from alamb/consolidated_streaming on Jul 19, 2023

Conversation

@alamb (Contributor) commented Jul 12, 2023:

Which issue does this PR close?

Closes #6798

Rationale for this change

See #6798 -- basically BoundedAggregateStream is a lot of copy/pasted code from the old group by hash implementation, and its existence prevents me from deleting RowFormat.

Also, I think this will have the very nice side benefit of being significantly faster (due to not using ScalarValue to compare order by keys).

And it takes less code

What changes are included in this PR?

  1. Add GroupOrdering to encapsulate tracking the state of any ordering
  2. Update GroupedAggregateStream to use GroupOrdering when available
  3. Remove BoundedAggregateStream code

I plan to remove the Row format in a follow-on PR (#6968) to keep this one reasonably sized.
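
For orientation, here is a rough sketch of the shape `GroupOrdering` could take. The `current_sort` and `sort_key` fields are quoted from the diff discussed below; the variant set and the `GroupOrderingFull` field are illustrative assumptions rather than the PR's exact definitions:

```rust
use arrow::row::OwnedRow;

/// What is known about the ordering of the group keys (sketch)
pub enum GroupOrdering {
    /// Input is not sorted on the group keys; no group can be emitted
    /// before the input is exhausted
    None,
    /// Input is sorted on a prefix of the group keys
    Partial(GroupOrderingPartial),
    /// Input is sorted on all group keys; a group is complete as soon
    /// as its sort key changes
    Full(GroupOrderingFull),
}

/// Ordering state for the prefix-sorted case (fields quoted from the PR)
pub struct GroupOrderingPartial {
    /// first group index with the sort_key
    current_sort: usize,
    /// The sort key of group_index `current_sort`
    sort_key: OwnedRow,
}

/// Ordering state for the fully-sorted case (field is an assumption)
pub struct GroupOrderingFull {
    /// index of the most recently created group; all earlier groups
    /// can no longer receive input
    current: usize,
}
```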

Potential follow on work

There is a tradeoff between how quickly the hash table is flushed (e.g. for low latency streaming use cases) and the overhead of generating output / resetting state. In this PR I took the same approach as BoundedAggregateStream, which is to flush any completed groups as soon as possible. It might be worth adding a config parameter to trade off latency against efficiency, as sketched below.
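
To make the tradeoff concrete, a purely hypothetical sketch of such a config parameter -- nothing like this exists in the PR or in DataFusion's config:

```rust
/// Hypothetical knob for the latency/efficiency tradeoff described
/// above; this struct does not exist, it only illustrates the idea
pub struct StreamingEmitConfig {
    /// Flush completed groups only once at least this many have
    /// accumulated: `1` matches this PR (emit as soon as possible),
    /// larger values amortize output generation and state resets
    pub min_completed_groups: usize,
}

impl Default for StreamingEmitConfig {
    fn default() -> Self {
        Self {
            min_completed_groups: 1,
        }
    }
}
```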

Performance Results

I ran the benchmarks and the results showed basically no change (I am seeing significant variance in test speeds).

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ alamb_consolidated_streaming ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  538.24ms │                     542.37ms │     no change │
│ QQuery 2     │  154.76ms │                     159.65ms │     no change │
│ QQuery 3     │  158.91ms │                     163.08ms │     no change │
│ QQuery 4     │  116.72ms │                     114.99ms │     no change │
│ QQuery 5     │  378.90ms │                     388.11ms │     no change │
│ QQuery 6     │   41.26ms │                      40.88ms │     no change │
│ QQuery 7     │  853.86ms │                     835.10ms │     no change │
│ QQuery 8     │  241.75ms │                     241.55ms │     no change │
│ QQuery 9     │  540.67ms │                     548.33ms │     no change │
│ QQuery 10    │  305.35ms │                     331.19ms │  1.08x slower │
│ QQuery 11    │  164.17ms │                     163.74ms │     no change │
│ QQuery 12    │  165.40ms │                     167.82ms │     no change │
│ QQuery 13    │  314.07ms │                     292.19ms │ +1.07x faster │
│ QQuery 14    │   49.98ms │                      46.03ms │ +1.09x faster │
│ QQuery 15    │   52.75ms │                      58.21ms │  1.10x slower │
│ QQuery 16    │  160.10ms │                     161.86ms │     no change │
│ QQuery 17    │  920.34ms │                     836.55ms │ +1.10x faster │
│ QQuery 18    │ 1570.58ms │                    1522.12ms │     no change │
│ QQuery 19    │  166.14ms │                     168.50ms │     no change │
│ QQuery 20    │  312.45ms │                     331.00ms │  1.06x slower │
│ QQuery 21    │ 1059.63ms │                    1062.09ms │     no change │
│ QQuery 22    │   84.26ms │                      85.10ms │     no change │
└──────────────┴───────────┴──────────────────────────────┴───────────────┘

Are these changes tested?

Existing tests

Are there any user-facing changes?

Faster performance, smaller code size

@github-actions bot added the physical-expr (Physical Expressions), core (Core DataFusion crate), and sqllogictest (SQL Logic Tests (.slt)) labels Jul 12, 2023
@alamb force-pushed the alamb/consolidated_streaming branch 4 times, most recently from e429743 to 2da356c on July 14, 2023 18:41
@alamb force-pushed the alamb/consolidated_streaming branch 2 times, most recently from 2c013a1 to aeac248 on July 15, 2023 12:44
@github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jul 15, 2023
@alamb force-pushed the alamb/consolidated_streaming branch 2 times, most recently from b42fc16 to 17cc833 on July 15, 2023 21:06
@alamb force-pushed the alamb/consolidated_streaming branch from 17cc833 to 5b874da on July 15, 2023 21:10
@@ -1116,6 +1104,7 @@ fn create_accumulators(
.collect::<Result<Vec<_>>>()
}

#[allow(dead_code)]

@alamb (author):

This will be removed in #6968

@@ -0,0 +1,170 @@
// Licensed to the Apache Software Foundation (ASF) under one

@alamb (author):

This design tries to keep all the ordering state isolated from the main row_hash.rs logic, to manage the complexity.
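
To illustrate the isolation, a minimal sketch of the fully-ordered case, building on the `GroupOrdering` enum sketched in the description above (the method names here are assumptions):

```rust
impl GroupOrderingFull {
    /// Record the groups created while processing a batch. With a
    /// total order on the group keys, every group except the most
    /// recently seen one can no longer receive input.
    fn new_groups(&mut self, group_indices: &[usize]) {
        if let Some(max) = group_indices.iter().max() {
            self.current = *max;
        }
    }

    /// Number of leading groups that are complete and safe to emit
    fn emittable(&self) -> usize {
        self.current
    }
}
```

row_hash.rs then only needs a call or two per batch, with all the ordering bookkeeping hidden behind the `GroupOrdering` dispatch.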


if let Err(e) = result {
return Poll::Ready(Some(Err(e)));
// Do the grouping

@alamb (author):

The main state machine is updated to emit data when possible (so it needs to transition back and forth between input and emitting)
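
One way to structure that back-and-forth (a sketch; the state names are assumptions, not necessarily what the PR uses):

```rust
use arrow::record_batch::RecordBatch;

/// Sketch of a poll-loop state machine that alternates between
/// consuming input and emitting completed groups
enum ExecutionState {
    /// Pull the next input batch; switch to ProducingOutput whenever
    /// the ordering information says some leading groups are complete
    ReadingInput,
    /// Emit this batch, then return to ReadingInput (or to Done once
    /// the input is exhausted)
    ProducingOutput(RecordBatch),
    /// All input consumed and all groups emitted
    Done,
}
```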

/// Optional ordering information, that might allow groups to be
/// emitted from the hash table prior to seeing the end of the
/// input
group_ordering: GroupOrdering,

@alamb (author):

I am quite pleased that the extra ordering state is tracked in a single struct

) -> Result<()> {
// Convert the group keys into the row format
// Avoid reallocation when https://github.com/apache/arrow-rs/issues/4479 is available
let group_rows = self.row_converter.convert_columns(group_values)?;
let n_rows = group_rows.num_rows();

// track memory used
let group_values_size_pre = self.group_values.size();
let scratch_size_pre = self.scratch_space.size();
memory_delta.dec(self.state_size());

@alamb (author):

I needed to update the memory accounting logic because previously the group operator never decreased its memory reservation; now that state is cleared early, the reservation needs to shrink as well.
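
A sketch of delta-based accounting consistent with the `inc`/`dec` calls visible in the diff (the field names and the `update` method are assumptions):

```rust
use datafusion::error::Result;
use datafusion::execution::memory_pool::MemoryReservation;

/// Accumulates memory growth and shrinkage while processing a batch,
/// then applies the net change to the operator's reservation once
#[derive(Debug, Default)]
pub struct MemoryDelta {
    grown: usize,
    shrunk: usize,
}

impl MemoryDelta {
    /// Record newly allocated bytes
    pub fn inc(&mut self, bytes: usize) {
        self.grown += bytes;
    }

    /// Record freed bytes, e.g. for groups that were emitted early
    pub fn dec(&mut self, bytes: usize) {
        self.shrunk += bytes;
    }

    /// Apply the accumulated change to the reservation
    pub fn update(self, reservation: &mut MemoryReservation) -> Result<()> {
        reservation.try_grow(self.grown)?;
        reservation.shrink(self.shrunk);
        Ok(())
    }
}
```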


// Update ordering information if necessary
let total_num_groups = self.group_values.num_rows();
if total_num_groups > starting_num_groups {

@alamb (author):

Here is where the ordering information is updated; the overhead is only a few checks per batch when there is no ordering.

@@ -71,7 +71,7 @@ impl AccumulatorState {
fn size(&self) -> usize {
self.accumulator.size()
+ std::mem::size_of_val(self)
+ std::mem::size_of::<u32>() * self.indices.capacity()
+ self.indices.allocated_size()

@alamb (author):

This adapter needs to be updated to account for the fact that groups can be removed and thus memory freed
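
The idea behind the `allocated_size` call in the diff, as a sketch (DataFusion's actual helper lives elsewhere and may differ):

```rust
/// Extension trait reporting the heap memory a Vec actually occupies
pub trait VecAllocExt {
    fn allocated_size(&self) -> usize;
}

impl<T> VecAllocExt for Vec<T> {
    /// Size based on capacity, not length: reserved-but-unused space
    /// still counts against the memory budget, and the figure shrinks
    /// when entries for emitted groups are freed
    fn allocated_size(&self) -> usize {
        std::mem::size_of::<T>() * self.capacity()
    }
}
```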

/// first group index with the sort_key
current_sort: usize,
/// The sort key of group_index `current_sort`
sort_key: OwnedRow,

@alamb (author):

I expect this code to be quite a bit faster than the current BoundedAggregateStream as it uses the row format rather than ScalarValue, but I don't know of any benchmarks of the streaming group by code.

@tustvold (reviewer):

It occurs to me that a potentially faster way to detect the group boundaries would be to apply the inequality kernel to the sort columns offset by one row with respect to each other, OR the results together, and then iterate over the set bits.

There would be some subtlety to handle nulls correctly, but it would likely be significantly faster and would not require converting to the row format

We could likely do something similar for window functions if we aren't already
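
A sketch of that idea using arrow's comparison kernels as they existed at the time (`neq_dyn` and `or`); the null subtlety mentioned above is deliberately glossed over here:

```rust
use arrow::array::{ArrayRef, BooleanArray};
use arrow::compute::{neq_dyn, or};
use arrow::error::Result;

/// Find the row indices where a new group starts, by comparing each
/// sort column against itself offset by one row and ORing the results.
/// Nulls need extra care in real code: `neq_dyn` yields null, not
/// true/false, when either side is null.
fn group_boundaries(sort_columns: &[ArrayRef]) -> Result<Vec<usize>> {
    let n = sort_columns[0].len();
    if n < 2 {
        return Ok(vec![]);
    }
    let mut changed = BooleanArray::from(vec![false; n - 1]);
    for col in sort_columns {
        // rows 0..n-1 vs rows 1..n: true where adjacent values differ
        let diff = neq_dyn(&col.slice(0, n - 1), &col.slice(1, n - 1))?;
        changed = or(&changed, &diff)?;
    }
    // a set bit at i means a new group starts at row i + 1
    Ok(changed
        .iter()
        .enumerate()
        .filter_map(|(i, c)| (c == Some(true)).then_some(i + 1))
        .collect())
}
```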

@alamb (author):

This is an excellent idea -- I filed it as #7023 for follow on work

From my perspective this PR will already be faster than the existing streaming group by (which uses ScalarValue to track the sort keys), so I think it is acceptable to merge as is.

@mustafasrepo (Contributor) left a comment:

I have left minor comments, nothing important. This PR eases maintenance, improves readability, and decreases code size. Thanks @alamb for this great work.

@alamb (author) commented Jul 17, 2023:

BTW @mustafasrepo and @ozankabak I meant to mention this before but the tests in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs helped me implement and debug this code very nicely 👌

Thank you for that

@alamb (author) left a comment:

I plan to leave this open for at least another day. cc @mingmwang or @Dandandan or @yahoNanJing in case you are interested.

@alamb (author) commented Jul 18, 2023:

I plan to merge this in tomorrow morning unless anyone else wants additional time to review

@tustvold (Contributor) left a comment:

I'm not an authority on this code, but left some comments. I think this could possibly be implemented without needing the row format at all, which would likely be close to optimal from a performance standpoint

datafusion/core/src/physical_plan/aggregates/order/full.rs (comment thread outdated and resolved)
/// │┌───┐│ │ ┌──────────────┐ │ │ ┗━━━━━━━━━━━━━━━━━┛ ┗━━━━━━━┛
/// ││ 0 ││ │ │ 123, "MA" │ │ │ current_sort sort_key
/// │└───┘│ │ └──────────────┘ │ │
/// │ ... │ │ ... │ │ current_sort tracks the most

@tustvold (reviewer):

This states "most recent" but I think it means "oldest" or possibly "smallest"?

@alamb (author):

Agreed -- changed to 'smallest' in ddaf0e7


@@ -303,6 +322,16 @@ fn create_group_accumulator(
}
}

/// Extracts a successful Ok(_) or returns Poll::Ready(Some(Err(e))) with errors
macro_rules! extract_ok {

@tustvold (reviewer):

One way to avoid this is to extract a sync function that returns a Result, allowing the use of ? and then map as necessary
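
A sketch of that alternative (`group_aggregate_batch` is a stand-in name, not necessarily the PR's actual function):

```rust
use arrow::record_batch::RecordBatch;
use datafusion::error::Result;

/// The fallible per-batch work as a sync fn returning Result, so `?`
/// can be used freely inside instead of a macro
fn group_aggregate_batch(_batch: &RecordBatch) -> Result<()> {
    // hash the group keys, update accumulators, update ordering state...
    Ok(())
}

// Then poll_next maps the error into the stream item type at a single
// point, mirroring the code quoted earlier in this conversation:
//
//     if let Err(e) = self.group_aggregate_batch(&batch) {
//         return Poll::Ready(Some(Err(e)));
//     }
```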

@alamb (author):

I tried this -- I am not sure it is better. I will put up a follow-on PR: #7025


// If we can begin emitting rows, do so,
// otherwise keep consuming input
let to_emit = if self.input_done {

@tustvold (reviewer):

How would this branch be reached?

@alamb (author):

I agree -- input_done can't be true here as we just got a batch. I will change it to an assert. 372196e

}
EmitTo::First(n) => {
// Clear out first n group keys by copying them to a new Rows.
// TODO file some ticket in arrow-rs to make this more efficent?

@tustvold (reviewer):

I'm not sure this can be made efficient, but I also am not sure this aggregator needs to use the row format at all

@alamb (author):

In this case, the group_values would also need to support "remove the first N values" even if we removed the row format from the partial-order state.
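
For reference, a sketch of that "remove the first n" operation on the row-format group values, matching the copy-based approach in the quoted TODO (arrow's `Rows` has no drain API):

```rust
use arrow::row::{RowConverter, Rows};

/// Rebuild the group keys keeping only rows n.. (a sketch; the real
/// code lives inline in the aggregation stream)
fn remove_first_n(converter: &RowConverter, group_values: &Rows, n: usize) -> Rows {
    // capacity hints: the number of surviving rows; data grows as needed
    let mut remaining = converter.empty_rows(group_values.num_rows() - n, 0);
    for row in group_values.iter().skip(n) {
        remaining.push(row);
    }
    remaining
}
```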

pub fn new_groups(
&mut self,
group_indices: &[usize],
batch_hashes: &[u64],

@tustvold (reviewer):

In the case of GroupOrderingFull it seems unnecessary to compute hashes at all -- couldn't we just group based on when the sort key changes?

@alamb (author):

That is a (very) good point -- noted in #7023 for follow on work

///
/// 3. Call `delta.inc(size_thing.size())`
#[derive(Debug, Default)]
pub struct MemoryDelta {

@tustvold (Contributor) commented Jul 19, 2023:

In https://github.com/apache/arrow-datafusion/pull/7016/files#r1267407414 I opted to just remove the delta-based accounting, as I couldn't see a compelling reason to keep it around: it was computing total memory usage, using that to compute deltas, and then using the deltas to update the total memory usage.

@alamb (author) commented Jul 19, 2023:

The reason there was delta accounting is that in previous incarnations, calculating the size of the overall grouping operator was a significant bottleneck (for groupings with a large number of distinct groups).

Specifically Accumulator::size() showed up on a bunch of profiles.

@alamb (author):

However, now that you mention it, the delta accounting is now done per-Accumulator in the adapter, so the main group by hash operator can probably drop the delta accounting.

Here is my proposal:

  1. I will add some comments on the rationale for delta accounting to this PR
  2. I merge this PR
  3. You can remove the delta accounting from the main GroupByHash operator in Extract GroupValues (#6969) #7016, or I will do so, and we can run some benchmarks to make sure it doesn't have a performance impact

@alamb (author):

Comments in e7dc2ae

@tustvold (reviewer):

Given that #7016 (comment) showed no regression, I will opt to simply remove it.

@alamb (author) left a comment:

Thank you for the comments @tustvold -- they were very helpful.


@alamb alamb merged commit 1810a15 into apache:main Jul 19, 2023
@alamb alamb deleted the alamb/consolidated_streaming branch July 19, 2023 10:50
Merging this pull request closed: Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream (#6798)