
feat: Support spilling for hash aggregation #7400

Merged (21 commits) on Sep 15, 2023

Conversation

kazuyukitanimura (Contributor)
Which issue does this PR close?

Closes #1570

Rationale for this change

For hash aggregation, the group values and accumulators can grow large enough to cause out-of-memory errors.

What changes are included in this PR?

This PR lets the hash aggregation operator spill large data to local disk using the Arrow IPC format. For every input RecordBatch, the memory manager checks whether the new input size fits within the memory configuration. If not, the operator spills, and a streaming merge sort later reads the spilled data back. The streaming merge sort function is reused from the sort operator.
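The spill-then-merge flow described above can be sketched in miniature. This is a hypothetical, self-contained illustration, not DataFusion code: `SpillingAggregator` and its in-memory `Vec` "spill files" are stand-ins for the real operator's memory reservation, Arrow IPC spill files, and streaming merge.

```rust
use std::collections::BTreeMap;

// Illustrative sketch only: a grouping-sum aggregator that "spills" its
// sorted in-memory state whenever a (toy) memory limit is reached, then
// merges all sorted runs at the end.
struct SpillingAggregator {
    memory_limit: usize,             // max number of in-memory groups (toy memory manager)
    current: BTreeMap<String, i64>,  // in-memory partial aggregates, kept sorted by key
    spills: Vec<Vec<(String, i64)>>, // spilled runs, each sorted (stand-in for IPC files)
}

impl SpillingAggregator {
    fn new(memory_limit: usize) -> Self {
        Self { memory_limit, current: BTreeMap::new(), spills: Vec::new() }
    }

    // For every input row, check the memory limit; spill the current state if
    // admitting a new group would exceed it.
    fn update(&mut self, key: &str, value: i64) {
        if !self.current.contains_key(key) && self.current.len() >= self.memory_limit {
            let run: Vec<_> = std::mem::take(&mut self.current).into_iter().collect();
            self.spills.push(run);
        }
        *self.current.entry(key.to_string()).or_insert(0) += value;
    }

    // Read the sorted runs back and combine groups with equal keys
    // (stand-in for the streaming merge sort reused from the sort operator).
    fn finish(mut self) -> Vec<(String, i64)> {
        let mut merged: BTreeMap<String, i64> = BTreeMap::new();
        let last: Vec<_> = std::mem::take(&mut self.current).into_iter().collect();
        for run in self.spills.into_iter().chain(std::iter::once(last)) {
            for (k, v) in run {
                *merged.entry(k).or_insert(0) += v;
            }
        }
        merged.into_iter().collect()
    }
}
```

Because each run is already sorted, equal keys from different runs meet during the merge, which is what makes read-back aggregation possible without holding all groups in memory at once.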

Are these changes tested?

Yes

Are there any user-facing changes?

No

kazuyukitanimura (Contributor, Author)

Waiting for #7399 for the fmt issues

alamb (Contributor) commented Aug 24, 2023

Thank you @kazuyukitanimura -- I plan to review this PR tomorrow. I am very excited!

yjshen (Member) left a comment


Thanks @kazuyukitanimura for working on this! This is one of the last few missing cornerstones in DataFusion. Cheers!

kazuyukitanimura (Contributor, Author)

Just FYI, I will be traveling the entire next week. I may not be able to respond in time, but I plan to address comments the week after (9/5~).
Thank you in advance.
cc @sunchao @viirya

alamb (Contributor) left a comment


Thank you very much @kazuyukitanimura -- I think this is really nicely written. I left some comments for your consideration. I think we could simplify some of the traits and memory accounting and still keep most of the benefits of this PR.

I also would love to test this myself locally, but it occurs to me I can't actually do so because the datafusion CLI doesn't have any way to do it. I'll file a ticket.

From my perspective, the only things required before merging this PR are:

  1. Performance tests showing it doesn't slow down performance (I don't expect it to, and I could help with this)
  2. A few more tests that exercise the spill path after aggregating more than one batch of data (I have left a suggestion on how to do that with FairSpillPool)

Again, really nice work and thank you very much for the contribution

kazuyukitanimura (Contributor, Author)

Thank you @alamb for reviewing. I plan to work on addressing comments a week after next (9/5~) once I am back from my travel.

  1. Performance tests showing it doesn't slow down performance (I don't expect it to and I could help with this)

Regarding the benchmark, what would be the best way to proceed? I read https://github.com/apache/arrow-datafusion/blob/main/benchmarks/README.md
Is bench.sh run tpch a good fit for this PR's purpose? I am wondering if there is a GitHub Action / CI so that we can use the same machine for benchmarking...

alamb (Contributor) commented Aug 26, 2023

> Is bench.sh run tpch a good one for this PR purpose? I am wondering if there is a github action / CI so that we can use the same machine for benchmarking...

This is what I recommend.

I have some scripts in https://github.com/alamb/datafusion-benchmarking that I use to compare a branch to main, but they aren't super user friendly at the moment. I'll run this branch on a GCP machine and report back.

Thanks again

alamb (Contributor) commented Aug 29, 2023

Marking as draft to signify this PR has gotten feedback and is waiting to incorporate it before subsequent review

@alamb alamb marked this pull request as draft August 29, 2023 11:40
@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review September 12, 2023 10:22
kazuyukitanimura (Contributor, Author)

Thank you all for the reviews! I think I addressed all of them. @alamb @yjshen @sunchao @viirya
There are some TODOs I plan to follow up on in future PRs. Any further improvements to this PR I plan to tack onto the next PR.

alamb (Contributor) commented Sep 15, 2023

> Thank you all for the reviews! I think addressed all of them. @alamb @yjshen @sunchao @viirya
> There are some TODOs I plan to follow up in the future PRs. For any further improvements on this PR, I plan to tack on the next PR.

Thank you @kazuyukitanimura - this is a really nice piece of technology, as well as a nice example of collaboration. I agree let's merge this and continue working on main / follow on PRs.

I filed #7571 to track adding spilling group benchmarks

@alamb alamb merged commit f1f8d79 into apache:main Sep 15, 2023
alamb (Contributor) commented Sep 15, 2023

> QQuery 15
>
> What's the reason behind QQuery 15 +2.36x faster?

@Dandandan I believe the reason is that Q15 is so fast (15 ms!) that it is susceptible to small perturbations, and the difference is largely measurement error:

│ QQuery 15 │ 13.57ms │ 14.29ms │ 1.05x slower │

kazuyukitanimura (Contributor, Author)

Great, thank you all again.

];
let expected = if spill {
    vec![
        "+---+-----+-----------------+",
jayzhan211 (Contributor) commented Sep 17, 2024


@kazuyukitanimura Hi, do you remember why the result of spill is different from the non-spill one?

A contributor replied:

I think the reason is that we are comparing the output of a partial aggregate, and when spilling we also have a lower desired batch size and hit the emit-early logic:

fn emit_early_if_necessary(&mut self) -> Result<()> {
    if self.group_values.len() >= self.batch_size
        && matches!(self.group_ordering, GroupOrdering::None)
        && matches!(self.mode, AggregateMode::Partial)
        && self.update_memory_reservation().is_err()
    {
        let n = self.group_values.len() / self.batch_size * self.batch_size;
        let batch = self.emit(EmitTo::First(n), false)?;
        self.exec_state = ExecutionState::ProducingOutput(batch);
    }
    Ok(())
}

viirya (Member) commented Sep 17, 2024

That's right. When spilling happens, it means we don't have enough memory to hold the original batch, so we do partial aggregation on smaller batches. The partial aggregation results will differ, but I think that doesn't change the final aggregation result.
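This invariant can be demonstrated with a small sketch. It is an illustration of the principle, not DataFusion code: `partial_sums` and `final_merge` are hypothetical names standing in for the Partial and Final aggregation phases.

```rust
use std::collections::BTreeMap;

// Partially aggregate each batch independently, like a Partial-mode
// aggregation: one map of per-group sums per input batch.
fn partial_sums(batches: &[Vec<(&str, i64)>]) -> Vec<BTreeMap<String, i64>> {
    batches
        .iter()
        .map(|batch| {
            let mut m = BTreeMap::new();
            for (k, v) in batch {
                *m.entry((*k).to_string()).or_insert(0) += v;
            }
            m
        })
        .collect()
}

// Final aggregation: merge all partial results by group key.
fn final_merge(partials: Vec<BTreeMap<String, i64>>) -> BTreeMap<String, i64> {
    let mut out = BTreeMap::new();
    for p in partials {
        for (k, v) in p {
            *out.entry(k).or_insert(0) += v;
        }
    }
    out
}
```

Splitting the same rows into more, smaller batches (as happens when spilling lowers the desired batch size) changes the per-batch partial output but not the merged final result, which is why only the partial-stage test expectations differ between the spill and non-spill cases.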

Labels: core (Core DataFusion crate), enhancement (New feature or request), physical-expr (Physical Expressions)

Successfully merging this pull request may close these issues.

Memory Limited GroupBy (Externalized / Spill)
9 participants