-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refine the size() calculation of accumulator #5904
Conversation
@@ -266,7 +266,7 @@ impl Accumulator for SumAccumulator { | |||
} | |||
|
|||
fn size(&self) -> usize { | |||
std::mem::size_of_val(self) - std::mem::size_of_val(&self.sum) + self.sum.size() | |||
std::mem::size_of_val(self) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this size change after the AvgAccumulator /SumAccumulator struct is initialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest rename the size
method to size_in_bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not. Option is of Enum type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the flamegraph, it seems it's not a bottleneck now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs
/// Allocated size required for this accumulator, in bytes, including `Self`.
/// Allocated means that for internal containers such as `Vec`, the `capacity` should be used
/// not the `len`
fn size(&self) -> usize;
The change in this PR seems to avoid extra allocations in ScalarValue (such as ScalarValue::Utf8
which has an allocated string in it)
.fold(acc, |acc, accumulator| acc + accumulator.size()) | ||
}) | ||
} | ||
|
||
/// The state that is built for each output group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic is quite complex to collect the memory size of the accumulators, maybe the computation is more than the real useful aggregations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree the calculation for the memory size is complicated.
I think @tustvold has been thinking about how to improve performance in this area, but I am not sure how far he has gotten
In general, managing individual allocations (and then accounting for their sizes) for each group is a significant additional overhead for grouping.
Hi @Dandandan, @alamb, could you help review this PR? |
I will try and review this carefully later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yahoNanJing
The idea of caching the accumulator sizes seems like a good one.
I am not sure about the changes to Sum
and Avg
accumulators
Maybe we need to take a step back and figure out how to improve grouping performance more holistically rather than trying to do a special optimization on allocation accounting 🤔
For example #4973
@@ -266,7 +266,7 @@ impl Accumulator for SumAccumulator { | |||
} | |||
|
|||
fn size(&self) -> usize { | |||
std::mem::size_of_val(self) - std::mem::size_of_val(&self.sum) + self.sum.size() | |||
std::mem::size_of_val(self) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs
/// Allocated size required for this accumulator, in bytes, including `Self`.
/// Allocated means that for internal containers such as `Vec`, the `capacity` should be used
/// not the `len`
fn size(&self) -> usize;
The change in this PR seems to avoid extra allocations in ScalarValue (such as ScalarValue::Utf8
which has an allocated string in it)
}) | ||
})?; | ||
} | ||
AggregateMode::FinalPartitioned | AggregateMode::Final => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe github is rendering the diff confusingly, but this seems like a significant amount of new code
.fold(acc, |acc, accumulator| acc + accumulator.size()) | ||
}) | ||
} | ||
|
||
/// The state that is built for each output group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree the calculation for the memory size is complicated.
I think @tustvold has been thinking about how to improve performance in this area, but I am not sure how far he has gotten
In general, managing individual allocations (and then accounting for their sizes) for each group is a significant additional overhead for grouping.
Thanks @alamb for your comments. Just refactored the code based on latest main branch code. |
I am trying to clean up outstanding PRs and I came across this one. What shall we do with it @yahoNanJing -- should we pursue getting it merged? |
Marking as Draft until we come to a consensus on what to do with this PR (so it is not on the review list) |
Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it. |
Which issue does this PR close?
Closes #5903.
Rationale for this change
From the flame graph generated by
CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --no-inline --freq 500 --bin tpch -- benchmark datafusion --path ./data-parquet/ --format parquet --partitions 1 -q 17 --iterations 10
There are around 4% spent on calling the size() of accumulator, which can be improved.
What changes are included in this PR?
ScalarValue
's size().accumulator_set
.After applying this PR, from the flame graph, we can see there's almost no cost on the size() of accumulator. The benchmark result is improved from 11s to 10.5s.
Are these changes tested?
Are there any user-facing changes?