Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate and reduce runtime type coercion in aggregates like sum #2447

Closed
Tracked by #2355
alamb opened this issue May 5, 2022 · 6 comments
Closed
Tracked by #2355

Investigate and reduce runtime type coercion in aggregates like sum #2447

alamb opened this issue May 5, 2022 · 6 comments
Assignees
Labels
datafusion Changes in the datafusion crate enhancement New feature or request help wanted Extra attention is needed

Comments

@alamb
Copy link
Contributor

alamb commented May 5, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There is casting logic in aggregates that handles coercing inputs to aggregates

https://github.com/apache/arrow-datafusion/blob/6b4bbd0/datafusion/physical-expr/src/aggregate/sum.rs#L224-L316

On the surface doing these casts in sum.rs appears to duplicates some non trivial amount of the logic in plan timecoercion -- maybe it would be possible to make this code cleaner / consolidate more of the coercion logic.

Describe the solution you'd like
Ensure types are known prior to executing the aggregate so that the input and aggregate types are known aprior

Describe alternatives you've considered
Not sure (maybe the code is needed, it just "feels" a bit wrong)

Additional context

#2405 (comment)
cc @WinkerDu

@alamb alamb added enhancement New feature or request help wanted Extra attention is needed datafusion Changes in the datafusion crate labels May 5, 2022
@comphead
Copy link
Contributor

comphead commented May 6, 2022

@alamb I can take this if you dont mind

@andygrove
Copy link
Member

I added this to #2355

@comphead
Copy link
Contributor

comphead commented May 8, 2022

the same pattern used in min_max.rs.
The common approach to get the underlying value from ScalarValue is through patmatch, and that causes a lot of similar code branches to operate with underlying values supporting different datatypes.

I'm trying to figure out if sum_batch function can be reused instead of boilerplate sum. In fact sum is specific case of sum_batch

@alamb
Copy link
Contributor Author

alamb commented May 8, 2022

I wonder if we can take inspiration from @yjshen 's approach in #2375 (using row formats, etc). I don't have any specific suggestions but thought it might be interesting to look

@comphead
Copy link
Contributor

@alamb sorry for delay, I stuck investigating all that datatypes.
Please check the draft #2516
So idea is to reuse sum_batch function instead of boilerplate sum.
I have to do type casting to ensure all types are the same. it might be looking a concern for the performance. Please let me know your thoughts

@alamb
Copy link
Contributor Author

alamb commented Jun 11, 2023

I think this I think this ticket is no longer tracking anything actionable -- I expect more performance improvements from #4973

@alamb alamb closed this as completed Jun 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants