-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make CommonSubexprEliminate
faster by stop copying so many strings
#10426
Comments
I'm happy to take this. |
Are there any potential issues with simply using the existing Serveral other optimization passes use string names as keys for expressions in data structures. I am wondering if any of these could also be refactored to simply use synthetic group by expressions for aggregates: datafusion/datafusion/expr/src/logical_plan/builder.rs Lines 1246 to 1270 in accce97
functional dependencies heavily uses decorrelate:
push down filter for aggregates: datafusion/datafusion/optimizer/src/push_down_filter.rs Lines 788 to 837 in accce97
single distinct to group by: datafusion/datafusion/optimizer/src/single_distinct_to_groupby.rs Lines 69 to 96 in accce97
|
Thanks for these references @erratic-pattern. Background and general thoughts:I'm only familiar with CSE code and in its case unfortunately non-unique stringified expression were used as keys of the map that stores the occurrance counts. This bug was introduced in #9871 and reverted in #10396. The issue with these colliding string keys are explained here in details: #10333 (comment). Some thougths about CSE:After #10396 we still use stringified expressions as keys ( In case of CSE we could use We could also use
This issue can be solved by adding new Despite the fact that my WIP commit adds new APIs, I would prefer and lean towards option b.. But since I'm only aware of this ticket that requires this change to the APIs, I haven't opened the PR yet. Now there is another thing to consider if we want use
This is very nice, because the second, rewriting traversal can use the preorder visit index again to look up the identifier first and then the count from the
In my work in progress change for this issue I would like to finalize the:
Back to the original question of using
|
Thanks for the detailed write up @peter-toth . Though I did mention I like the idea of generalizing the
Anyway, I don't want to over-abstract just yet, so for now just build something that works for CSE and then we can take it and see if it can be applied to any of the other optimizations. I am curious if overriding |
Honestly, I don't know those referenced usecases, but I feel Anyways, I will try to open the PR with it next week and then feel free to generalize the idea for other usecases if it makes sense. |
I've opened a draft PR: #10473 and will try to wrap it up in the following days. |
Here is one example API that I would love to implement with such a tree-node api: #10505 I also ran into an example when trying to find embedded datafusion/datafusion/optimizer/src/scalar_subquery_to_join.rs Lines 54 to 68 in 424757f
|
Thanks for sharing this @alamb. It's good to know that there are other possible usecases for this new API. #10473 seems to pass all tests now. I will extract the first commit of it into a separate PR today or tomorrow to add the new TreeNode API. |
FWIW I've also seen the high cost of expression string formatting (using I think there's nothing wrong about having a "display" infrastructure, but it shouldn't be used eagerly. As others pointed out, using a hash or any form of numeric ID would probably be better in many places. |
100% -- btw #10454 from @erratic-pattern made this code faster (fewer allocations) though it would be better still as you point out to not use display as much. I will say from personal experience working on postgres / postgres derived systems (which does use a numeric id to identify columns), using strings is much easier to debug when problems occur. I do think we can reduce it significantly however |
@crepererum I am working on moving away from string allocations in a number of the optimization rules and switching to Most of these use the I would be interested in seeing the profile data you mentioned, especially those that use Also since we're no longer talking about strictly |
I can see if I get organize you some profiles next week 🙂 |
CommonSubexprEliminate
faster by avoiding the use of stringsCommonSubexprEliminate
faster by stopping the use of strings
CommonSubexprEliminate
faster by stopping the use of stringsCommonSubexprEliminate
faster by stop copying so many strings
Is your feature request related to a problem or challenge?
Part of #5637
One of the optimizer passes is "common subexpression elimination" that removes redundant computation
However, as @peter-toth noted on #10396 and the CSE code says
datafusion/datafusion/optimizer/src/common_subexpr_eliminate.rs
Lines 108 to 119 in d58bae4
The way it tracks common subexpressions is with string manipulation is is non ideal for several reasons (including the cost of creating those strings)
Describe the solution you'd like
Revisit the identifiers as using these string identifiers as the keys of
ExprStats
was not the best choice. Please note this is how CSE has been working since the feature was added initially.Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: