[SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs #29950
indentation?
We could extend to other cases like `case p @ Project(_, agg: Aggregate)`, but leave it untouched for now.
Can we not share the code between this and `moreThanMaxAllowedCommonOutput`?
The two places look similar, but the parameters are slightly different. We could make them share the same code, but it is only a few lines and the refactoring would need more changes, so it does not seem worth it to me.
indentation? It seems that there is one more space here.
nit: you may want to move `||` into line 157.
sure. thanks.
If we set this value to 1, can all the existing tests pass?
I guess not. We might have at least a few common expressions in a collapsed projection. If it is set to 1, no duplicated expression is allowed.
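To make the notion of a "common expression" concrete, here is a minimal sketch of how the repetition could be counted. It uses a toy expression tree rather than Catalyst's classes; `Expr`, `Attr`, `Add`, `maxCommonExprs`, and `canCollapse` are hypothetical names, and the `<=` comparison is an assumption that may not match the PR's exact check.

```scala
object CommonExprSketch {
  // Toy expression tree for illustration only; not Catalyst's Expression hierarchy.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Every attribute name referenced by an expression, with repeats kept.
  def references(e: Expr): List[String] = e match {
    case Attr(n)   => List(n)
    case Add(l, r) => references(l) ++ references(r)
  }

  // Largest number of times any single alias of the lower Project is referenced
  // by the upper project list, i.e. how often its expression would be inlined.
  def maxCommonExprs(upper: Seq[Expr], lowerAliases: Set[String]): Int = {
    val counts = upper.flatMap(references).filter(lowerAliases).groupBy(identity).values.map(_.size)
    if (counts.isEmpty) 0 else counts.max
  }

  // Assumed check: collapse only while the repetition stays within the limit.
  def canCollapse(upper: Seq[Expr], lowerAliases: Set[String], maxAllowed: Int): Boolean =
    maxCommonExprs(upper, lowerAliases) <= maxAllowed
}
```

Under this model, an upper project list `Seq(Add(Attr("b"), Attr("b")))` references the lower alias `b` twice, so a limit of 1 blocks the collapse, while `Seq(Attr("b"))` is still allowed.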
`a common expression` -> `common input attributes`?
nit: `the maximum allowed number of a common expression can be collapsed into upper Project from lower Project ...` => `the maximum allowed number of common input attributes when collapsing adjacent Projects ...`?
Not exactly the same, but I revised the doc.
(Just a comment) Even if we set `spark.sql.optimizer.excludedRules` to `CollapseProject`, it seems like Spark still respects this value in `ScanOperation`? That behaviour might be okay, but it looks a bit weird to me.
Yeah, but currently if we exclude `CollapseProject`, `ScanOperation` will still work and collapse projections. Maybe update this doc?
hm I see. Yea, updating the doc sounds nice to me.
Just a question. Is there a reason to choose `20`?
No, I just picked a number that seems bad for repeating an expression.
If possible, can we introduce this configuration with `Int.MaxValue` in `3.1.0` first? We can reduce it later.
Sure. It is safer.
+1
indentation? Maybe, the following is better?
When using the Dataset API, it would be very common to chain `withColumn` calls:

In that case the query should look more like this:

The `CollapseProject` rule uses `transformUp`. It seems that in that case we do not get the expected results from this optimization.
It seems this can be fixed by using `transformDown` instead? It seems to me that `CollapseProject` does not necessarily need to use `transformUp`, if I'm not missing anything. cc @cloud-fan @maropu
If there is a chain of projects: `P1(P2(P3(P4(...))))`, then using `transformDown` will first merge `P1` and `P2` into `P12`, and then it will go to its child `P3` and merge it with `P4` into `P34`. Only on the second iteration will it merge all 4 of these.

In this case we want to merge `P123` and then see that we can't merge with `P4` because we would exceed `maxCommonExprsInCollapseProject`.
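The pairing behaviour described above can be reproduced with a toy model. This is only a sketch: `Plan`, `Project`, `Leaf`, and the `collapse` rule are simplified stand-ins (a Project carries just a label), and `transformDown` / `transformUp` here are hand-written pre-order and post-order traversals meant to mirror the order in which Catalyst applies a rule, not Catalyst's own implementation.

```scala
object TraversalSketch {
  // Simplified plan nodes; a Project only carries a label such as "1" or "12".
  sealed trait Plan
  case object Leaf extends Plan
  final case class Project(label: String, child: Plan) extends Plan

  // Collapse rule: merge a Project with the Project directly below it.
  val collapse: Plan => Plan = {
    case Project(upper, Project(lower, child)) => Project(upper + lower, child)
    case other                                 => other
  }

  // Pre-order: rewrite the node first, then descend into the children of the result.
  def transformDown(plan: Plan)(rule: Plan => Plan): Plan = rule(plan) match {
    case Project(label, child) => Project(label, transformDown(child)(rule))
    case Leaf                  => Leaf
  }

  // Post-order: rewrite the children first, then the node itself.
  def transformUp(plan: Plan)(rule: Plan => Plan): Plan = plan match {
    case Project(label, child) => rule(Project(label, transformUp(child)(rule)))
    case Leaf                  => rule(Leaf)
  }

  // P1(P2(P3(P4(...)))) from the comment above.
  val chain: Plan = Project("1", Project("2", Project("3", Project("4", Leaf))))
}
```

In this model a single `transformDown(chain)(collapse)` pass gives `Project("12", Project("34", Leaf))`, while a single `transformUp(chain)(collapse)` pass gives `Project("1234", Leaf)`, because the bottom-up pass merges from the lowest pair upwards.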
I think that the correct way would be to use `transformDown` in a similar manner to `recursiveRemoveSort` in #21072.

So basically, when you hit the first `Project`, you collect all consecutive `Projects` until you hit the `maxCommonExprsInCollapseProject` limit and merge them.
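A rough sketch of that recursive, top-down idea follows. It is not the PR's code: the plan types are toy stand-ins, and `refs` is a made-up cost standing for how many times a Project references the expensive output of its child, so that merging an upper Project into a lower one inlines the lower expression `refs` times. The `<=` comparison against the limit is also an assumption.

```scala
object RecursiveCollapseSketch {
  sealed trait Plan
  case object Leaf extends Plan
  // `labels` records which original Projects were merged into this node;
  // `refs` is a hypothetical cost, not a Catalyst concept.
  final case class Project(labels: List[String], refs: Int, child: Plan) extends Plan

  def collapse(plan: Plan, maxCommonExprs: Int): Plan = plan match {
    // Merging inlines the lower expression `upperRefs` times; keep absorbing
    // consecutive Projects while that stays within the limit.
    case Project(upper, upperRefs, Project(lower, lowerRefs, child)) if upperRefs <= maxCommonExprs =>
      collapse(Project(upper ++ lower, upperRefs * lowerRefs, child), maxCommonExprs)
    // Limit reached, or no Project directly below: keep this node and start a new run under it.
    case Project(labels, refs, child) =>
      Project(labels, refs, collapse(child, maxCommonExprs))
    case Leaf => Leaf
  }

  // Helper building P1(P2(...Pn(Leaf))) where every Project references its child twice.
  def chainOf(n: Int): Plan =
    (n to 1 by -1).foldLeft(Leaf: Plan)((child, i) => Project(List(s"P$i"), 2, child))
}
```

With `chainOf(4)` and a limit of 4, this merges `P1`, `P2`, and `P3` into one node and leaves `P4` separate, in a single pass and without relying on a second optimizer iteration.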
hm, it sounds fine, too. Rather, it seems a top-down transformation can collapse projects in one shot just like `RemoveRedundantProjects`?
Seems like we need to change to `transformDown` and take a recursive approach like `RemoveRedundantProjects` and `recursiveRemoveSort` for collapsing `Project`.