-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-41959][SQL] Improve v1 writes with empty2null #39475
Conversation
@@ -91,32 +92,31 @@ object V1Writes extends Rule[LogicalPlan] with SQLConfHelper { | |||
} | |||
|
|||
private def prepareQuery(write: V1WriteCommand, query: LogicalPlan): LogicalPlan = { | |||
val hasEmpty2Null = query.exists(p => hasEmptyToNull(p.expressions)) | |||
val empty2NullPlan = if (hasEmpty2Null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check has no meaning now since we have already checked WriteFiles
for rule idempotency.
// Here we use the original required ordering to compare if matches so that we can preserve the | ||
// user-specified sort ordering with such case: | ||
// `INSERT INTO TABLE t PARTITION(p) SELECT c, p FROM t SORT BY p, c`. | ||
val requiredOrdering = write.requiredOrdering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the key change that use the original required ordering before pulling out Empty2Null
can you update the test added in #39431 ? |
82e20f1
to
f14183c
Compare
sure, updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm that for the example in #38356 master writes
!WriteFiles
+- *(1) Sort [p#19 ASC NULLS FIRST], false, 0
+- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
+- ShuffleQueryStage 0
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
+- LocalTableScan [id#10, sort_col#11, p#12]
while this fix writes
!WriteFiles
+- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
+- *(1) Sort [p#12 ASC NULLS FIRST, sort_col#11 ASC NULLS FIRST], false, 0
+- ShuffleQueryStage 0
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
+- LocalTableScan [id#10, sort_col#11, p#12]
LGTM!
val requiredOrdering = write.requiredOrdering.map(_.transform { | ||
case a: Attribute => attrMap.getOrElse(a, a) | ||
}.asInstanceOf[SortOrder]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I was half way there: #38356 (comment)
Only adding this back into the else
branch below to get to the same Sort
was missing in my fix.
@@ -159,7 +159,6 @@ object FileFormatWriter extends Logging { | |||
val actualOrdering = writeFilesOpt.map(_.child) | |||
.getOrElse(materializeAdaptiveSparkPlan(plan)) | |||
.outputOrdering.map(_.child) | |||
val orderingMatched = V1WritesUtils.isOrderingMatched(requiredOrdering, actualOrdering) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this line doesn't work anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not add the sort in rule V1Writes
since the ordering matches. So the actual ordering is the user-specified which does not rewrite the attribute using the new query that wrapped with Empty2Null
. But the required ordering has rewritten in rule V1Writes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ordering matches with user-specified:
Project (Empty2Null(p) #2)
Sort (p#1)
if ordering does not match and V1Write adds a sort:
Sort(p #2)
Project (Empty2Null(p) #2)
|
@cloud-fan fix in #39468 |
val outputOrdering = query.outputOrdering | ||
// Check if the ordering is already matched to ensure the idempotency of the rule. | ||
val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering) | ||
if (orderingMatched) { | ||
empty2NullPlan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we don't add the sort here, the caller will rewrite the attributes in v1 command which will update required orderng expression as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's the root reason in FileFormatWriter
why the previous isOrderingMatched
does not work when planned write enabled. The attributes in required ordering have been rewritten but the acutal ordering have not.
@cloud-fan @ulysses-you I can confirm with #39468, the plan for the example now looks like
@ulysses-you can you please rebase with master or merge master into this branch |
…and AliasAwareQueryOutputOrdering to take all aliases into account ### What changes were proposed in this pull request? Currently `AliasAwareOutputPartitioning` and `AliasAwareQueryOutputOrdering` takes only the last alias by aliased expressions into account. We could avoid some extra shuffles and sorts with better alias handling. ### Why are the changes needed? Performance improvement and this also fix the issue in #39475. ### Does this PR introduce _any_ user-facing change? Yes, this PR fixes the issue in #39475. ### How was this patch tested? Added new UT. Closes #37525 from peter-toth/SPARK-40086-fix-aliasawareoutputexpression. Lead-authored-by: Peter Toth <[email protected]> Co-authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 6341b06) Signed-off-by: Wenchen Fan <[email protected]>
…and AliasAwareQueryOutputOrdering to take all aliases into account ### What changes were proposed in this pull request? Currently `AliasAwareOutputPartitioning` and `AliasAwareQueryOutputOrdering` takes only the last alias by aliased expressions into account. We could avoid some extra shuffles and sorts with better alias handling. ### Why are the changes needed? Performance improvement and this also fix the issue in #39475. ### Does this PR introduce _any_ user-facing change? Yes, this PR fixes the issue in #39475. ### How was this patch tested? Added new UT. Closes #37525 from peter-toth/SPARK-40086-fix-aliasawareoutputexpression. Lead-authored-by: Peter Toth <[email protected]> Co-authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
f14183c
to
099ac9f
Compare
@cloud-fan any comments ? thank you |
thanks, merging to master/3.4 (code cleanup for planned write) |
### What changes were proposed in this pull request? Cleanup some unnecessary `Empty2Null` related code ### Why are the changes needed? V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI Closes #39475 from ulysses-you/SPARK-41959. Authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 547737b) Signed-off-by: Wenchen Fan <[email protected]>
…and AliasAwareQueryOutputOrdering to take all aliases into account ### What changes were proposed in this pull request? Currently `AliasAwareOutputPartitioning` and `AliasAwareQueryOutputOrdering` takes only the last alias by aliased expressions into account. We could avoid some extra shuffles and sorts with better alias handling. ### Why are the changes needed? Performance improvement and this also fix the issue in apache#39475. ### Does this PR introduce _any_ user-facing change? Yes, this PR fixes the issue in apache#39475. ### How was this patch tested? Added new UT. Closes apache#37525 from peter-toth/SPARK-40086-fix-aliasawareoutputexpression. Lead-authored-by: Peter Toth <[email protected]> Co-authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 6341b06) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? Cleanup some unnecessary `Empty2Null` related code ### Why are the changes needed? V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI Closes apache#39475 from ulysses-you/SPARK-41959. Authored-by: ulysses-you <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 547737b) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
Cleanup some unnecessary
Empty2Null
related codeWhy are the changes needed?
V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again.
Does this PR introduce any user-facing change?
no
How was this patch tested?
pass CI