Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41959][SQL] Improve v1 writes with empty2null #39475

Closed
wants to merge 1 commit into from

Conversation

ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Jan 10, 2023

What changes were proposed in this pull request?

Cleanup some unnecessary Empty2Null related code

Why are the changes needed?

V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again.

Does this PR introduce any user-facing change?

no

How was this patch tested?

pass CI

@github-actions github-actions bot added the SQL label Jan 10, 2023
@@ -91,32 +92,31 @@ object V1Writes extends Rule[LogicalPlan] with SQLConfHelper {
}

private def prepareQuery(write: V1WriteCommand, query: LogicalPlan): LogicalPlan = {
val hasEmpty2Null = query.exists(p => hasEmptyToNull(p.expressions))
val empty2NullPlan = if (hasEmpty2Null) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check has no meaning now since we have already checked WriteFiles for rule idempotency.

// Here we use the original required ordering to compare if matches so that we can preserve the
// user-specified sort ordering with such case:
// `INSERT INTO TABLE t PARTITION(p) SELECT c, p FROM t SORT BY p, c`.
val requiredOrdering = write.requiredOrdering
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the key change that use the original required ordering before pulling out Empty2Null

@ulysses-you
Copy link
Contributor Author

cc @cloud-fan @EnricoMi

@cloud-fan
Copy link
Contributor

can you update the test added in #39431 ?

@ulysses-you
Copy link
Contributor Author

can you update the test added in #39431 ?

sure, updated

Copy link
Contributor

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that for the example in #38356 master writes

!WriteFiles
+- *(1) Sort [p#19 ASC NULLS FIRST], false, 0
   +- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
      +- ShuffleQueryStage 0
         +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
            +- LocalTableScan [id#10, sort_col#11, p#12]

while this fix writes

!WriteFiles
+- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
   +- *(1) Sort [p#12 ASC NULLS FIRST, sort_col#11 ASC NULLS FIRST], false, 0
      +- ShuffleQueryStage 0
         +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
            +- LocalTableScan [id#10, sort_col#11, p#12]

LGTM!

Comment on lines 105 to 109
val requiredOrdering = write.requiredOrdering.map(_.transform {
case a: Attribute => attrMap.getOrElse(a, a)
}.asInstanceOf[SortOrder])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I was half way there: #38356 (comment)
Only adding this back into the else branch below to get to the same Sort was missing in my fix.

@@ -159,7 +159,6 @@ object FileFormatWriter extends Logging {
val actualOrdering = writeFilesOpt.map(_.child)
.getOrElse(materializeAdaptiveSparkPlan(plan))
.outputOrdering.map(_.child)
val orderingMatched = V1WritesUtils.isOrderingMatched(requiredOrdering, actualOrdering)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this line doesn't work anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not add the sort in rule V1Writes since the ordering matches. So the actual ordering is the user-specified which does not rewrite the attribute using the new query that wrapped with Empty2Null. But the required ordering has rewritten in rule V1Writes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ordering matches with user-specified:

Project (Empty2Null(p) #2)
  Sort (p#1)

if ordering does not match and V1Write adds a sort:

Sort(p #2)
  Project (Empty2Null(p) #2)

@cloud-fan
Copy link
Contributor

!WriteFiles
+- *(1) Project [id#10, sort_col#11, empty2null(p#12) AS p#19]
   +- *(1) Sort [p#12 ASC NULLS FIRST, sort_col#11 ASC NULLS FIRST], false, 0
      +- ShuffleQueryStage 0
         +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=18]
            +- LocalTableScan [id#10, sort_col#11, p#12]

!WriteFiles means something is wrong. @ulysses-you can you check if the expr id is correctly rewritten?

@ulysses-you
Copy link
Contributor Author

!WriteFiles means something is wrong. @ulysses-you can you check if the expr id is correctly rewritten?

@cloud-fan fix in #39468

val outputOrdering = query.outputOrdering
// Check if the ordering is already matched to ensure the idempotency of the rule.
val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
if (orderingMatched) {
empty2NullPlan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we don't add the sort here, the caller will rewrite the attributes in v1 command which will update required orderng expression as well, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's the root reason in FileFormatWriter why the previous isOrderingMatched does not work when planned write enabled. The attributes in required ordering have been rewritten but the acutal ordering have not.

@EnricoMi
Copy link
Contributor

@cloud-fan @ulysses-you I can confirm with #39468, the plan for the example now looks like

WriteFiles
+- *(1) Project [id#30, sort_col#31, empty2null(p#32) AS p#39]
   +- *(1) Sort [p#32 ASC NULLS FIRST, sort_col#31 ASC NULLS FIRST], false, 0
      +- ShuffleQueryStage 0
         +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=68]
            +- LocalTableScan [id#30, sort_col#31, p#32]

@ulysses-you can you please rebase with master or merge master into this branch
@cloud-fan can this go ahead then?

@ulysses-you
Copy link
Contributor Author

@EnricoMi thank you for the confirm ! I send a new approach for this issue see #39556

cloud-fan pushed a commit that referenced this pull request Jan 31, 2023
…and AliasAwareQueryOutputOrdering to take all aliases into account

### What changes were proposed in this pull request?
Currently `AliasAwareOutputPartitioning` and `AliasAwareQueryOutputOrdering`
takes only the last alias by aliased expressions into account. We could avoid some extra shuffles and sorts with better alias handling.

### Why are the changes needed?
Performance improvement and this also fix the issue in #39475.

### Does this PR introduce _any_ user-facing change?
Yes, this PR fixes the issue in #39475.

### How was this patch tested?
Added new UT.

Closes #37525 from peter-toth/SPARK-40086-fix-aliasawareoutputexpression.

Lead-authored-by: Peter Toth <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6341b06)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jan 31, 2023
…and AliasAwareQueryOutputOrdering to take all aliases into account

### What changes were proposed in this pull request?
Currently `AliasAwareOutputPartitioning` and `AliasAwareQueryOutputOrdering`
takes only the last alias by aliased expressions into account. We could avoid some extra shuffles and sorts with better alias handling.

### Why are the changes needed?
Performance improvement and this also fix the issue in #39475.

### Does this PR introduce _any_ user-facing change?
Yes, this PR fixes the issue in #39475.

### How was this patch tested?
Added new UT.

Closes #37525 from peter-toth/SPARK-40086-fix-aliasawareoutputexpression.

Lead-authored-by: Peter Toth <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@ulysses-you
Copy link
Contributor Author

@cloud-fan any comments ? thank you

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.4 (code cleanup for planned write)

@cloud-fan cloud-fan closed this in 547737b Feb 20, 2023
cloud-fan pushed a commit that referenced this pull request Feb 20, 2023
### What changes were proposed in this pull request?

Cleanup some unnecessary `Empty2Null` related code

### Why are the changes needed?

V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI

Closes #39475 from ulysses-you/SPARK-41959.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 547737b)
Signed-off-by: Wenchen Fan <[email protected]>
@ulysses-you ulysses-you deleted the SPARK-41959 branch February 21, 2023 01:18
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…and AliasAwareQueryOutputOrdering to take all aliases into account

### What changes were proposed in this pull request?
Currently `AliasAwareOutputPartitioning` and `AliasAwareQueryOutputOrdering`
takes only the last alias by aliased expressions into account. We could avoid some extra shuffles and sorts with better alias handling.

### Why are the changes needed?
Performance improvement and this also fix the issue in apache#39475.

### Does this PR introduce _any_ user-facing change?
Yes, this PR fixes the issue in apache#39475.

### How was this patch tested?
Added new UT.

Closes apache#37525 from peter-toth/SPARK-40086-fix-aliasawareoutputexpression.

Lead-authored-by: Peter Toth <[email protected]>
Co-authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6341b06)
Signed-off-by: Wenchen Fan <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
### What changes were proposed in this pull request?

Cleanup some unnecessary `Empty2Null` related code

### Why are the changes needed?

V1Writes checked idempotency using WriteFiles, so it's unnecessary to check if empty2null if exists again.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI

Closes apache#39475 from ulysses-you/SPARK-41959.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 547737b)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants