Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33400][SQL] Normalize sameOrderExpressions in SortOrder to avoid unnecessary sort operations #30302

Closed

Conversation

prakharjain09
Copy link
Contributor

@prakharjain09 prakharjain09 commented Nov 9, 2020

What changes were proposed in this pull request?

This pull request tries to normalize the SortOrder properly to prevent unnecessary sort operators. Currently the sameOrderExpressions are not normalized as part of AliasAwareOutputOrdering.

Example: consider this join of three tables:

  """
    |SELECT t2id, t3.id as t3id
    |FROM (
    |    SELECT t1.id as t1id, t2.id as t2id
    |    FROM t1, t2
    |    WHERE t1.id = t2.id
    |) t12, t3
    |WHERE t1id = t3.id
  """.

The plan for this looks like:

  *(8) Project [t2id#1059L, id#1004L AS t3id#1060L]
  +- *(8) SortMergeJoin [t2id#1059L], [id#1004L], Inner
     :- *(5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0         <-----------------------------
     :  +- *(5) Project [id#1000L AS t2id#1059L]
     :     +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
     :        :- *(2) Sort [id#996L ASC NULLS FIRST ], false, 0
     :        :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426]
     :        :     +- *(1) Range (0, 10, step=1, splits=2)
     :        +- *(4) Sort [id#1000L ASC NULLS FIRST ], false, 0
     :           +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432]
     :              +- *(3) Range (0, 20, step=1, splits=2)
     +- *(7) Sort [id#1004L ASC NULLS FIRST ], false, 0
        +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443]
           +- *(6) Range (0, 30, step=1, splits=2)

In this plan, the marked sort node could have been avoided as the data is already sorted on "t2.id" by the lower SortMergeJoin.

Why are the changes needed?

To remove unneeded Sort operators.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New UT added.

@github-actions github-actions bot added the SQL label Nov 9, 2020
@maropu
Copy link
Member

maropu commented Nov 10, 2020

This is related to #30300 ?

@prakharjain09 prakharjain09 changed the title [WIP][SPARK-33400][SQL] Avoid unnecessary Sorts in Physical planning when there are aliases in Projects [SPARK-33400][SQL] Normalize sameOrderExpressions in SortOrder to avoid unnecessary sort operations Nov 17, 2020
@prakharjain09
Copy link
Contributor Author

@maropu I have updated the pull request to cover one scenario which was not covered by #30300 . Please review the changes.

@prakharjain09
Copy link
Contributor Author

cc - @cloud-fan @imback82

@maropu
Copy link
Member

maropu commented Nov 17, 2020

ok to test

@maropu
Copy link
Member

maropu commented Nov 17, 2020

LGTM if the existing tests pass.

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35839/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35839/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131235 has finished for PR 30302 at commit 997f98c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SubExprEvaluationRuntime(cacheMaxEntries: Int)
  • case class ExpressionProxy(
  • case class ResultProxy(result: Any)

@prakharjain09
Copy link
Contributor Author

@maropu @cloud-fan Thanks for reviewing the changes. Please merge the changes if there are no further comments.

I will work on a followup PR to make sameOrderExpressions child of SortOrder.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 0b0fb70 Nov 19, 2020
maropu pushed a commit that referenced this pull request Dec 1, 2020
### What changes were proposed in this pull request?
This is a followup of #30302 . As part of this PR, sameOrderExpressions set is made part of children of SortOrder node - so that they don't need any special handling as done in #30302 .

### Why are the changes needed?
sameOrderExpressions should get same treatment as child. So making them part of children helps in transforming them easily.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants