Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-42049][SQL][FOLLOWUP] Always filter away invalid ordering/partitioning #40137

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Feb 23, 2023

What changes were proposed in this pull request?

This is a follow-up of #37525 . When the project list has aliases, we go to the projectExpression branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering.

This PR fixes it.

Why are the changes needed?

to make sure we always return valid output partitioning/ordering.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

cc @peter-toth

@github-actions github-actions bot added the SQL label Feb 23, 2023
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @sunchao , @huaxingao , @viirya , too.

@@ -114,7 +114,11 @@ trait AliasAwareQueryOutputOrdering[T <: QueryPlan[T]]
}
}.takeWhile(_.isDefined).flatten.toSeq
} else {
orderingExpressions
// Make sure the returned ordering are valid (only reference output attributes of the current
Copy link
Contributor

@peter-toth peter-toth Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks for this follow-up!

But, I'm not sure this part is correct as sortOrder.children should be filtered with _.references.subsetOf(outputSet) separately.

E.g. this test (is a bit artifical though) fails now:

  test("SPARK-42049: Improve AliasAwareOutputExpression - no alias but still prune expressions 2") {
    withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
      Seq(CollapseProject.ruleName, ColumnPruning.ruleName).mkString(",")) {
      val df = spark.range(2).select($"id" as "a", $"id" as "b").select($"a")
      val outputOrdering = df.queryExecution.optimizedPlan.outputOrdering

      assert(outputOrdering.size == 1)
      assert(outputOrdering.head.child.asInstanceOf[Attribute].name == "a")
      assert(outputOrdering.head.sameOrderExpressions.size == 0)
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

@viirya
Copy link
Member

viirya commented Feb 23, 2023

Hmm, the failed test seems a related one.

@dongjoon-hyun
Copy link
Member

Oh is it still failing?

@viirya
Copy link
Member

viirya commented Feb 23, 2023

I think it was failing due to latest commit.

@cloud-fan
Copy link
Contributor Author

The failed test checks invalid ordering and I've updated it.

dongjoon-hyun pushed a commit that referenced this pull request Feb 24, 2023
…itioning

### What changes were proposed in this pull request?

This is a follow-up of #37525 . When the project list has aliases, we go to the `projectExpression` branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering.

This PR fixes it.

### Why are the changes needed?

to make sure we always return valid output partitioning/ordering.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #40137 from cloud-fan/alias.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 72922ad)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/3.4.
Thank you, @cloud-fan , @peter-toth , @viirya .

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…itioning

### What changes were proposed in this pull request?

This is a follow-up of apache#37525 . When the project list has aliases, we go to the `projectExpression` branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering.

This PR fixes it.

### Why are the changes needed?

to make sure we always return valid output partitioning/ordering.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes apache#40137 from cloud-fan/alias.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 72922ad)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants