-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SanityCheckPlan should compare UnionExec inputs to requirements for output (parent). #12414
SanityCheckPlan should compare UnionExec inputs to requirements for output (parent). #12414
Conversation
…re constant columns in sort
b5e60ab
to
a26c49e
Compare
…based upon the Union's parent vs the Union's input
a26c49e
to
782e18d
Compare
Thanks @wiedld for the detailed description of the problem. Once it's ready, I will review it thoroughly. |
When I briefly look at the solution, I actually think we shouldn't patch the sanity checker to relax it. A more accurate solution seems to be your alternate proposal:
You can actually add constants, as there is a flag, " However, after adding constants, I am not sure if |
I agree with @berkaysynnada that we should fix the calculation rather than relax the sanity checker in DataFusion We (already) patch DataFusion to skip the sanity check for certain plan nodes so we aren't blocked downstream. I think we should focus on the "right"fix that allows the sanity checker to run but also correctly recognize the plan is valid |
…ictions based upon the Union's parent vs the Union's input" This reverts commit a50edc3.
caveat: this has an unintended side effect, as the EnforceSorting removes the sort_expr from one input/side of the UnionExec (where it's not constant)
d928d04
to
4bd4db0
Compare
04)------SortExec: TopK(fetch=2), expr=[d@4 ASC NULLS LAST,c@1 ASC NULLS LAST,a@2 ASC NULLS LAST,b@0 ASC NULLS LAST], preserve_partitioning=[false] | ||
04)------SortExec: TopK(fetch=2), expr=[d@4 ASC NULLS LAST,c@1 ASC NULLS LAST,b@0 ASC NULLS LAST], preserve_partitioning=[false] | ||
05)--------ProjectionExec: expr=[b@1 as b, c@2 as c, a@0 as a, NULL as a0, d@3 as d] | ||
06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b, c, d], output_ordering=[c@2 ASC NULLS LAST], has_header=true | ||
07)------SortExec: TopK(fetch=2), expr=[d@4 ASC NULLS LAST,c@1 ASC NULLS LAST,a0@3 ASC NULLS LAST,b@0 ASC NULLS LAST], preserve_partitioning=[false] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can actually add constants, as there is a flag, "across_partitions," that indicates whether the value is constant across all partitions or only in its corresponding partition
Made the change per suggestion (demonstration only, not final commits), and I'm not sure this is the proper fix. If I add the constants on the union's equivalence properties, there are other ramifications because:
- the EquivalenceProperties::output_ordering remove the constants
- the EquivalenceProperties::normalized_oeq_class remove the constants
In the example above, the sort orders [a@2 ASC NULLS LAST]
and [a0@3 ASC NULLS LAST]
are removed on non-constant projections. The EnforceSorting optimization adds (and pushes down) the SortExecs, but the change itself really occurs based upon the EquivalenceProperties's definition of a (non-constant) sort order. Since the UnionExec listed certain constants -- they are removed from the sort order.
I started hacking around this in the EnforceSorting, but it feels like the suggested change (adding to constants) may be actually changing the definition of what the constants are? 🤔 Am I heading in the right direction here?
I think I misled you unintentionally. The part we need to focus on is datafusion/datafusion/physical-expr/src/equivalence/properties.rs Lines 1559 to 1578 in 3ece7a7
When I think about what we need to obtain as the finest ordering from the union of two children:
Then the union should have: To get this result, we need to keep track of the left and right constants of the children, which are not allowed to be placed in the union's constants. Going back to my example above, these child-specific constants can be treated as:
If we reapply the ordering calculation part I shared above on these updated child properties, we should get what we need: |
I filed #12446 to track this issue |
I pushed up a possible solution, which has a slight change to the test case output ordering as per this commit. Please lmk @berkaysynnada if this is the solution we are seeking. 🙏🏼 |
/// child2 = order by a, b, c | ||
/// => union's joint order is a0, a, b, c. | ||
fn calculate_joint_ordering( | ||
lhs: &EquivalenceProperties, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't actually fully understand why do you need such a function. I think its logic is not correct.
SCHEMA: a-b-c-a0
CHILD1: (5,15,25,35),(4,15,25,36),(3,15,25,37)
CHILD2: (5,15,25,35),(6,15,25,34),(7,15,25,33)
How could we deduce unioned output is (a0, a, b, c)?
// Expected | ||
vec![], | ||
// Expected: union sort orders | ||
vec![vec!["a", "b", "c"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these test changes are not correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- these changes make the expected output incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you don't need to add a new logic for sort expr's. You should only focus on the following conversion, when constant expressions appear as the global order of the other children:
LEFT: orderings: [[a]], constants: [c] → orderings: [[a, c], [c, a]], constants: []
RIGHT: orderings: [[c]], constants: [a] → orderings: [[c, a], [a, c]], constants: []
When such an ordering and constant properties exist in children (LEFT: orderings: [[a]], constants: [c], RIGHT: orderings: [[c]], constants: [a]), we should treat them as if the ordering is permutation of existing orderings and the constant.
I think this is due to how I explained the issue. The actual children are:
That comes out of this query reproducer, where we have two fields a and a0, which are nulls on alternating sides:
I added to the complexity by trying to handle a generalized case, and misusing the |
I looked at this problem some more (not really the code, but just a smaller reproducer) and left my notes here #12446 (comment) and here #12446 (comment). TLDR is it is not clear to me the issue is with the union properties calculation 🤔 |
I plan to take over this PR / fix (likely tomorrow) -- I will keep #12446 updated as well |
Which issue does this PR close?
Resolves #12446
We have failures caused by SanityCheckPlan not considering the constants in the UnionExec input orderings. We made an exact reproducer in the first commit's test case.
Rationale for this change
Given the following scenario:
The SanityCheckPlan was failing, because the UnionExec has the following orderings vs constants:
which means that no single ordering (minus constants) could fulfill the
[proj1 sorted, proj2 sorted]
sort requirement. Because the UnionExec has it's input orderings but not it's input constants.What changes are included in this PR?
I elected to perform the comparisons (during the SanityPlanCheck) between the UnionExec's parent and children, skipping the union exec itself.
Alternatively, I could have elected to:
Are these changes tested?
Yes.
Are there any user-facing changes?
No.