Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-equijoin predicate for EliminateCrossJoin #4866

Closed
wants to merge 7 commits into from

Conversation

ygf11
Copy link
Contributor

@ygf11 ygf11 commented Jan 10, 2023

Which issue does this PR close?

Closes #4844.
Closes #4877

Rationale for this change

Currently datafusion will loss the join filter of inner join when run EliminateCrossJoin rule. Following are query and optimized logical plan:

explain verbose select t1.t1_id,t2.t2_id,t3.t3_id 
                 from t1 
                 inner join t2 on t1.t1_id > t2.t2_id 
                 cross join t3 
                 where t3.t3_int > t1.t1_int and t1.t1_int > t2.t2_int;
    Explain
      Projection: t1.t1_id, t2.t2_id, t3.t3_id
        Filter: t3.t3_int > t1.t1_int
          CrossJoin:
            Filter: t1.t1_int > t2.t2_int
              CrossJoin:
                TableScan: t1 projection=[t1_id, t1_int]
                TableScan: t2 projection=[t2_id, t2_int]
            TableScan: t3 projection=[t3_id, t3_int]",

We can see the t1.t1_id > t2.t2_id(join filter) is lost.

This is because EliminateCrossJoin only consider equijoin predicate.
This pr will rewrite EliminateCrossJoin, and choose the right input of join based on both equijoin and non-equijoin predicate. After this pr, the logical plan will be:

      Projection: t1.t1_id, t2.t2_id, t3.t3_id
        Inner Join:  Filter: t3.t3_int > t1.t1_int
          Inner Join:  Filter: t1.t1_int > t2.t2_int AND t1.t1_id > t2.t2_id
            TableScan: t1 projection=[t1_id, t1_int]
            TableScan: t2 projection=[t2_id, t2_int]
          TableScan: t3 projection=[t3_id, t3_int]

The join filter does not lost.

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules labels Jan 10, 2023
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 10, 2023
left_input.schema().clone(),
right_input.schema().clone(),
)?;
let predicate_schemas =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should find column compare pair. like t1.a > t2.a or t1.a > t2.a + 1.
If left or right don't contains column, we shouldn't put them into join_filter.

is_valid_join_predicate(expr, &predicate_schemas)?;
if is_join_filter {
join_filters.push(expr.clone());
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackwener, this try_fold will filter join filters.

Suppose the possible_join_predicates:

  • t1.a > 10
  • t1.c = t2.c
  • t3.b > t2.b

for t1 join t2, the try_fold will return t1.a > 1 and t1.c = t2.c, because all columns of these exprs are from t1 and t2.

This logic is the same as you comment.

I will improve the code readability after the CI fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the try_fold result is not correct, t1.a > 1 should not be treated as join filters(Inner join/Cross join cases).

" TableScan: lineitem projection=[l_partkey, l_quantity], partial_filters=[lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)] [l_partkey:Int64, l_quantity:Decimal128(15, 2)]",
" Filter: (part.p_brand = Utf8(\"Brand#12\") AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND part.p_size <= Int32(15)) AND part.p_size >= Int32(1) [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
" TableScan: part projection=[p_partkey, p_brand, p_size], partial_filters=[part.p_size >= Int32(1), part.p_brand = Utf8(\"Brand#12\") AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND part.p_size <= Int32(15)] [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
" Inner Join: lineitem.l_partkey = part.p_partkey Filter: part.p_brand = Utf8(\"Brand#12\") AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND part.p_size <= Int32(15) [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like part.p_brand = Utf8(\"Brand#12\").
left or right don't contain column, should keep in filter

Copy link
Member

@jackwener jackwener Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ygf11 .

This logic is the same as you comment.

Look like it exists problem.

Copy link
Contributor Author

@ygf11 ygf11 Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's different from you comment.

I think both are correct, because the above expression is only related to part(part is one of the input).
But I prefer join filter, because our physical join executor support it, although the above expression will push down.

Do you have any concern? could you explain more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new plan looks not an optimized plan, we should differ join conditions and filter conditions.

) => {
l_op == r_op && ((ll == rl && lr == rr) || (ll == rr && lr == rl))
}
_ => false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow 'A < B' eq 'B > A' ?

@ozankabak
Copy link
Contributor

Any thoughts on how to make progress on this?

@ygf11
Copy link
Contributor Author

ygf11 commented Jan 22, 2023

Any thoughts on how to make progress on this?

Thanks for asking @ozankabak . It is blocking by #5022, seems NestedLoopJoin has bug, we need fix it first.

@epsio-banay
Copy link
Contributor

Hey @ygf11, I saw the blocking PR was merged, do you think you'll get back to it soon?
Thanks anyway, this PR is really helpful

@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it.

@alamb alamb closed this Apr 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
6 participants