-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support non-equijoin predicate for EliminateCrossJoin #4866
Conversation
left_input.schema().clone(), | ||
right_input.schema().clone(), | ||
)?; | ||
let predicate_schemas = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should find column compare pair. like t1.a > t2.a
or t1.a > t2.a + 1
.
If left or right don't contains column, we shouldn't put them into join_filter.
is_valid_join_predicate(expr, &predicate_schemas)?; | ||
if is_join_filter { | ||
join_filters.push(expr.clone()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackwener, this try_fold
will filter join filters.
Suppose the possible_join_predicates
:
- t1.a > 10
- t1.c = t2.c
- t3.b > t2.b
for t1 join t2, the try_fold
will return t1.a > 1
and t1.c = t2.c
, because all columns of these exprs are from t1
and t2
.
This logic is the same as you comment.
I will improve the code readability after the CI
fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the try_fold
result is not correct, t1.a > 1
should not be treated as join filters(Inner join/Cross join cases).
" TableScan: lineitem projection=[l_partkey, l_quantity], partial_filters=[lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)] [l_partkey:Int64, l_quantity:Decimal128(15, 2)]", | ||
" Filter: (part.p_brand = Utf8(\"Brand#12\") AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND part.p_size <= Int32(15)) AND part.p_size >= Int32(1) [p_partkey:Int64, p_brand:Utf8, p_size:Int32]", | ||
" TableScan: part projection=[p_partkey, p_brand, p_size], partial_filters=[part.p_size >= Int32(1), part.p_brand = Utf8(\"Brand#12\") AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND part.p_size <= Int32(15)] [p_partkey:Int64, p_brand:Utf8, p_size:Int32]", | ||
" Inner Join: lineitem.l_partkey = part.p_partkey Filter: part.p_brand = Utf8(\"Brand#12\") AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND part.p_size <= Int32(5) OR part.p_brand = Utf8(\"Brand#23\") AND lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND part.p_size <= Int32(10) OR part.p_brand = Utf8(\"Brand#34\") AND lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND part.p_size <= Int32(15) [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like part.p_brand = Utf8(\"Brand#12\")
.
left or right don't contain column, should keep in filter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's different from you comment.
I think both are correct, because the above expression is only related to part
(part
is one of the input).
But I prefer join filter, because our physical join executor support it, although the above expression will push down.
Do you have any concern? could you explain more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new plan looks not an optimized plan, we should differ join conditions and filter conditions.
) => { | ||
l_op == r_op && ((ll == rl && lr == rr) || (ll == rr && lr == rl)) | ||
} | ||
_ => false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we allow 'A < B
' eq 'B > A
' ?
Any thoughts on how to make progress on this? |
Thanks for asking @ozankabak . It is blocking by #5022, seems |
Hey @ygf11, I saw the blocking PR was merged, do you think you'll get back to it soon? |
Since this has been open for more than a year, closing it down. Feel free to reopen if/when you keep working on it. |
Which issue does this PR close?
Closes #4844.
Closes #4877
Rationale for this change
Currently datafusion will loss the
join filter
of inner join when runEliminateCrossJoin
rule. Following are query and optimized logical plan:We can see the
t1.t1_id > t2.t2_id
(join filter) is lost.This is because
EliminateCrossJoin
only consider equijoin predicate.This pr will rewrite
EliminateCrossJoin
, and choose the right input of join based on both equijoin and non-equijoin predicate. After this pr, the logical plan will be:The join filter does not lost.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?