Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant cast to varchar in join clause #21050

Merged
merged 1 commit into from
Oct 12, 2023

Conversation

feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Oct 5, 2023

Description

Rewrite query such as select select * from orders o join customer c on cast(o.custkey as varchar) = cast(c.custkey as varchar) to select select * from orders o join customer c on o.custkey = c.custkey

Motivation and Context

Varchar is expensive to hash, and join with varchar keys can be expensive. For the cases above, removing the cast can reduce the cost of join operation.

Impact

Test in production queries show it decrease both CPU and latency by half.

Test Plan

Unit tests and verifier run

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add an optimization, which removes the redundant cast to varchar expression in join condition. The optimizer is controlled by session property `remove_redundant_cast_to_varchar_in_join` and enabled by default.

Sorry, something went wrong.

@feilong-liu feilong-liu requested a review from a team as a code owner October 5, 2023 22:22
@feilong-liu feilong-liu requested a review from presto-oss October 5, 2023 22:22
@feilong-liu feilong-liu marked this pull request as draft October 5, 2023 22:23
@feilong-liu feilong-liu force-pushed the join_cast_varchar branch 2 times, most recently from 4d20653 to a9ba850 Compare October 6, 2023 22:24
@feilong-liu feilong-liu marked this pull request as ready for review October 6, 2023 22:24
@@ -1253,7 +1257,7 @@ public void testPayloadJoins()
"SELECT l.* FROM (select orderkey+1 as ok1, * from lineitem_map) l left join orders o on (l.ok1 = o.orderkey+1) left join part p on (l.partkey=p.partkey)",
"SELECT l.* FROM (select *, cast(orderkey as int) as ok from lineitem_map) l left join (select *, cast(orderkey as int) as ok from orders) o on (l.ok = o.ok) left join (select *, cast(partkey as int) as pk from part) p on (l.ok=p.pk)",
"with lm as (select quantity + 1 as q1, * from lineitem_map) SELECT l.* FROM (select * from lm) l left join orders o on (l.orderkey = o.orderkey) left join part p on (l.partkey=p.partkey)",

// This is the query which fail when REMOVE_REDUNDANT_CAST_TO_VARCHAR_IN_JOIN set to true
"SELECT l.* FROM lineitem_map l left join orders o on (cast(l.orderkey as varchar) = cast(o.orderkey as varchar)) left join part p on (l.partkey=p.partkey)",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mlyublena The optimization added surfaced a bug in payload join optimizer, i.e. when payload join optimization is not enabled, this query runs successfully with the optimization in this PR, but when payload join optimization is enabled, this query will fail with the optimization in this PR.

Copy link
Contributor

@mlyublena mlyublena Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an outstanding PR that fixes a lot of bugs in payload join optimizer: #19904
My guess is that the case you found is already fixed there

.build();
assertUpdate("create table lineitem_map as select *, map(ARRAY[1,3], ARRAY[2,4]) as m1, map(ARRAY[1,3], ARRAY[2,4]) as m2 from lineitem", 60175);
assertUpdate("create table part_map as select *, map(ARRAY[1,3], ARRAY[2,4]) as m3, map(ARRAY[1,3], ARRAY[2,4]) as m4 from part", 2000);
assertQueryWithSameQueryRunner(session, "SELECT l.* FROM lineitem_map l left join orders o on (cast(l.orderkey as varchar) = cast(o.orderkey as varchar)) left join part p on (l.partkey=p.partkey)", disabled);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the query found in payload join test here, make sure that it succeeds with this optimization, with payload join disabled.

* </pre>
* We will rely on optimizations later to remove unnecessary cast (if not used) and identity projection here.
*
* Notice that we do not apply similar optimizations to queries with similar join condition like `cast(bigint as varchar) = varchar`. In general it can be converted to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we allow this transformation for all non-lossy/deterministic casts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied for cast(bigint/integer as varchar) as varchar has high cost for hash operation. Other cases may not show as big improvement here.

@@ -282,6 +282,8 @@ public class FeaturesConfig

private boolean preProcessMetadataCalls;

private boolean removeRedundantCastToVarcharInJoin = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a session property? isn't this always beneficial?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case we need to revert it if any bugs found later.

@feilong-liu feilong-liu force-pushed the join_cast_varchar branch 4 times, most recently from ebffe7d to a558373 Compare October 11, 2023 22:11
{
PlanNode leftInput = context.getLookup().resolve(node.getLeft());
PlanNode rightInput = context.getLookup().resolve(node.getRight());
if (!(leftInput instanceof ProjectNode) || !(rightInput instanceof ProjectNode)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that transformation won't work if the projection happens some levels down the tree?
for example

(select * from (select *, cast(key as varchar) as newkey) where ...) join T on newkey=cast(... as varchar)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it will not work. I tried to do it, but it will complicated this rule as we not only need to extract this pattern traversing the tree, but also need to propagate the variables used in the cast expression (it not propagated already) to the join node. Since this rule already covered the query which motivated this optimization, I decided to do it in a simplified way for now.

RowExpression rightAssignment = ((CallExpression) rightProjectAssignment).getArguments().get(0);

if (!leftAssignment.getType().equals(rightAssignment.getType())) {
leftAssignment = castToBigInt(functionAndTypeManager, leftAssignment);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to do this if the expressions are already bigints

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is checked in castToBigInt function, with

        if (rowExpression.getType().equals(BIGINT)) {
            return rowExpression;
        }

Copy link
Contributor

@mlyublena mlyublena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@feilong-liu feilong-liu force-pushed the join_cast_varchar branch 2 times, most recently from 0f161cf to 2643e89 Compare October 12, 2023 04:52
@github-actions
Copy link

github-actions bot commented Oct 12, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff c4df72a...47c9b35.

No notifications.

@feilong-liu feilong-liu force-pushed the join_cast_varchar branch 2 times, most recently from a84f670 to ed0da1b Compare October 12, 2023 17:47
@feilong-liu feilong-liu merged commit 8d9dfa9 into prestodb:master Oct 12, 2023
@feilong-liu feilong-liu deleted the join_cast_varchar branch October 12, 2023 19:28
@wanglinsong wanglinsong mentioned this pull request Dec 8, 2023
26 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants