Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend payload join optimization for cases without map columns; fix b… #19904

Merged

Conversation

mlyublena
Copy link
Contributor

@mlyublena mlyublena commented Jun 16, 2023

A collection of bug fixes for the payload join optimization, including the following:

  • fixed cases where not all collected join keys belong to the left-most table
  • abort optimization if the rewrite hides required columns
  • abort rewrite for operators other than Select/Project/Scan/Join. This fixes wrong result bugs when the query contains aggregates and other operators that may change the cardinality of intermediate levels
  • remove restriction for the fact table to have map or array columns. In the future we'll incorporate HBO to make a cost-based decision when to apply this optimization

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from b93b8ec to 42a152a Compare June 20, 2023 18:18
@mlyublena mlyublena marked this pull request as ready for review June 20, 2023 18:50
@mlyublena mlyublena requested a review from a team as a code owner June 20, 2023 18:50
@mlyublena mlyublena requested a review from presto-oss June 20, 2023 18:50
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 3 times, most recently from 49a7fb1 to 341c8fb Compare June 21, 2023 19:16
@mlyublena mlyublena requested a review from rschlussel June 21, 2023 19:19
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from 69d1dbb to 7b93ac7 Compare June 22, 2023 23:42
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing the second commit. Can you add more context to the commit message about what the problem is/why it's a problem?

Also, can you make sure all of your commit messages follow the guidelines here: https://cbea.ms/git-commit/

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 3 times, most recently from 7b79b50 to 72950f2 Compare June 26, 2023 22:16
@mlyublena mlyublena requested a review from rschlussel June 28, 2023 18:12
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 3 times, most recently from cd731ee to 97f7c45 Compare July 17, 2023 20:47
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the flow with all the context setting a bit confusing (even before this change), so I have a hard time being confident about the change.

It would be good to have more tests for cases that we don't expect the optimizer to kick in (e.g. left join with nested right or inner joins). Also, looks like the tests for correctness just validate the row count, but not the values or even column types. Would be good to have tests do a regular assertQuery correctness test for all the test queries to fully validate the results.

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from f508f23 to 29b2c1c Compare August 7, 2023 20:44
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from 7b33404 to 1e86fea Compare August 23, 2023 22:01
@mlyublena
Copy link
Contributor Author

I find the flow with all the context setting a bit confusing (even before this change), so I have a hard time being confident about the change.

It would be good to have more tests for cases that we don't expect the optimizer to kick in (e.g. left join with nested right or inner joins). Also, looks like the tests for correctness just validate the row count, but not the values or even column types. Would be good to have tests do a regular assertQuery correctness test for all the test queries to fully validate the results.

The logic is basically as follows:

  • when visiting nodes on the way down I'm collecting join keys in the join context until I hit a supported leaf node (SelectProjectScan node)
  • if all join keys are contained are contained in the leaf node, I add an Aggregate on top to compute only the join keys without the payload, and add the expression to reattach back the payload in the context
  • some of the intermediate nodes may require columns from the payload table that are not join keys (and the Aggregate hid them): if that is the case I abort the optimization and try from scratch with a new context. I refactored this logic to use a utility function usedColumns that returns the necessary columns for each node type, and use that in validateUsedColumnsForUnaryNode
  • when I reach back the root of the join stack (isTopJoin = true) I reattach the payload using the IsNotDistinctFrom join. I also reset the context in that case to mark that the payload part has been consumed and doesn't need to get reattached

Regarding testing, I'd prefer to create a verifier suite (like this one that I used to debug the remaining issues: https://www.internalfb.com/intern/presto/verifier/results/?suite=test_payload_0808) as it gets harder and harder to enumerate all combinations of joins and unary operators

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from bc1fffe to e9a49ea Compare August 23, 2023 22:46
@mlyublena
Copy link
Contributor Author

Note: will squash the commits after review and will update the commit description

@mlyublena mlyublena requested a review from rschlussel August 30, 2023 23:24
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of the intermediate nodes may require columns from the payload table that are not join keys (and the Aggregate hid them): if that is the case I abort the optimization and try from scratch with a new context. I refactored this logic to use a utility function usedColumns that returns the necessary columns for each node type, and use that in validateUsedColumnsForUnaryNode

I wonder if it's possible to have the initial rewrite take into account the required columns to begin with. It seems not ideal that the rewrite can produce a "wrong" plan that we need to undo afterwards.

Regarding testing, I'd prefer to create a verifier suite (like this one that I used to debug the remaining issues: https://www.internalfb.com/intern/presto/verifier/results/?suite=test_payload_0808) as it gets harder and harder to enumerate all combinations of joins and unary operators

The verifier suite is great, but it's only available internally at Meta, and also doesn't run automatically with every PR as someone develops. Of course it's impossible to be 100% comprehensive, but we do also need sufficient correctness tests in the codebase itself (of both the plan and the results).

@mlyublena
Copy link
Contributor Author

some of the intermediate nodes may require columns from the payload table that are not join keys (and the Aggregate hid them): if that is the case I abort the optimization and try from scratch with a new context. I refactored this logic to use a utility function usedColumns that returns the necessary columns for each node type, and use that in validateUsedColumnsForUnaryNode

I wonder if it's possible to have the initial rewrite take into account the required columns to begin with. It seems not ideal that the rewrite can produce a "wrong" plan that we need to undo afterwards.

during top-down traversal when we explore a node we might not have encountered all the join keys yet. So we won't know if a column used in an operator is a join key from below, or is not a join key and will be hidden by the rewrite.
Consider for example the following query: Join(Filter A=5 (join(join(R,S)), T)
When we explore the filter node we know what the top level join keys are but not what the join keys of the children are as we haven't traversed them yet. So we won't know if A is a key or not.

@rschlussel
Copy link
Contributor

during top-down traversal when we explore a node we might not have encountered all the join keys yet. So we won't know if a column used in an operator is a join key from below, or is not a join key and will be hidden by the rewrite.
Consider for example the following query: Join(Filter A=5 (join(join(R,S)), T)
When we explore the filter node we know what the top level join keys are but not what the join keys of the children are as we haven't traversed them yet. So we won't know if A is a key or not.

But we'll know when we explore the child joins that A is a needed output column for a later node, so we should be able to choose not to rewrite at that point.

@mlyublena
Copy link
Contributor Author

during top-down traversal when we explore a node we might not have encountered all the join keys yet. So we won't know if a column used in an operator is a join key from below, or is not a join key and will be hidden by the rewrite.
Consider for example the following query: Join(Filter A=5 (join(join(R,S)), T)
When we explore the filter node we know what the top level join keys are but not what the join keys of the children are as we haven't traversed them yet. So we won't know if A is a key or not.

But we'll know when we explore the child joins that A is a needed output column for a later node, so we should be able to choose not to rewrite at that point.

That's a great point - I ran into this in another optimization where I wanted to know the required columns from the top node (and that one was a rule-based optimization so things were even harder). I think it's worth having built-in support for passing required properties (such as required columns) from parent operators and making it a first-class citizen in the optimizer.

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from c3583ee to 2bc6828 Compare September 13, 2023 16:45
@mlyublena
Copy link
Contributor Author

It would be good to have more tests for cases that we don't expect the optimizer to kick in (e.g. left join with nested right or inner joins). Also, looks like the tests for correctness just validate the row count, but not the values or even column types. Would be good to have tests do a regular assertQuery correctness test for all the test queries to fully validate the results.

Added more tests to cover for the various operators where we abort the transformation. Also added assertQuery tests to check for correctness of the result

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from 2bc6828 to 813e033 Compare September 13, 2023 17:18
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from 813e033 to a565008 Compare September 22, 2023 19:12
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from a565008 to 6c8e6b3 Compare October 2, 2023 18:56
@mlyublena mlyublena requested a review from pranjalssh October 2, 2023 18:58
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from 6c8e6b3 to 6e4674a Compare October 2, 2023 19:04
{
return context.defaultRewrite(node, context.get());
// for binary and n-ary operators, do a default rewrite with a new context for each child to avoid lateral propagation of context information
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says for binary and n-ary, but the code would do the same for unary operators as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recently fixed that, forgot to update/remove the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -481,6 +513,60 @@ private boolean supportedJoinKeyTypes(Set<VariableReferenceExpression> joinKeys)
{
return joinKeys.stream().allMatch(key -> key.getType() instanceof VarcharType || isNumericType(key.getType()));
}

// TODO: move this as a member function of PlanNode
private static Set<VariableReferenceExpression> usedColumns(PlanNode plan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is only called when plan is FilterNode. Why do we need to handle other cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I narrowed down the supported cases to only filter/project/scan + join, will clean up the implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

context.get().reset();
return context.defaultRewrite(planNode, new JoinContext());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can calling rewriter multiple times from a node lead to high time complexity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally, we want to do this upfront and not have to discard plans after they have been computed - I'd prefer to do this in a separate PR as this one was meant to fix a large number of the original bugs with the optimization (and the optimization is disabled by default so no latency impact on production workloads at the moment)

@pranjalssh pranjalssh requested a review from kaikalur October 5, 2023 17:02
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from ba00fd4 to b74eac7 Compare October 11, 2023 17:48
@mlyublena mlyublena requested a review from pranjalssh October 11, 2023 17:49
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from 0953a2a to 475ee94 Compare October 11, 2023 21:15
Copy link
Contributor

@pranjalssh pranjalssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

private static PlanNode ensurePlanHasAllRequiredColumns(FilterNode filterNode, PlanNode newChildNode, RewriteContext<JoinContext> context)
{
if (filterNode.getSources().get(0) == newChildNode) {
// plan cource didn't change, nothing to check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/cource/source

@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch 2 times, most recently from 1698753 to e837b7c Compare October 12, 2023 21:58

Verified

This commit was signed with the committer’s verified signature.
mcollina Matteo Collina
…fixes

Several fixes and improvements related to payload join optimization:
- enable optimization even if the fact table doesn't have any map or array columns
- abort rewrite if there are operators other than select/project/filter/join as it may introduce wrong results
- abort rewrite if collected join keys contain a column from the RHS of the current join
SELECT l.* FROM lineitem l left join orders o on (l.orderkey = o.orderkey) left join part p on (l.partkey=p.partkey) left join (select 1 as m) mm on p.partkey=mm.m

The join key from the last join condition (p.partkey) cannot be pushed to the left-most table in the join (lineitem). This change takes care of only propagating those join keys that
come from the LHS of the join.

- other fixes in propagating optimization context across operators
- additional tests
@mlyublena mlyublena force-pushed the payload-join-enable-nonmap-keys branch from e837b7c to b963b41 Compare October 12, 2023 22:15
@mlyublena mlyublena merged commit bb13af3 into prestodb:master Oct 12, 2023
@wanglinsong wanglinsong mentioned this pull request Dec 8, 2023
26 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants