Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(proto): add physical hashJoin & mergeJoin physical relations #336

Merged
merged 1 commit into from
Jan 16, 2023
Merged

feat(proto): add physical hashJoin & mergeJoin physical relations #336

merged 1 commit into from
Jan 16, 2023

Conversation

chaojun-zhang
Copy link
Contributor

@chaojun-zhang chaojun-zhang commented Sep 19, 2022

This PR includes hashJoin and mergeJoin physical plan relations according the spec

@chaojun-zhang
Copy link
Contributor Author

@jvanstraten , sorry for my mistake that close the original PR , I can't reopen it so raise a new PR here

@westonpace
Copy link
Member

Thank you for contributing these. I'm curious about the duplication here, though I don't know if it's a bad thing entirely.

In #331 we decided that trying too hard to avoid duplication (e.g. by basing as-of join on a logical join) was probably a bad idea. However, in this PR we are providing direct physical specializations of the logical join. Do you think it would make sense to use inheritance here? Is there a reason you chose not to? Can you imagine the physical specializations diverging from the logical specifications (other than just adding on new fields) at some point?

@jvanstraten
Copy link
Contributor

I don't know if this is true, but if all physical joins are likely to use the same fields, shouldn't the physical implementation just be an optional field on JoinRel?

If not, for some of the cases like this for the validator-specific protos (especially scalar vs aggregate vs window functions) I made separate messages for the common fields, in this case for example JoinCommon or something. That way, if we add logical-specific or physical-specific fields later, we can do that outside that common structure, but also avoid duplication. It would be a breaking change for JoinRel to add this indirection, though.

@chaojun-zhang
Copy link
Contributor Author

To avoid duplicate, I think we may have two options:

  • option 1: add new field inside JoinRel like below to distinguish logical join , hash join , merge join and nested loop join
JoinKind kind = 7;

 enum JoinKind {
   JOIN_KIND_LOGICAL = 0;
   JOIN_KIND_HASH = 1;
   JOIN_KIND_MERGE = 2;
   JOIN_KIND_NESTED_LOOP = 3;
 }
  • option 2: extract join fields to JoinCommon message, and use inheritance for logical join ,hash join ,merge join and nested loop join

I'm not sure which option is prefer to, my thinking the option 1 may be better since these join operator share the same attributes according to the spec in substrait.io?

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think we should avoid collapsing these. I think they actually need to be different. They have different physical property maintenance characterististics and will probably develop in different directions.

Once we get a good definition, let's make sure we also update the associated site docs.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
@jvanstraten jvanstraten mentioned this pull request Sep 28, 2022
@CLAassistant
Copy link

CLAassistant commented Oct 6, 2022

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review is long overdue. I have a few more questions.

Comment on lines 22 to 23
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a physical relation do we want to explicitly declare that the left side is the build side and the right side is the probe side (or vice versa)? It looks like you are already making this assumption in your description of left keys and right keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave it to the user to decide which side is probe or build , your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picking the correct side is generally an optimizer task (requires statistics / heuristics). My assumption for more physical nodes is that they've already gone through an optimizer.

For a single optimizer/consumer pair this wouldn't be a problem. They can both agree offline which side should be build and probe. However, imagine an engine-independent substrait optimizer created this plan and hoped for it to be consumed by many different engines. In that case the different consumers might pick left/right build/probe differently.

That being said, I haven't worked much with optimizers. CC @jacques-n who has more experience here.

site/docs/relations/physical_relations.md Outdated Show resolved Hide resolved
site/docs/relations/physical_relations.md Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
@chaojun-zhang
Copy link
Contributor Author

@jacques-n @westonpace Please take a further review and help trigger the check again.

Comment on lines 22 to 23
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either we should assign these as probe and build (my preference would be to be explicit) or we should specifically explain that consumers should assign probe & build based on whatever criteria they want. As it is, I think this is ambiguous.

Copy link
Contributor Author

@chaojun-zhang chaojun-zhang Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace , thanks for your comment, I found that the documentation has mentioned that the right input is build-side, see this. So let's always assume the left input is probe-side and right input is build-side. your thoughts?

I also updated in this section [https://substrait.io/relations/physical_relations/#hash-equijoin-properties].

Please help trigger the check again. thanks.

@chaojun-zhang
Copy link
Contributor Author

@westonpace I have update the document and help trigger the check again.

@westonpace
Copy link
Member

:( It seems our release script is failing (unrelated to this change) I will see if I can fix that real quick.

@westonpace
Copy link
Member

Fixed. Please rebase and I will run again.

@westonpace westonpace requested a review from jacques-n January 13, 2023 20:16
westonpace
westonpace previously approved these changes Jan 13, 2023
@westonpace
Copy link
Member

westonpace commented Jan 13, 2023

This looks good to me. Will also need approval from @jacques-n

@github-actions
Copy link

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@chaojun-zhang chaojun-zhang requested review from westonpace and jacques-n and removed request for jacques-n and westonpace January 14, 2023 06:28
@chaojun-zhang chaojun-zhang removed the request for review from westonpace January 15, 2023 13:34
@chaojun-zhang chaojun-zhang requested review from jacques-n and westonpace and removed request for jacques-n January 15, 2023 13:34
@chaojun-zhang
Copy link
Contributor Author

@westonpace PR rebased, Please take a look again. Thanks!

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. With @westonpace approval, I'm going to merge.

@jacques-n jacques-n merged commit 431651e into substrait-io:main Jan 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants