-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(proto): add physical hashJoin & mergeJoin physical relations #336
Conversation
@jvanstraten , sorry for my mistake that close the original PR , I can't reopen it so raise a new PR here |
Thank you for contributing these. I'm curious about the duplication here, though I don't know if it's a bad thing entirely. In #331 we decided that trying too hard to avoid duplication (e.g. by basing as-of join on a logical join) was probably a bad idea. However, in this PR we are providing direct physical specializations of the logical join. Do you think it would make sense to use inheritance here? Is there a reason you chose not to? Can you imagine the physical specializations diverging from the logical specifications (other than just adding on new fields) at some point? |
I don't know if this is true, but if all physical joins are likely to use the same fields, shouldn't the physical implementation just be an optional field on If not, for some of the cases like this for the validator-specific protos (especially scalar vs aggregate vs window functions) I made separate messages for the common fields, in this case for example |
To avoid duplicate, I think we may have two options:
I'm not sure which option is prefer to, my thinking the option 1 may be better since these join operator share the same attributes according to the spec in substrait.io? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I think we should avoid collapsing these. I think they actually need to be different. They have different physical property maintenance characterististics and will probably develop in different directions.
Once we get a good definition, let's make sure we also update the associated site docs.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This review is long overdue. I have a few more questions.
| Left Input | A relational input. | Required | | ||
| Right Input | A relational input. | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a physical relation do we want to explicitly declare that the left side is the build side and the right side is the probe side (or vice versa)? It looks like you are already making this assumption in your description of left keys and right keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave it to the user to decide which side is probe or build , your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picking the correct side is generally an optimizer task (requires statistics / heuristics). My assumption for more physical nodes is that they've already gone through an optimizer.
For a single optimizer/consumer pair this wouldn't be a problem. They can both agree offline which side should be build and probe. However, imagine an engine-independent substrait optimizer created this plan and hoped for it to be consumed by many different engines. In that case the different consumers might pick left/right build/probe differently.
That being said, I haven't worked much with optimizers. CC @jacques-n who has more experience here.
@jacques-n @westonpace Please take a further review and help trigger the check again. |
| Left Input | A relational input. | Required | | ||
| Right Input | A relational input. | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either we should assign these as probe and build (my preference would be to be explicit) or we should specifically explain that consumers should assign probe & build based on whatever criteria they want. As it is, I think this is ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace , thanks for your comment, I found that the documentation has mentioned that the right input is build-side, see this. So let's always assume the left input is probe-side and right input is build-side. your thoughts?
I also updated in this section [https://substrait.io/relations/physical_relations/#hash-equijoin-properties].
Please help trigger the check again. thanks.
@westonpace I have update the document and help trigger the check again. |
:( It seems our release script is failing (unrelated to this change) I will see if I can fix that real quick. |
Fixed. Please rebase and I will run again. |
This looks good to me. Will also need approval from @jacques-n |
ACTION NEEDED Substrait follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
@westonpace PR rebased, Please take a look again. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. With @westonpace approval, I'm going to merge.
This PR includes hashJoin and mergeJoin physical plan relations according the spec