-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
execinfrapb: add InvertedJoinerSpec for joins between #48453
Conversation
two tables using an inverted index on one table This is extracted from cockroachdb#48019 which has more context, including the implementation. Release note: None
Can we reuse the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am missing something about the CockroachDB context, so I am very open to being convinced otherwise, but currently I'm not seeing the benefit:
- there is a long comment before InvertedJoinSpec that is important for explaining what this is about. That will need to be mashed together with the JoinReaderSpec comment which I already find a bit hard to understand.
- Both have
Table
,IndexIdx
, but note thatIndexIdx
needs a different comment. - InvertedJoinerSpec has a single
LookupColumn
-- that can be stuffed into theLookupColumns
slice, but again a different comment - Remaining fields are
InvertedExpr
,OnExpr
andType
: First is unique to invertedJoiner. All 3 need comments very specific to inverted joins.
A mashup of these two will need to have a long comment explaining the different cases and for each field also state the different cases. That would hinder readability. There is a history in proto land of mashing together different things into a single proto and then being left with a bad headache later (there are infamous examples at Google preceding extensions
support in proto2), but those were all for transport purposes (and eventual demultiplexing, which requires the same proto type). That would not be an issue here. Also, these similarities between specs are shared among some of the other specs like HashJoinerSpec
and MergeJoinerSpec
, so I am not sure why we are trying to reuse the same proto here.
Are you concerned that this will ensure that we will never merge invertedJoiner
and joinReader
? If yes, my current thinking is
- any merging will need to process the specs and set up the data-structures somewhat differently, so a different spec is not a blocker
- currently I don't see any benefit from merging the join execution code -- I think it will obscure the code logic, as I mentioned in a detailed comment in [DNM] rowexec, rowcontainer: add an invertedJoiner, for #48019 (which I've copy-pasted below [1] since I think it got lost in the large variety of comments in that PR) and in a comment in rowexec: make lookup joins with no required ordering more efficient #48439 . And if we get serious about inverted indexes, especially the on-disk representation, the logic will further diverge. Perhaps the issue is that the "inverted join" was initially incorrectly characterized as a "lookup join" -- it isn't really a lookup join.
Thoughts?
[1]
Stylistically it follows a similar pattern. Next() is very similar though that seems hardly to justify refactoring since it is a trivial for loop. The differences are unfortunately buried in each method:
readInput() needs to do a different thing with each input row. And then when the batch is read it needs to do a different thing for preparing the spans to read.
performLookup() does a different thing with the read row: both the transformation to the PK row and when the scan is done the call to batchedInvertedExprEvaluator.
emitRow starts looking similar but unlike joinReader we can't short-circuit the processing for semi-join and anti-join in the PerformLookup() stage -- it needs to happen in emitRow. And emitRow is iterating over a different data-structure (joinedRowIdx) and doing a different transformation on what it retrieves from the row container.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @rytaft)
No, I think this is different. The issues with the field comments can be resolved by being more general (e.g. My biggest concern is that whenever we add fields that apply to all joins, we'll have to make sure to add these to each spec. For example, should this spec also add I think I agree with you that it's bad to overload spec definitions. I wish there was a better way to share common functionality though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, should this spec also add Visibility, LockingStrength or LockingWaitPolicy since it uses a row fetcher? It looks like we already add the same field to multiple specs though so maybe that's fine
Perhaps, but I can't come up with a real-world example where we would use the inverted index to narrow the rows to mutate, so we should not add the ability to lock the inverted index rows or add ScanVisibility
. We could revisit this if there is a compelling example from a real user, but I suspect that will be far in the future or never.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @rytaft)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @jordanlewis)
TFTR! |
bors r+ |
Build succeeded |
two tables using an inverted index on one table
This is extracted from #48019 which has more context, including
the implementation.
Release note: None