Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

execinfrapb: add InvertedJoinerSpec for joins between #48453

Merged
merged 1 commit into from
May 8, 2020

Conversation

sumeerbhola
Copy link
Collaborator

two tables using an inverted index on one table

This is extracted from #48019 which has more context, including
the implementation.

Release note: None

two tables using an inverted index on one table

This is extracted from cockroachdb#48019 which has more context, including
the implementation.

Release note: None
@sumeerbhola sumeerbhola requested review from jordanlewis, rytaft, asubiotto and a team May 5, 2020 19:27
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@asubiotto
Copy link
Contributor

Can we reuse the JoinReaderSpec instead of creating a new spec? I think the only difference is the inverted expression, which would signal to execution that an inverted joiner should be used.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing something about the CockroachDB context, so I am very open to being convinced otherwise, but currently I'm not seeing the benefit:

  • there is a long comment before InvertedJoinSpec that is important for explaining what this is about. That will need to be mashed together with the JoinReaderSpec comment which I already find a bit hard to understand.
  • Both have Table, IndexIdx, but note that IndexIdx needs a different comment.
  • InvertedJoinerSpec has a single LookupColumn -- that can be stuffed into the LookupColumns slice, but again a different comment
  • Remaining fields are InvertedExpr, OnExpr and Type: First is unique to invertedJoiner. All 3 need comments very specific to inverted joins.

A mashup of these two will need to have a long comment explaining the different cases and for each field also state the different cases. That would hinder readability. There is a history in proto land of mashing together different things into a single proto and then being left with a bad headache later (there are infamous examples at Google preceding extensions support in proto2), but those were all for transport purposes (and eventual demultiplexing, which requires the same proto type). That would not be an issue here. Also, these similarities between specs are shared among some of the other specs like HashJoinerSpec and MergeJoinerSpec, so I am not sure why we are trying to reuse the same proto here.

Are you concerned that this will ensure that we will never merge invertedJoiner and joinReader? If yes, my current thinking is

  • any merging will need to process the specs and set up the data-structures somewhat differently, so a different spec is not a blocker
  • currently I don't see any benefit from merging the join execution code -- I think it will obscure the code logic, as I mentioned in a detailed comment in [DNM] rowexec, rowcontainer: add an invertedJoiner, for #48019 (which I've copy-pasted below [1] since I think it got lost in the large variety of comments in that PR) and in a comment in rowexec: make lookup joins with no required ordering more efficient #48439 . And if we get serious about inverted indexes, especially the on-disk representation, the logic will further diverge. Perhaps the issue is that the "inverted join" was initially incorrectly characterized as a "lookup join" -- it isn't really a lookup join.

Thoughts?

[1]
Stylistically it follows a similar pattern. Next() is very similar though that seems hardly to justify refactoring since it is a trivial for loop. The differences are unfortunately buried in each method:

readInput() needs to do a different thing with each input row. And then when the batch is read it needs to do a different thing for preparing the spans to read.
performLookup() does a different thing with the read row: both the transformation to the PK row and when the scan is done the call to batchedInvertedExprEvaluator.
emitRow starts looking similar but unlike joinReader we can't short-circuit the processing for semi-join and anti-join in the PerformLookup() stage -- it needs to happen in emitRow. And emitRow is iterating over a different data-structure (joinedRowIdx) and doing a different transformation on what it retrieves from the row container.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @rytaft)

@asubiotto
Copy link
Contributor

Are you concerned that this will ensure that we will never merge invertedJoiner and joinReader?

No, I think this is different.

The issues with the field comments can be resolved by being more general (e.g. index_idx is the id of the index we're looking up, regardless of the type of join). I agree that the general comment is more problematic so maybe it's a good argument that we're overloading a spec definition.

My biggest concern is that whenever we add fields that apply to all joins, we'll have to make sure to add these to each spec. For example, should this spec also add Visibility, LockingStrength or LockingWaitPolicy since it uses a row fetcher? It looks like we already add the same field to multiple specs though so maybe that's fine 🤷 This could also be an orthogonal problem, since you could argue that these fields should probably be encapsulated in a separate protobuf referenced by all specs that need them.

I think I agree with you that it's bad to overload spec definitions. I wish there was a better way to share common functionality though.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, should this spec also add Visibility, LockingStrength or LockingWaitPolicy since it uses a row fetcher? It looks like we already add the same field to multiple specs though so maybe that's fine

Perhaps, but I can't come up with a real-world example where we would use the inverted index to narrow the rows to mutate, so we should not add the ability to lock the inverted index rows or add ScanVisibility. We could revisit this if there is a compelling example from a real user, but I suspect that will be far in the future or never.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jordanlewis, and @rytaft)

Copy link
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @rytaft)

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @jordanlewis)

@sumeerbhola
Copy link
Collaborator Author

TFTR!

@sumeerbhola
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented May 8, 2020

Build succeeded

@craig craig bot merged commit e35af53 into cockroachdb:master May 8, 2020
@sumeerbhola sumeerbhola deleted the inverted_joiner_spec branch May 8, 2020 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants