Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-key joins. #4424

Closed
big-andy-coates opened this issue Feb 3, 2020 · 41 comments
Closed

Support non-key joins. #4424

big-andy-coates opened this issue Feb 3, 2020 · 41 comments
Assignees
Milestone

Comments

@big-andy-coates
Copy link
Contributor

With https://cwiki.apache.org/confluence/display/KAFKA/KIP-213+Support+non-key+joining+in+KTable complete, we should be able to enhance ksqlDB to support non-key joins!

@entechlog
Copy link

@big-andy-coates , @vpapavas Is support for non key table join coming to ksqlDB in 0.8.0 release ? We have a project going into production in second quarter of this year and its heavily dependent on table joins which always dont share same primary key. Without this feature GlobalKTable and custom kafka streams may be our only option (which we were trying to avoid with ksql platform)

@vpapavas
Copy link
Member

vpapavas commented Mar 1, 2020

Hi @entechlog, unfortunately foreign key joins are not coming in 0.8.0. How about creating additional tables that have as primary key the join key you are interested in? Simulate the behavior of secondary indeces basically by creating extra tables by hand that have the schema you need for each join you are interested in,

@entechlog
Copy link

@vpapavas ,
I don't think creating an additional table will always help, Unless I am missing something. Please see below example.

Here I need to join the orders table every time when an order comes in with user table and both has different PK's. I can't make USER_ID as PK in orders table, then it will retain only one order record, Unless the PK's are same ksqlDB is not letting to join the tables.

TBL_ORDERS
ORDER_ID - Primary Key
ITEM_ID
ITEM_DESC
QUANTITY
PRICE
USER_ID

TBL_USER
USER_ID - Primary Key
USER_NAME
USER_EMAIL
USER_PHONE

@PeterLindner
Copy link

PeterLindner commented Mar 1, 2020

@entechlog you have to convert your orders table to a stream partitioned by user ID, then you can perform a stream - table join on user ID, repartition back to order ID and convert back to a table
Note however that changes to the user table would not apply to older orders, if you need that I believe the only way to do this is to use the collect_set udaf to collect all order ids grouped by user ID (yields a table), perform a stream-table join with a user stream to get a message for each user modification and then explode an repartition on the collected order ids to propagate the modifications to the orders, then join that stream with the original orders table and insert into the orders stream and again create a table from it.
I hope you could follow 😉 note that I haven't tried that myself yet, but I will have to do something similar for my use case once I get my data into Kafka, so please tell me if it worked.

@entechlog
Copy link

@PeterLindner, Thanks for the input. We are trying several things and will keep you posted on what we end up doing. The example was just to share the idea but we are dealing with much complex with multiple self joins and self joins based on result from aggregation of all historical data.

@bellemare
Copy link

@big-andy-coates I was actually just about to email you about this, as I see you've done a lot of work on the joins to date. I contributed the PR (with some help from many fine people) for KIP-213 mentioned above, and I've been looking at how to get this into KSQL. I've been looking under the hood of KSQL and I admit that I don't have a firm handle on it yet, and was wondering where this sort of work sits in terms of priority for the Confluent folks working on this product. I'm interested in helping if I can, but I admit that I know next to nothing about KSQL's engine.

Thanks Andy!

@bellemare
Copy link

Bumping the question @big-andy-coates

@big-andy-coates big-andy-coates modified the milestones: 0.11.0, 0.12 Jun 17, 2020
@bhamur
Copy link

bhamur commented Jun 29, 2020

@PeterLindner collect_set has a limitation of just 1000 elements in the array. Did you implement non-key joins with any other work-around?

@PeterLindner
Copy link

@bhamur unfortunately not, for my use case I knew, that I'd have at most 3 records in the set

@nvaksmann1992
Copy link

@entechlog did you find a solution for the join with foreign-key ?

I have the same problem and I was wondering if you found a solution yet ? :)

@big-andy-coates
Copy link
Contributor Author

@big-andy-coates I was actually just about to email you about this, as I see you've done a lot of work on the joins to date. I contributed the PR (with some help from many fine people) for KIP-213 mentioned above, and I've been looking at how to get this into KSQL. I've been looking under the hood of KSQL and I admit that I don't have a firm handle on it yet, and was wondering where this sort of work sits in terms of priority for the Confluent folks working on this product. I'm interested in helping if I can, but I admit that I know next to nothing about KSQL's engine.

Thanks Andy!

Hi @bellemare, sorry for the delay in answering your question!

To be honest, I'm not 100% sure where this comes in terms of priority. I know its not part of the next quarter's roadmap. Beyond that, I'm not sure. @MichaelDrogalis or @derekjn may be able to comment more.

It would be great to have you contribute, if you feel able! First steps would be to write up the design proposal: https://github.com/confluentinc/ksql/blob/master/design-proposals/README.md.

Andy

@MichaelDrogalis
Copy link
Contributor

A KIP/patch for this would be 😍

@bellemare
Copy link

Unfortunately, I don't have the time to devote to this in the next 3 months due to personal obligations. We also aren't using KSql where I work at the moment, so I can't reasonably get cycles there to address it.

@big-andy-coates
Copy link
Contributor Author

Understandable. We appreciate your contribution to Streams, and we will leverage your work by exposing it in ksqlDB within the foreseeable future, just no firm date as yet. What this space!

@bellemare
Copy link

I'll keep my eyes and ears open! That being said, if something changes, this would probably be one of the first things I would work on :)

@rtrive
Copy link

rtrive commented Nov 24, 2020

is there some updates for this feature?

@MichaelDrogalis
Copy link
Contributor

Hey @rtrive, nothing yet. But patches welcome from anyone that wants to work on this. ❤️

@MichaelDrogalis MichaelDrogalis unpinned this issue Dec 1, 2020
@MichaelDrogalis
Copy link
Contributor

@yuranos We're working on this right now. :)

@pragmaticivan
Copy link

Howdy! Is there a place we could track code progress for that issue?

@mjsax
Copy link
Member

mjsax commented Apr 7, 2021

There is no PR yet... Stay tuned. Happy to link PRs to this ticket when available.

@mkoziel2000
Copy link

Glad the one-to-many is getting worked on. this specific issue is keeping us from actually treating ksql as anything other than a toy to play around with. Hoping this gets over the finish line soon...it could offset a whole lot of consumer code we currently are having to develop.

@mjsax
Copy link
Member

mjsax commented Apr 29, 2021

First PR: #7452

@mjsax
Copy link
Member

mjsax commented May 11, 2021

Second PR: #7491

@mjsax mjsax modified the milestones: 0.18.0, 0.19.0 May 12, 2021
@mjsax
Copy link
Member

mjsax commented May 12, 2021

Third PR: #7511
Fourth PR: #7517

@mjsax
Copy link
Member

mjsax commented May 14, 2021

Adding more tests: #7526

@mjsax
Copy link
Member

mjsax commented May 14, 2021

Required refactoring: #7499

@bellemare
Copy link

@mjsax I am very pleased to see your work on this!

@mjsax
Copy link
Member

mjsax commented May 15, 2021

It's about time to add it to kslqDB... But it's not just me, but a team effort together with @vcrfxia and @spena :)

@mjsax
Copy link
Member

mjsax commented May 17, 2021

Adding more tests: #7528
Extend logical planner: #7522

@mjsax
Copy link
Member

mjsax commented May 17, 2021

Logical to Physical plan translation (and trying to get the tests green): #7543

@mjsax
Copy link
Member

mjsax commented May 17, 2021

More tests: #7537

@mjsax
Copy link
Member

mjsax commented May 23, 2021

Some more bug fixes: #7576

@mjsax
Copy link
Member

mjsax commented May 25, 2021

Bug fix: #7547
Test framework fix: #7579
Fixing long standing but expose by FK-join work: #7585

@mjsax
Copy link
Member

mjsax commented May 26, 2021

Add missing test: #7588

@mjsax
Copy link
Member

mjsax commented May 26, 2021

Another bug fix: #7592

@mjsax
Copy link
Member

mjsax commented May 26, 2021

Enable FK-joins: #7591

@mjsax
Copy link
Member

mjsax commented May 26, 2021

Enable more tests: #7593

@mjsax
Copy link
Member

mjsax commented Jun 4, 2021

Updating docs: #7628

@shrugal
Copy link

shrugal commented Jun 4, 2021

Hey @mjsax thanks for posting these updates! Do I read it right that it's actually a N:1 join, so a row from the left table is joined with just one from the right, but one and the same row from the right table can be joined to multiple different rows from the left? For example with the tables users and orders FROM orders JOIN users ON orders.user_id = users.id is possible, but FROM users JOIN orders ON users.id = orders.user_id is not?

@mjsax
Copy link
Member

mjsax commented Jun 4, 2021

That is correct. If you think in DW terms, the left table would be your "fact table" and you can do FK-lookups into the right "dimension table".

It would be possible of course, to enhance ksqlDB to also allow the second query (and I am sure we will do this at some point). It's similar to the missing "RIGHT JOIN" support -- we don't support t1 RIGHT JOIN t2 but you can still do the join as t2 LEFT JOIN t1... We need to enhance ksqlDB to internal "flip" left/right input to support this. It's in our backlog :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests