Implement a low-level HASH-JOIN dataset #622

Open
jjeffcaii opened this issue Feb 16, 2023 · 1 comment
@jjeffcaii
Contributor

jjeffcaii commented Feb 16, 2023

The HASH-JOIN dataset API could look similar to the code below:

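// joinColumns is assumed to be a list of column names; Option stands in
// for the "other options" and is a placeholder, not an existing type.
func HashJoin(left, right Dataset, joinColumns []string, opts ...Option) Dataset {
    //  ...
}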

The HASH-JOIN should contain two phases:

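  1. build: construct a hash chunk from the left dataset. A chunk looks like a hash map: key=hash(values of join_columns), value=rows that share that key
  2. probe: for each row in the right dataset, compute the hash key, then compare the row one by one against the rows stored under that key, because different join values can share the same hash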
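
The two phases above could be sketched roughly as follows. This is only an illustration under assumptions, not the project's actual API: Row, Pair, hashKey, equalOn and hashJoin are hypothetical names, rows are modeled as simple string maps, and FNV-1a is picked arbitrarily as the hash function.

```go
package main

import "hash/fnv"

// Row is a hypothetical row type for this sketch: column name -> value.
type Row map[string]string

// Pair is one joined result: a left row together with a matching right row.
type Pair struct{ Left, Right Row }

// hashKey hashes the values of the join columns, in order.
func hashKey(r Row, cols []string) uint64 {
	h := fnv.New64a()
	for _, c := range cols {
		h.Write([]byte(r[c]))
		h.Write([]byte{0}) // separator, so ("ab","c") and ("a","bc") differ
	}
	return h.Sum64()
}

// equalOn reports whether l and r agree on every pair of join columns.
// It is needed because two different values may produce the same hash.
func equalOn(l, r Row, leftCols, rightCols []string) bool {
	for i := range leftCols {
		if l[leftCols[i]] != r[rightCols[i]] {
			return false
		}
	}
	return true
}

// hashJoin performs an inner equi-join of left and right.
func hashJoin(left, right []Row, leftCols, rightCols []string) []Pair {
	// Phase 1 (build): group the left rows by the hash of their join values.
	chunk := make(map[uint64][]Row, len(left))
	for _, l := range left {
		k := hashKey(l, leftCols)
		chunk[k] = append(chunk[k], l)
	}

	// Phase 2 (probe): look up each right row and verify candidates one by one.
	var out []Pair
	for _, r := range right {
		for _, l := range chunk[hashKey(r, rightCols)] {
			if equalOn(l, r, leftCols, rightCols) {
				out = append(out, Pair{Left: l, Right: r})
			}
		}
	}
	return out
}
```

A real implementation would probably stream the right dataset instead of holding it in a slice, and would usually build from the smaller side to keep the hash chunk small.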

some docs:

A tiny example: we have two datasets, and we want to execute SQL like select foo.id, bar.id from foo join bar on foo.x = bar.y

--- Dataset foo

id x
a 5
b 6
c 7

--- Dataset bar

id y
j 5
k 8

  1. build from foo: create a hash map with the hash method x -> x%2, which gives a map like { 0: [b-6], 1: [a-5,c-7] }
  2. probe from bar: j-5 checks chunk[key=1] and k-8 checks chunk[key=0]
  3. compare row by row: j-5 matches a-5 but not c-7, and k-8 does not match b-6, so the only joined pair is a-5 and j-5, bingo!
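
The tiny example can be reproduced with a few lines of Go. This is a minimal sketch, not a proposed implementation: the row type and the bucket, build and probe helpers are hypothetical names introduced only for illustration, and the deliberately weak x%2 hash is kept so that the collision between a-5 and c-7 stays visible.

```go
package main

// row is a hypothetical record for this sketch: an id plus its join value.
type row struct {
	id  string
	val int
}

// bucket is the toy hash method from the example: x -> x % 2.
func bucket(x int) int { return x % 2 }

// build groups the rows of the build side (foo) by bucket.
func build(rows []row) map[int][]row {
	chunk := make(map[int][]row)
	for _, r := range rows {
		k := bucket(r.val)
		chunk[k] = append(chunk[k], r)
	}
	return chunk
}

// probe looks up each probe-side row (bar) in its bucket and compares the
// actual join values, returning {build id, probe id} for every match.
func probe(chunk map[int][]row, rows []row) [][2]string {
	var out [][2]string
	for _, r := range rows {
		for _, cand := range chunk[bucket(r.val)] {
			if cand.val == r.val { // same bucket is not enough; values must be equal
				out = append(out, [2]string{cand.id, r.id})
			}
		}
	}
	return out
}
```

Building from foo = {a-5, b-6, c-7} and probing with bar = {j-5, k-8} yields the single pair {a, j}: c-7 shares bucket 1 with j-5 but is rejected by the value comparison, and k-8 finds only b-6 in bucket 0.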
jjeffcaii added the enhancement (New feature or request) label on Feb 16, 2023
jjeffcaii added this to the 0.2.0 milestone on Feb 16, 2023
@wang1309
Contributor

pls assign to me

dongzl modified the milestones: 0.2.0, 0.3.0 on Oct 10, 2023