Bloom filter Join Step I: create benchmark #11933

Closed
wants to merge 2 commits into from

Lordworms
Contributor

Which issue does this PR close?

part of #7955

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 11, 2024
Contributor

@alamb alamb left a comment

Thanks @Lordworms -- I realize I am very behind on reviews in DataFusion.

My first question about these benchmarks is whether they are measuring the right thing (namely, whether they are dominated by join time). Have you had a chance to run any profiling (flamegraphs, etc.) to confirm these benchmarks are actually join-dominated?

datafusion/core/benches/bloom_filter_join.rs (outdated, resolved)
@alamb alamb marked this pull request as draft August 20, 2024 17:47
@alamb
Contributor

alamb commented Aug 20, 2024

Marking as draft, as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look.

@Lordworms
Contributor Author

For TPCH query 17, when we create 1,000,000 rows each for the lineitem and part tables, the time spent on the join is 50% (the other 80% of the time is spent on creating Parquet files).
Screenshot 2024-08-25 at 1 37 15 PM

@Lordworms
Contributor Author

For the second case, 95% of the time is spent on the join.
image

@Lordworms
Contributor Author

I think it is worth trying to implement join predicate pushdown.

@Lordworms Lordworms marked this pull request as ready for review August 25, 2024 22:06
@alamb alamb marked this pull request as draft October 18, 2024 20:28
@alamb
Contributor

alamb commented Oct 18, 2024

I suggest we revive this PR as it seems to have gotten lost / not reviewed 😢

@Lordworms
Contributor Author

> I suggest we revive this PR as it seems to have gotten lost / not reviewed 😢

I am still working on the actual join_pushdown implementation; I'll push a complete PR once it is done.

@Lordworms
Contributor Author

> I suggest we revive this PR as it seems to have gotten lost / not reviewed 😢

I am still working on the actual "hash_join build side statistic pushdown" implementation; I'll push a complete PR once it is done.

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Dec 18, 2024
@github-actions github-actions bot closed this Dec 26, 2024