[SPARK-4570][SQL]add BroadcastLeftSemiJoinHash #3442
Can one of the admins verify this patch?
ok to test
Test build #23972 has started for PR 3442 at commit
Test build #23972 has finished for PR 3442 at commit
Test FAILed.
retest this please
Test build #23983 has started for PR 3442 at commit
Test build #23983 has finished for PR 3442 at commit
Test PASSed.
override def execute() = {

  val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
Remove blank line
Thanks for working on this!
case (query, joinClass) => assertJoin(query, joinClass)
}

sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-$tmp""")
-$tmp: typo? And we can just use setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, tmp.toString) here.
Test build #24025 has started for PR 3442 at commit
Test build #24025 has finished for PR 3442 at commit
Test PASSed.
Test build #24122 has started for PR 3442 at commit
Test build #24122 has finished for PR 3442 at commit
Test PASSed.
4fdcfe7 to 3a58191
Test build #24533 has started for PR 3442 at commit
Test build #24533 has finished for PR 3442 at commit
Test FAILed.
3a58191 to f103983
Test build #24753 has started for PR 3442 at commit
Test build #24753 has finished for PR 3442 at commit
Test PASSed.
Thanks, merged to master.
JIRA issue: SPARK-4570
We are planning to create a BroadcastLeftSemiJoinHash operator to implement broadcast join for left semi joins.
In a left semi join:
If the size of the data from the right side is smaller than the user-settable threshold AUTO_BROADCASTJOIN_THRESHOLD, the planner marks it as the broadcast relation and marks the other relation as the stream side. The broadcast table is broadcast to all of the executors involved in the join as an org.apache.spark.broadcast.Broadcast object, and the join uses joins.BroadcastLeftSemiJoinHash. Otherwise it uses joins.LeftSemiJoinHash.
The benchmark suggests that this optimization makes left semi joins about 4x faster.
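The planner rule and the join semantics described above can be sketched roughly as follows. This is a simplified model on plain Scala collections, not the code in this patch: `Relation`, `chooseStrategy`, and `leftSemiJoin` are hypothetical names, the `sizeInBytes` field stands in for whatever size estimate the planner uses, and the exact boundary handling of the threshold comparison is an assumption.

```scala
// Simplified, hypothetical model of the strategy choice described above.
// None of these names are Spark APIs; they only mirror the PR description.
object SemiJoinSketch {
  // Stand-in for a relation: its rows plus an estimated size in bytes.
  final case class Relation(rows: Seq[(Int, String)], sizeInBytes: Long)

  sealed trait JoinStrategy
  case object BroadcastLeftSemiJoinHash extends JoinStrategy
  case object LeftSemiJoinHash extends JoinStrategy

  // Broadcast the right side only if it is smaller than the threshold
  // (strict comparison is an assumption taken from the wording above).
  def chooseStrategy(right: Relation, threshold: Long): JoinStrategy =
    if (right.sizeInBytes < threshold) BroadcastLeftSemiJoinHash
    else LeftSemiJoinHash

  // Left semi join: keep each streamed (left) row whose key appears on the
  // build (right) side. Each left row is emitted at most once, and no
  // right-side columns are returned.
  def leftSemiJoin(
      streamed: Seq[(Int, String)],
      build: Seq[(Int, String)]): Seq[(Int, String)] = {
    // In the broadcast variant, this key set is what every executor would hold.
    val buildKeys: Set[Int] = build.map(_._1).toSet
    streamed.filter { case (key, _) => buildKeys.contains(key) }
  }
}
```

Because a semi join only needs to know whether a key exists on the right side, the broadcast side can be reduced to a set of keys, which is part of why broadcasting a small right relation is attractive here.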
The micro benchmark loads data1/kv3.txt into a normal Hive table.
Benchmark code: