[SPARK-3861][SQL] Avoid rebuilding hash tables for broadcast joins on each partition #2727
Conversation
BroadcastHashJoin builds a new hash table for each partition. We can build it once per node and reuse the hash table.
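The gist of the change, as a minimal RDD-level sketch. This is not the PR's actual code: it uses a plain Scala Map where Spark SQL uses its internal HashedRelation, but it shows the pattern of building the table once on the driver and broadcasting it, instead of rebuilding it inside every partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))

    // Build side: small enough to ship to every node.
    val buildSide = Seq(1 -> "a", 2 -> "b")

    // Build the hash table ONCE on the driver and broadcast it, instead of
    // rebuilding it from the broadcast rows inside every partition.
    val broadcastTable = sc.broadcast(buildSide.toMap)

    val streamed = sc.parallelize(Seq(1, 2, 2, 3), numSlices = 4)
    val joined = streamed.mapPartitions { iter =>
      val table = broadcastTable.value // fetched once per node, reused by all its partitions
      iter.flatMap(k => table.get(k).map(v => (k, v)))
    }

    joined.collect().foreach(println) // (1,a), (2,b), (2,b)
    sc.stop()
  }
}
```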
…sh-1

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoin.scala
Test FAILed.
QA tests have started for PR 2727 at commit
QA tests have finished for PR 2727 at commit
QA tests have started for PR 2727 at commit
```scala
override def get(key: Row) = {
  val v = hashTable.get(key)
  if (v eq null) null else CompactBuffer(v)
}
```
Will that cause too many CompactBuffer objects to be created if there are many duplicated records on the stream side, each with a single match on the build side? Or does GeneralHashedRelation perform well enough?
We will have a new operator that specializes in unique-key joins.
Sorry, I mean that for each row on the stream side, a CompactBuffer instance will be created whenever it finds a matching row on the build side; that is probably too heavy.
Yea. What I meant was that we will add a new operator that specializes in unique-key joins, and that operator would just call getValue, bypassing the creation of a CompactBuffer.
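A hypothetical sketch of what such a specialized relation could look like, using a plain java.util.HashMap in place of Spark's internal types; the class and method shapes here are illustrative, not the actual API:

```scala
import java.util.{HashMap => JHashMap}

// Illustrative only: a relation specialized for unique keys. A unique-key
// join operator would call getValue on the fast path and never allocate a
// per-probe CompactBuffer.
class UniqueKeyRelationSketch[K, V <: AnyRef](table: JHashMap[K, V]) {
  // Fast path: at most one match per key, returned directly (null if absent).
  def getValue(key: K): V = table.get(key)

  // General interface: wraps the single match in a buffer, which is exactly
  // the per-row allocation being discussed above.
  def get(key: K): Seq[V] = {
    val v = table.get(key)
    if (v == null) null else Seq(v)
  }
}
```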
Even so, can't we reuse the same CompactBuffer? Also, should the semantics be to return null or an empty buffer?
Should be null, since that's what a normal hashmap would return, no?
This isn't really a normal hashmap, it's for joins, and an empty CompactBuffer seems like a pretty clear way to indicate that no matches were found. Then you don't have to special-case null on the other side; you just join with whatever rows are returned.
Though I guess that doesn't work great with your getValue idea below...
QA tests have finished for PR 2727 at commit
Test PASSed.
OK, I updated the code to reuse the CompactBuffer.
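Roughly, the reuse variant looks like this sketch (illustrative names, with ArrayBuffer standing in for CompactBuffer): one buffer owned by the relation is cleared and refilled on each probe, so probing allocates nothing, but the result is only valid until the next call:

```scala
import java.util.{HashMap => JHashMap}
import scala.collection.mutable.ArrayBuffer

// Sketch of buffer reuse: a single buffer is cleared and refilled per probe.
class ReusingRelationSketch[K, V <: AnyRef](table: JHashMap[K, V]) {
  private val reused = new ArrayBuffer[V](1)

  def get(key: K): Seq[V] = {
    val v = table.get(key)
    if (v == null) null
    else {
      reused.clear()
      reused += v
      reused // WARNING: contents are overwritten by the next probe
    }
  }
}
```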
QA tests have started for PR 2727 at commit
QA tests have finished for PR 2727 at commit
Test FAILed.
This reverts commit 97626a1.
Test FAILed.
QA tests have started for PR 2727 at commit
QA tests have finished for PR 2727 at commit
I reverted the CompactBuffer reuse because it is not safe to do that with JoinedRow.
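A miniature demonstration of why, with illustrative types rather than Spark's: a JoinedRow-style result holds a reference to the matched row rather than copying it, and join output is consumed lazily, so refilling one shared buffer on the next probe silently changes results that were already emitted:

```scala
import scala.collection.mutable.ArrayBuffer

object UnsafeReuseDemo {
  // Stands in for JoinedRow: keeps a reference to the match buffer, no copy.
  final class JoinedPair(val key: Int, val matched: ArrayBuffer[String]) {
    override def toString = s"($key, ${matched.mkString(",")})"
  }

  private val shared = new ArrayBuffer[String](1)

  // Every probe clears and refills the same shared buffer.
  def probe(key: Int): ArrayBuffer[String] = {
    shared.clear()
    shared += s"row-for-$key"
    shared
  }

  def main(args: Array[String]): Unit = {
    val out = List(1, 2).map(k => new JoinedPair(k, probe(k)))
    // Both pairs now print "row-for-2": probing for key 2 clobbered the
    // buffer that the first JoinedPair still references.
    out.foreach(println)
  }
}
```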
@marmbrus ready to merge?