[SPARK-21649][SQL] Support writing data into hive bucket table. #18866
Conversation
I added the unit test referring to (https://github.com/apache/hive/blob/branch-1/ql/src/java/org/apache/hadoop/hive/ql/optimizer/AbstractBucketJoinProc.java#L393).
- def partitionIdExpression: Expression = Pmod(new Murmur3Hash(expressions), Literal(numPartitions))
+ def partitionIdExpression(useHiveHash: Boolean = false): Expression =
+   if (useHiveHash) {
+     Pmod(new HiveHash(expressions), Literal(numPartitions))
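For readers unfamiliar with Pmod: it is a positive modulus, so negative hash codes still map into [0, numPartitions). A minimal plain-Scala sketch of that bucket-id computation (bucketId is a hypothetical helper, not Spark code):

```scala
// Hypothetical stand-in for Pmod(hash(row), numBuckets): a positive modulus
// keeps negative hash codes inside the valid bucket range [0, numBuckets).
def bucketId(hashValue: Int, numBuckets: Int): Int = {
  val mod = hashValue % numBuckets
  if (mod < 0) mod + numBuckets else mod
}

// e.g. bucketId(-7, 4) == 1, whereas -7 % 4 == -3 in plain Scala
```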
I saw that HiveHash simulates Hive's hashing function from Hive v1.2.1. Is there any compatibility issue for Hive versions before 1.2.1?
@viirya Thanks a lot for the comment! I compared the code between v0.13 (https://github.com/apache/hive/blob/branch-0.13/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L496) and v1.2.1 (https://github.com/apache/hive/blob/branch-1/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L557). In my understanding, there's no compatibility issue: v1.2.1 adds hashcode support for more types (INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME), and for the existing types they are compatible with each other.
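To illustrate the compatibility claim: for the types that predate v1.2.1, the hash codes are unchanged between the two branches. Simplified Scala sketches of two of them (illustrative only; the real logic lives in ObjectInspectorUtils.hashCode):

```scala
// Hive-style hash codes for two types that exist in both v0.13 and v1.2.1.
def hiveIntHash(v: Int): Int = v  // an int hashes to itself
def hiveStringHash(s: String): Int =
  s.getBytes("UTF-8").foldLeft(0)((h, b) => 31 * h + b)  // 31-based hash over UTF-8 bytes
```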
Ok. Let's add a comment for the parameter useHiveHash.
Test build #80322 has finished for PR 18866 at commit
Jenkins, retest this please.
@@ -103,7 +103,7 @@ object FileFormatWriter extends Logging {
      outputSpec: OutputSpec,
      hadoopConf: Configuration,
      partitionColumns: Seq[Attribute],
-     bucketSpec: Option[BucketSpec],
+     bucketSpecAndHash: Option[(BucketSpec, Boolean)],
It is hard to know the meaning of this Option parameter. Let's add a @param for it?
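For example, the Scaladoc could spell out both elements of the tuple. A sketch of possible wording (not the final doc):

```scala
/**
 * @param bucketSpecAndHash the bucket specification, paired with a flag that
 *                          is true when Hive's hash function should be used
 *                          to compute bucket ids (Hive bucket tables) and
 *                          false for Spark's native Murmur3 hashing
 */
```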
.load(bucketFiles(bucket).getAbsolutePath)
.collect().map(_.getString(0).split("\t")(0).toInt)
.zip(bucket to (100, 4)).foreach { case (a, b) =>
  assert(a === b)
This looks obscure. I need to verify it by calculating HiveHash values for 0 until 100. Maybe we should compute the actual Hive hash value here, instead of bucket to (100, 4).
Yes, that would be better. I've updated it, please take another look :)
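A sketch of the reviewer's suggestion, computing the expected bucket for each key directly from the HiveHash expression instead of the bucket to (100, 4) pattern (expectedHiveBucket is a hypothetical helper; it assumes Spark's catalyst expressions are on the classpath):

```scala
import org.apache.spark.sql.catalyst.expressions.{HiveHash, Literal, Pmod}

// Hypothetical helper: the expected bucket for an int key under Hive hashing,
// mirroring the Pmod(HiveHash(...), numBuckets) expression on the write path.
def expectedHiveBucket(key: Int, numBuckets: Int): Int =
  Pmod(HiveHash(Seq(Literal(key))), Literal(numBuckets)).eval().asInstanceOf[Int]

// The expected keys in bucket file `bucket`, for keys 0 until 100 and 4 buckets:
// (0 until 100).filter(expectedHiveBucket(_, 4) == bucket)
```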
Test build #80341 has finished for PR 18866 at commit
@viirya
Test build #80348 has finished for PR 18866 at commit
Test build #80349 has finished for PR 18866 at commit
cc @cloud-fan
The hash function is not the only issue. One important difference: Hive shuffles before writing and makes sure each bucket has only one file, while Spark doesn't shuffle, so each write task may write a file for every bucket. For more details please refer to https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit#heading=h.ualze2k709kj. Also cc @tejasapatil
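To make the contrast concrete, here is a rough DataFrame-level sketch of the Hive-style behavior (table and column names are made up; this is not what Spark's bucketed write does today):

```scala
import org.apache.spark.sql.functions.col

// Given an existing DataFrame `df` with a column "key" (names are illustrative).
// Shuffling on the bucket column first means each bucket is produced by a single
// task, so at most one file per bucket: the Hive-style guarantee. Without the
// repartition, every one of the N write tasks may emit its own file per bucket.
val numBuckets = 8
df.repartition(numBuckets, col("key"))
  .write
  .bucketBy(numBuckets, "key")
  .saveAsTable("hive_style_bucketed")
```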
Test build #80597 has finished for PR 18866 at commit
Test build #80637 has finished for PR 18866 at commit
Test build #80782 has finished for PR 18866 at commit
Test build #80837 has finished for PR 18866 at commit
Test build #80873 has finished for PR 18866 at commit
In the current change:
@cloud-fan @gatorsmile @jiangxb1987
Does this work with append? Even if you shuffle the data before writing, we may still end up with multiple files for one bucket. Is it possible to generalize this patch to the data source level? The current approach looks very hacky and is far from our expectation that Hive is also a data source.
@cloud-fan
What changes were proposed in this pull request?
Support writing to Hive bucket tables. Spark internally uses Murmur3Hash for partitioning; for compatibility, we can use Hive's hash when writing to a Hive bucket table.
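As a small illustration of why the hash choice matters (a sketch using Spark's catalyst expressions; the specific key and bucket count are arbitrary):

```scala
import org.apache.spark.sql.catalyst.expressions.{HiveHash, Literal, Murmur3Hash, Pmod}

// The same key can land in different buckets under the two hash functions,
// so Hive could not correctly read Spark-written buckets unless Hive's hash
// is used on the write path.
val key = Literal("spark")
val numBuckets = Literal(8)
val sparkBucket = Pmod(new Murmur3Hash(Seq(key)), numBuckets).eval()
val hiveBucket  = Pmod(HiveHash(Seq(key)), numBuckets).eval()
```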
How was this patch tested?
Unit test.