-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4987] [SQL] parquet timestamp type support #3820
Conversation
Test build #24855 has started for PR 3820 at commit
|
Test build #24855 has finished for PR 3820 at commit
|
Test PASSed. |
Test build #24884 has started for PR 3820 at commit
|
@@ -84,7 +86,8 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging { | |||
// TODO: Why it can be null? | |||
if (schema == null) { | |||
log.debug("falling back to Parquet read schema") | |||
schema = ParquetTypesConverter.convertToAttributes(parquetSchema, false) | |||
schema = ParquetTypesConverter.convertToAttributes( | |||
parquetSchema, new SQLContext(new SparkContext)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing used here inside this SQLContext is the isParquetBinaryAsString
and isParquetINT96AsTimestamp
. I'll add a comment here if necessary, to point this out clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think its safe to instantiate a SparkContext here as thats a pretty expensive operations and will throw exceptions if there is more than one in a single JVM. We can try to refactor this in the future, but I'd just pass two options here (using named parameters for booleans).
Test build #24884 has finished for PR 3820 at commit
|
Test FAILed. |
retest this please. |
Test build #24889 has started for PR 3820 at commit
|
Test build #24889 has finished for PR 3820 at commit
|
Test PASSed. |
dc6eaba
to
44d3ab1
Compare
Test build #25027 has started for PR 3820 at commit
|
Test build #25027 has finished for PR 3820 at commit
|
Test PASSed. |
* When set to true, we always treat INT96Values in Parquet files as timestamp. | ||
*/ | ||
private[spark] def isParquetINT96AsTimestamp: Boolean = | ||
getConf(PARQUET_INT96_AS_TIMESTAMP, "false").toBoolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really use INT96 for anything else (and I don't think other systems do either?) so maybe this should be true by default?
Thanks for doing this, I've been getting a ton of requests for this feature! Can you also add this new config option to the sql programming guide? |
Test build #25094 has started for PR 3820 at commit
|
Oh sorry, I just checked Impala's configuration and I think it is not what it is here. I'll change my code to conform to that. |
Test build #25094 has finished for PR 3820 at commit
|
Test FAILed. |
Test build #25161 has started for PR 3820 at commit
|
Test build #25161 has finished for PR 3820 at commit
|
Test PASSed. |
<groupId>org.jodd</groupId> | ||
<artifactId>jodd-core</artifactId> | ||
<version>${jodd.version}</version> | ||
</dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty hesitant to add a dependency here as they are very high cost for a project as big as Spark. Is there any way to do this without adding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also convert to/from Julian by ourselves... I'll draft it,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing it by hand may be dangerous, because of leap seconds [https://en.wikipedia.org/wiki/Leap_second], as the specific date could be inaccurate.
And by check the pom[http://repo1.maven.org/maven2/org/jodd/jodd-core/3.5.2/jodd-core-3.5.2.pom] of jodd-core, there's no additional dependence, so the influence is comparatively small. Use jodd to covert also make everything consistent with hive, so we can be 100% compatible with those data generated by hive. So I'd prefer keep this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, you are right that its a bad idea to do this by hand. Are there any dependencies that Spark SQL already has that could be used instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I talked to @pwendell and we think it would be better to use Joda time if possible since spark already depends on that library in other subprojects. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have read the documents of joda-time. The toJulianDayNumber API only valid since 2.2, but what we use in spark is 2.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI most systems don't support leap seconds. I'm not sure why we'd want to support them here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mainly aims to make sure we are compatible with hive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Hive support leap seconds? I looked into the implementation of jodd -- I don't think it supports that when doing date timestamp conversion. I could be wrong though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to do this conversion ourselves, I think it is fine...
Test build #25211 has started for PR 3820 at commit
|
The NanoTime in Hive-0.14 is a little bit different from NanoTime in parquet-examples, and here are some related discussions. So I just rewrite |
Test build #25211 has finished for PR 3820 at commit
|
I just rebased my code and upgrade jodd to 3.6.3. |
Test build #26568 has finished for PR 3820 at commit
|
Test FAILed. |
I tried the newly uploaded parquet data in https://issues.apache.org/jira/browse/SPARK-4768 (I set my timezone to UTC), for one line, I got
But, the data was generated by
Can you take a look? |
5152f2a
to
b1e2a0d
Compare
@yhuai Sorry, I got a misunderstanding on the setNanos API, it works OK now. |
Test build #26583 has started for PR 3820 at commit
|
Test build #26583 has finished for PR 3820 at commit
|
Test FAILed. |
Jenkins, test this please |
Test build #26597 has started for PR 3820 at commit
|
Test build #26597 has finished for PR 3820 at commit
|
Test FAILed. |
Jenkins, test this please |
Test build #26622 has started for PR 3820 at commit
|
Test build #26622 has finished for PR 3820 at commit
|
Test PASSed. |
Author: Daoyuan Wang <[email protected]> Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits: b1e2a0d [Daoyuan Wang] fix for nanos 4dadef1 [Daoyuan Wang] fix wrong read 93f438d [Daoyuan Wang] parquet timestamp support (cherry picked from commit 0c20ce6) Signed-off-by: Michael Armbrust <[email protected]>
Thanks! Merged to master. |
This PR might have some issues with apache#3732 , and this would have merge conflicts with apache#3820 so the review can be delayed till that 2 were merged. Author: Daoyuan Wang <[email protected]> Closes apache#3822 from adrian-wang/parquetdate and squashes the following commits: 2c5d54d [Daoyuan Wang] add a test case faef887 [Daoyuan Wang] parquet support for primitive date 97e9080 [Daoyuan Wang] parquet support for date type
This PR might have some issues with #3732 , and this would have merge conflicts with #3820 so the review can be delayed till that 2 were merged. Author: Daoyuan Wang <[email protected]> Closes #3822 from adrian-wang/parquetdate and squashes the following commits: 2c5d54d [Daoyuan Wang] add a test case faef887 [Daoyuan Wang] parquet support for primitive date 97e9080 [Daoyuan Wang] parquet support for date type (cherry picked from commit 4659468) Signed-off-by: Cheng Lian <[email protected]>
No description provided.