[SPARK-4987] [SQL] parquet timestamp type support #3820

adrian-wang · 2014-12-29T07:48:50Z

No description provided.

SparkQA · 2014-12-29T07:52:31Z

Test build #24855 has started for PR 3820 at commit d44831a.

This patch merges cleanly.

SparkQA · 2014-12-29T09:05:09Z

Test build #24855 has finished for PR 3820 at commit d44831a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-29T09:05:12Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24855/
Test PASSed.

SparkQA · 2014-12-30T06:17:34Z

Test build #24884 has started for PR 3820 at commit dc6eaba.

This patch merges cleanly.

adrian-wang · 2014-12-30T06:24:51Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala

@@ -84,7 +86,8 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging {
    // TODO: Why it can be null?
    if (schema == null)  {
      log.debug("falling back to Parquet read schema")
-      schema = ParquetTypesConverter.convertToAttributes(parquetSchema, false)
+      schema = ParquetTypesConverter.convertToAttributes(
+        parquetSchema, new SQLContext(new SparkContext))


The only things used from this SQLContext are isParquetBinaryAsString and isParquetINT96AsTimestamp. I'll add a comment here, if necessary, to make that clear.

I don't think it's safe to instantiate a SparkContext here, as that's a pretty expensive operation and will throw exceptions if there is more than one in a single JVM. We can try to refactor this in the future, but I'd just pass the two options here (using named parameters for the booleans).
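
The suggestion above, passing the two options as named boolean parameters instead of building a SQLContext, could look roughly like the following. This is a minimal sketch and not the code from the PR: `SchemaConversionSketch`, `toSqlType`, and the tiny type model are hypothetical stand-ins for `ParquetTypesConverter.convertToAttributes` and the real Parquet and Catalyst types. Rejecting INT96 when the flag is off is a choice made for the sketch only.

```scala
object SchemaConversionSketch {
  // Hypothetical, simplified stand-ins for the SQL data types.
  sealed trait SqlType
  case object StringType extends SqlType
  case object BinaryType extends SqlType
  case object TimestampType extends SqlType

  // Hypothetical, simplified stand-ins for Parquet primitive types.
  sealed trait ParquetPrimitive
  case object BINARY extends ParquetPrimitive
  case object INT96 extends ParquetPrimitive

  // The two options are plain booleans, so no SQLContext or SparkContext
  // has to be created just to read them.
  def toSqlType(
      parquetType: ParquetPrimitive,
      isBinaryAsString: Boolean,
      isInt96AsTimestamp: Boolean): SqlType = parquetType match {
    case BINARY => if (isBinaryAsString) StringType else BinaryType
    case INT96 =>
      if (isInt96AsTimestamp) TimestampType
      else throw new IllegalArgumentException("INT96 is only handled as a timestamp in this sketch")
  }

  // Named arguments keep the call site readable despite two booleans.
  val example: SqlType =
    toSqlType(INT96, isBinaryAsString = false, isInt96AsTimestamp = true)
}
```

With named arguments the caller cannot silently swap the two flags, which is the usual reason for preferring them over positional booleans.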

SparkQA · 2014-12-30T07:07:44Z

Test build #24884 has finished for PR 3820 at commit dc6eaba.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-30T07:07:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24884/
Test FAILed.

adrian-wang · 2014-12-30T10:45:27Z

retest this please.

SparkQA · 2014-12-30T10:47:38Z

Test build #24889 has started for PR 3820 at commit dc6eaba.

This patch merges cleanly.

SparkQA · 2014-12-30T12:01:06Z

Test build #24889 has finished for PR 3820 at commit dc6eaba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-30T12:01:09Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24889/
Test PASSed.

SparkQA · 2015-01-04T08:02:35Z

Test build #25027 has started for PR 3820 at commit 44d3ab1.

This patch merges cleanly.

SparkQA · 2015-01-04T09:15:26Z

Test build #25027 has finished for PR 3820 at commit 44d3ab1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-04T09:15:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25027/
Test PASSed.

marmbrus · 2015-01-06T07:23:09Z

sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala

+   * When set to true, we always treat INT96Values in Parquet files as timestamp.
+   */
+  private[spark] def isParquetINT96AsTimestamp: Boolean =
+    getConf(PARQUET_INT96_AS_TIMESTAMP, "false").toBoolean


We don't really use INT96 for anything else (and I don't think other systems do either?) so maybe this should be true by default?
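
For illustration, flipping the default as proposed above only changes the fallback value passed to `getConf`. The sketch below assumes a map-backed configuration; the `Conf` class and the key string are assumptions made for the example (the thread only shows the constant name `PARQUET_INT96_AS_TIMESTAMP`), not the actual `SQLConf` implementation.

```scala
object Int96ConfSketch {
  // Assumed key string, for illustration only.
  val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"

  // Hypothetical map-backed stand-in for SQLConf.
  final class Conf(settings: Map[String, String]) {
    def getConf(key: String, default: String): String =
      settings.getOrElse(key, default)

    // Default is "true" here, as suggested in the review comment.
    def isParquetINT96AsTimestamp: Boolean =
      getConf(PARQUET_INT96_AS_TIMESTAMP, "true").toBoolean
  }
}
```

An unset key then yields `true`, while an explicit `"false"` still turns the behaviour off.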

marmbrus · 2015-01-06T07:25:23Z

Thanks for doing this, I've been getting a ton of requests for this feature!

Can you also add this new config option to the SQL programming guide?

SparkQA · 2015-01-06T08:42:38Z

Test build #25094 has started for PR 3820 at commit d4dbc8a.

This patch merges cleanly.

adrian-wang · 2015-01-06T08:44:02Z

Oh sorry, I just checked Impala's configuration, and I think it does not match what I have here. I'll change my code to conform to Impala's.

SparkQA · 2015-01-06T10:05:53Z

Test build #25094 has finished for PR 3820 at commit d4dbc8a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-06T10:05:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25094/
Test FAILed.

SparkQA · 2015-01-07T10:27:33Z

Test build #25161 has started for PR 3820 at commit 5cb8f97.

This patch merges cleanly.

SparkQA · 2015-01-07T12:05:53Z

Test build #25161 has finished for PR 3820 at commit 5cb8f97.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-07T12:05:57Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25161/
Test PASSed.

marmbrus · 2015-01-07T23:14:49Z

sql/core/pom.xml

+      <groupId>org.jodd</groupId>
+      <artifactId>jodd-core</artifactId>
+      <version>${jodd.version}</version>
+    </dependency>


I'm pretty hesitant to add a dependency here, as dependencies are very costly for a project as big as Spark. Is there any way to do this without adding it?

We can also convert to/from Julian day ourselves... I'll draft it.

Writing it by hand may be dangerous because of leap seconds [https://en.wikipedia.org/wiki/Leap_second], since the computed date could be inaccurate.
Also, judging by the pom [http://repo1.maven.org/maven2/org/jodd/jodd-core/3.5.2/jodd-core-3.5.2.pom] of jodd-core, it has no additional dependencies, so the impact is comparatively small. Using jodd for the conversion also keeps everything consistent with Hive, so we can be 100% compatible with data generated by Hive. So I'd prefer to keep this.

Okay, you are right that it's a bad idea to do this by hand. Are there any dependencies that Spark SQL already has that could be used instead?

Okay, I talked to @pwendell and we think it would be better to use Joda-Time if possible, since Spark already depends on that library in other subprojects. What do you think?

I have read the Joda-Time documentation. The toJulianDayNumber API is only available since 2.2, but Spark uses 2.1.

FYI most systems don't support leap seconds. I'm not sure why we'd want to support them here...

This mainly aims to make sure we are compatible with Hive.

Does Hive support leap seconds? I looked into the implementation of jodd -- I don't think it supports that when doing date timestamp conversion. I could be wrong though.

If we are going to do this conversion ourselves, I think it is fine...
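
For reference, a hand-written conversion of the kind discussed above can be sketched with nothing but the JDK. The sketch assumes a fixed 86,400-second day (no leap seconds, which matches how `java.sql.Timestamp` counts time), treats all values as UTC, and uses 2440588 as the Julian day number of 1970-01-01. The object and method names are hypothetical; this is not the code that was merged.

```scala
import java.sql.Timestamp

object JulianDaySketch {
  // Julian day number of 1970-01-01, the Unix epoch.
  val JulianDayOfEpoch: Long = 2440588L
  val SecondsPerDay: Long = 86400L
  val NanosPerSecond: Long = 1000000000L

  /** Converts (Julian day, nanoseconds within that day) to a Timestamp, assuming UTC. */
  def toTimestamp(julianDay: Int, nanosOfDay: Long): Timestamp = {
    val seconds =
      (julianDay - JulianDayOfEpoch) * SecondsPerDay + nanosOfDay / NanosPerSecond
    val ts = new Timestamp(seconds * 1000L)
    // setNanos takes the whole fractional second, not just the sub-millisecond part.
    ts.setNanos((nanosOfDay % NanosPerSecond).toInt)
    ts
  }

  /** Inverse of toTimestamp: returns (Julian day, nanoseconds within that day). */
  def fromTimestamp(ts: Timestamp): (Int, Long) = {
    // floorDiv/floorMod keep the result correct for instants before 1970.
    val seconds = java.lang.Math.floorDiv(ts.getTime, 1000L)
    val day = java.lang.Math.floorDiv(seconds, SecondsPerDay)
    val secondOfDay = java.lang.Math.floorMod(seconds, SecondsPerDay)
    ((day + JulianDayOfEpoch).toInt, secondOfDay * NanosPerSecond + ts.getNanos)
  }
}
```

The sketch sidesteps the time zone question entirely. Whether stored values should be interpreted as UTC or local time is a separate compatibility decision that this thread does not settle.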

SparkQA · 2015-01-08T08:42:34Z

Test build #25211 has started for PR 3820 at commit 8526a33.

This patch merges cleanly.

adrian-wang · 2015-01-08T08:56:27Z

The NanoTime in Hive 0.14 is a little bit different from the NanoTime in parquet-examples, and here are some related discussions. So I just rewrote Hive's NanoTime in Scala instead of using the NanoTime from parquet-examples.
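
The thread does not spell out how the two NanoTime classes differ. As background only, the sketch below shows the 12-byte layout commonly described for INT96 timestamps: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian. Treat the layout as an assumption to verify against the Hive source; `NanoTimeSketch` is a hypothetical name, not the class added by this PR.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical value class mirroring the two fields of an INT96 timestamp.
final case class NanoTimeSketch(julianDay: Int, timeOfDayNanos: Long) {
  /** Encodes as 12 bytes: nanos of day (8 bytes) then Julian day (4 bytes), little-endian. */
  def toBytes: Array[Byte] = {
    val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
    buf.putLong(timeOfDayNanos)
    buf.putInt(julianDay)
    buf.array()
  }
}

object NanoTimeSketch {
  /** Decodes the 12-byte layout produced by toBytes. */
  def fromBytes(bytes: Array[Byte]): NanoTimeSketch = {
    require(bytes.length == 12, s"expected 12 bytes, got ${bytes.length}")
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanos = buf.getLong()
    val day = buf.getInt()
    NanoTimeSketch(day, nanos)
  }
}
```

Reading the fields in the wrong order or with the wrong byte order would still produce a value, just a wrong one, so a round-trip test against files written by Hive or Impala is the safer check.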

SparkQA · 2015-01-08T10:14:09Z

Test build #25211 has finished for PR 3820 at commit 8526a33.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

adrian-wang · 2015-02-03T01:09:11Z

I just rebased my code and upgraded jodd to 3.6.3.

SparkQA · 2015-02-03T01:09:26Z

Test build #26568 has finished for PR 3820 at commit 5152f2a.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-03T01:09:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26568/
Test FAILed.

yhuai · 2015-02-03T01:14:20Z

I tried the newly uploaded Parquet data in https://issues.apache.org/jira/browse/SPARK-4768 (with my timezone set to UTC). For one row, I got

[test row 5,2015-01-02 20:54:10.000456789]

But, the data was generated by

insert into string_timestamp (dummy,timestamp1) values('test row 5', '2015-01-02 20:54:10.123456789');

Can you take a look?

adrian-wang · 2015-02-03T02:36:31Z

@yhuai Sorry, I misunderstood the setNanos API. It works OK now.
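
The faulty code is not shown in this thread, but one plausible way to get `.000456789` instead of `.123456789` is the following: `java.sql.Timestamp.setNanos` replaces the entire fractional second, so passing only the sub-millisecond remainder discards the milliseconds. The sketch below illustrates that behaviour; `SetNanosPitfall` and its method names are hypothetical.

```scala
import java.sql.Timestamp

object SetNanosPitfall {
  // Misuse: millis already carries .123, and only the leftover 456789 ns
  // is passed to setNanos, which overwrites the whole fraction.
  def wrong(millis: Long, subMillisNanos: Int): Timestamp = {
    val ts = new Timestamp(millis)
    ts.setNanos(subMillisNanos) // fraction becomes .000456789
    ts
  }

  // Intended use: whole seconds in the constructor, full nanos-of-second in setNanos.
  def right(seconds: Long, nanosOfSecond: Int): Timestamp = {
    val ts = new Timestamp(seconds * 1000L)
    ts.setNanos(nanosOfSecond) // fraction becomes .123456789
    ts
  }
}
```

Here 1420232050 is the epoch second for 2015-01-02 20:54:10 UTC, so `wrong(1420232050123L, 456789)` ends up with 456789 nanoseconds while `right(1420232050L, 123456789)` keeps all nine fractional digits.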

SparkQA · 2015-02-03T02:37:36Z

Test build #26583 has started for PR 3820 at commit b1e2a0d.

This patch merges cleanly.

SparkQA · 2015-02-03T03:55:30Z

Test build #26583 has finished for PR 3820 at commit b1e2a0d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LogisticGradient(numClasses: Int) extends Gradient
- case class HiveScriptIOSchema (
- val trimed_class = serdeClassName.split("'")(1)

AmplabJenkins · 2015-02-03T03:55:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26583/
Test FAILed.

marmbrus · 2015-02-03T03:56:39Z

Jenkins, test this please

SparkQA · 2015-02-03T03:57:35Z

Test build #26597 has started for PR 3820 at commit b1e2a0d.

This patch merges cleanly.

SparkQA · 2015-02-03T05:20:06Z

Test build #26597 has finished for PR 3820 at commit b1e2a0d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-03T05:20:10Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26597/
Test FAILed.

yhuai · 2015-02-03T06:37:17Z

Jenkins, test this please

SparkQA · 2015-02-03T06:42:49Z

Test build #26622 has started for PR 3820 at commit b1e2a0d.

This patch merges cleanly.

SparkQA · 2015-02-03T08:22:13Z

Test build #26622 has finished for PR 3820 at commit b1e2a0d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-03T08:22:17Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26622/
Test PASSed.

Author: Daoyuan Wang <[email protected]>

Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits:

b1e2a0d [Daoyuan Wang] fix for nanos
4dadef1 [Daoyuan Wang] fix wrong read
93f438d [Daoyuan Wang] parquet timestamp support

(cherry picked from commit 0c20ce6)
Signed-off-by: Michael Armbrust <[email protected]>

marmbrus · 2015-02-03T20:07:05Z

Thanks! Merged to master.

This PR might have some issues with apache#3732, and it would have merge conflicts with apache#3820, so the review can be delayed until those two are merged.

Author: Daoyuan Wang <[email protected]>

Closes apache#3822 from adrian-wang/parquetdate and squashes the following commits:

2c5d54d [Daoyuan Wang] add a test case
faef887 [Daoyuan Wang] parquet support for primitive date
97e9080 [Daoyuan Wang] parquet support for date type

This PR might have some issues with #3732, and it would have merge conflicts with #3820, so the review can be delayed until those two are merged.

Author: Daoyuan Wang <[email protected]>

Closes #3822 from adrian-wang/parquetdate and squashes the following commits:

2c5d54d [Daoyuan Wang] add a test case
faef887 [Daoyuan Wang] parquet support for primitive date
97e9080 [Daoyuan Wang] parquet support for date type

(cherry picked from commit 4659468)
Signed-off-by: Cheng Lian <[email protected]>

adrian-wang mentioned this pull request Dec 29, 2014

[SPARK-4985] [SQL] parquet support for date type #3822

Closed

adrian-wang reviewed Dec 30, 2014

adrian-wang force-pushed the parquettimestamp branch from dc6eaba to 44d3ab1 on January 4, 2015 07:59

marmbrus reviewed Jan 6, 2015

marmbrus reviewed Jan 7, 2015

adrian-wang added 3 commits February 2, 2015 18:23

parquet timestamp support

93f438d

fix wrong read

4dadef1

fix for nanos

b1e2a0d

adrian-wang force-pushed the parquettimestamp branch from 5152f2a to b1e2a0d on February 3, 2015 02:33

asfgit closed this in 0c20ce6 Feb 3, 2015
