Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4987] [SQL] parquet timestamp type support #3820

Closed
wants to merge 3 commits into from

Conversation

adrian-wang
Copy link
Contributor

No description provided.

@SparkQA
Copy link

SparkQA commented Dec 29, 2014

Test build #24855 has started for PR 3820 at commit d44831a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 29, 2014

Test build #24855 has finished for PR 3820 at commit d44831a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24855/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24884 has started for PR 3820 at commit dc6eaba.

  • This patch merges cleanly.

@@ -84,7 +86,8 @@ private[parquet] class RowReadSupport extends ReadSupport[Row] with Logging {
// TODO: Why it can be null?
if (schema == null) {
log.debug("falling back to Parquet read schema")
schema = ParquetTypesConverter.convertToAttributes(parquetSchema, false)
schema = ParquetTypesConverter.convertToAttributes(
parquetSchema, new SQLContext(new SparkContext))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing used here inside this SQLContext is the isParquetBinaryAsString and isParquetINT96AsTimestamp. I'll add a comment here if necessary, to point this out clearly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think its safe to instantiate a SparkContext here as thats a pretty expensive operations and will throw exceptions if there is more than one in a single JVM. We can try to refactor this in the future, but I'd just pass two options here (using named parameters for booleans).

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24884 has finished for PR 3820 at commit dc6eaba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24884/
Test FAILed.

@adrian-wang
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24889 has started for PR 3820 at commit dc6eaba.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 30, 2014

Test build #24889 has finished for PR 3820 at commit dc6eaba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24889/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Jan 4, 2015

Test build #25027 has started for PR 3820 at commit 44d3ab1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 4, 2015

Test build #25027 has finished for PR 3820 at commit 44d3ab1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25027/
Test PASSed.

* When set to true, we always treat INT96Values in Parquet files as timestamp.
*/
private[spark] def isParquetINT96AsTimestamp: Boolean =
getConf(PARQUET_INT96_AS_TIMESTAMP, "false").toBoolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really use INT96 for anything else (and I don't think other systems do either?) so maybe this should be true by default?

@marmbrus
Copy link
Contributor

marmbrus commented Jan 6, 2015

Thanks for doing this, I've been getting a ton of requests for this feature!

Can you also add this new config option to the sql programming guide?

@SparkQA
Copy link

SparkQA commented Jan 6, 2015

Test build #25094 has started for PR 3820 at commit d4dbc8a.

  • This patch merges cleanly.

@adrian-wang
Copy link
Contributor Author

Oh sorry, I just checked Impala's configuration and I think it is not what it is here. I'll change my code to conform to that.

@SparkQA
Copy link

SparkQA commented Jan 6, 2015

Test build #25094 has finished for PR 3820 at commit d4dbc8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25094/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 7, 2015

Test build #25161 has started for PR 3820 at commit 5cb8f97.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 7, 2015

Test build #25161 has finished for PR 3820 at commit 5cb8f97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25161/
Test PASSed.

<groupId>org.jodd</groupId>
<artifactId>jodd-core</artifactId>
<version>${jodd.version}</version>
</dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty hesitant to add a dependency here as they are very high cost for a project as big as Spark. Is there any way to do this without adding this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also convert to/from Julian by ourselves... I'll draft it,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing it by hand may be dangerous, because of leap seconds [https://en.wikipedia.org/wiki/Leap_second], as the specific date could be inaccurate.
And by check the pom[http://repo1.maven.org/maven2/org/jodd/jodd-core/3.5.2/jodd-core-3.5.2.pom] of jodd-core, there's no additional dependence, so the influence is comparatively small. Use jodd to covert also make everything consistent with hive, so we can be 100% compatible with those data generated by hive. So I'd prefer keep this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, you are right that its a bad idea to do this by hand. Are there any dependencies that Spark SQL already has that could be used instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I talked to @pwendell and we think it would be better to use Joda time if possible since spark already depends on that library in other subprojects. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have read the documents of joda-time. The toJulianDayNumber API only valid since 2.2, but what we use in spark is 2.1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI most systems don't support leap seconds. I'm not sure why we'd want to support them here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mainly aims to make sure we are compatible with hive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Hive support leap seconds? I looked into the implementation of jodd -- I don't think it supports that when doing date timestamp conversion. I could be wrong though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to do this conversion ourselves, I think it is fine...

@SparkQA
Copy link

SparkQA commented Jan 8, 2015

Test build #25211 has started for PR 3820 at commit 8526a33.

  • This patch merges cleanly.

@adrian-wang
Copy link
Contributor Author

The NanoTime in Hive-0.14 is a little bit different from NanoTime in parquet-examples, and here are some related discussions. So I just rewrite NanoTime in hive into scala, instead of using NanoTime from parquet-examples.

@SparkQA
Copy link

SparkQA commented Jan 8, 2015

Test build #25211 has finished for PR 3820 at commit 8526a33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Copy link
Contributor Author

I just rebased my code and upgrade jodd to 3.6.3.

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26568 has finished for PR 3820 at commit 5152f2a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26568/
Test FAILed.

@yhuai
Copy link
Contributor

yhuai commented Feb 3, 2015

I tried the newly uploaded parquet data in https://issues.apache.org/jira/browse/SPARK-4768 (I set my timezone to UTC), for one line, I got

[test row 5,2015-01-02 20:54:10.000456789]

But, the data was generated by

insert into string_timestamp (dummy,timestamp1) values('test row 5', '2015-01-02 20:54:10.123456789');

Can you take a look?

@adrian-wang
Copy link
Contributor Author

@yhuai Sorry, I got a misunderstanding on the setNanos API, it works OK now.

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26583 has started for PR 3820 at commit b1e2a0d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26583 has finished for PR 3820 at commit b1e2a0d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticGradient(numClasses: Int) extends Gradient
    • case class HiveScriptIOSchema (
    • val trimed_class = serdeClassName.split("'")(1)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26583/
Test FAILed.

@marmbrus
Copy link
Contributor

marmbrus commented Feb 3, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26597 has started for PR 3820 at commit b1e2a0d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26597 has finished for PR 3820 at commit b1e2a0d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26597/
Test FAILed.

@yhuai
Copy link
Contributor

yhuai commented Feb 3, 2015

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26622 has started for PR 3820 at commit b1e2a0d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 3, 2015

Test build #26622 has finished for PR 3820 at commit b1e2a0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26622/
Test PASSed.

asfgit pushed a commit that referenced this pull request Feb 3, 2015
Author: Daoyuan Wang <[email protected]>

Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits:

b1e2a0d [Daoyuan Wang] fix for nanos
4dadef1 [Daoyuan Wang] fix wrong read
93f438d [Daoyuan Wang] parquet timestamp support

(cherry picked from commit 0c20ce6)
Signed-off-by: Michael Armbrust <[email protected]>
@marmbrus
Copy link
Contributor

marmbrus commented Feb 3, 2015

Thanks! Merged to master.

@asfgit asfgit closed this in 0c20ce6 Feb 3, 2015
guavuslabs-builder pushed a commit to ThalesGroup/spark that referenced this pull request Mar 23, 2015
This PR might have some issues with apache#3732 ,
and this would have merge conflicts with apache#3820 so the review can be delayed till that 2 were merged.

Author: Daoyuan Wang <[email protected]>

Closes apache#3822 from adrian-wang/parquetdate and squashes the following commits:

2c5d54d [Daoyuan Wang] add a test case
faef887 [Daoyuan Wang] parquet support for primitive date
97e9080 [Daoyuan Wang] parquet support for date type
asfgit pushed a commit that referenced this pull request Mar 23, 2015
This PR might have some issues with #3732 ,
and this would have merge conflicts with #3820 so the review can be delayed till that 2 were merged.

Author: Daoyuan Wang <[email protected]>

Closes #3822 from adrian-wang/parquetdate and squashes the following commits:

2c5d54d [Daoyuan Wang] add a test case
faef887 [Daoyuan Wang] parquet support for primitive date
97e9080 [Daoyuan Wang] parquet support for date type

(cherry picked from commit 4659468)
Signed-off-by: Cheng Lian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants