
[SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python #6250

Closed
davies wants to merge 4 commits into apache:master from davies:tzone

Conversation

@davies
Contributor

davies commented May 19, 2015

Spark SQL does not support timezones, and Pyrolite does not handle timezones well either. This patch converts datetimes into POSIX timestamps (avoiding timezone confusion), which is what SQL uses internally. If a datetime object has no timezone, it is treated as local time.

The timezone of a datetime in an RDD is lost after one round trip: all datetimes coming back from SQL are in local time.

Because of Pyrolite, datetimes from SQL only have millisecond precision.

This PR also drops the timezone from dates, converting them to the number of days since the epoch (as used in SQL).
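
A minimal sketch of that conversion in plain Python (the helper names datetime_to_posix and date_to_days are illustrative, not Spark's actual internals): aware datetimes are normalized to UTC, naive ones are read as local time, and dates become day counts.

import calendar
import time
from datetime import date

EPOCH = date(1970, 1, 1)

def datetime_to_posix(dt):
    if dt.tzinfo is not None:
        # Aware datetime: normalize to UTC, then count seconds since the epoch.
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        # Naive datetime: interpret it in the machine's local timezone.
        seconds = time.mktime(dt.timetuple())
    return seconds + dt.microsecond / 1e6

def date_to_days(d):
    # Dates carry no time or timezone; they become days since 1970-01-01.
    return (d - EPOCH).days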

@davies
Contributor Author

davies commented May 19, 2015

cc @mengxr

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 19, 2015

Test build #33038 has started for PR 6250 at commit 6a29aa4.

@airhorns

@davies do you think SparkSQL will ever support timezones? This patch means that round-tripping any timezone-aware datetime through SparkSQL will both lose the original timezone information and return the value timezone-naive, which I think isn't really any better than the current state. I imagine it will be very surprising and frustrating for developers who aren't intimately familiar with why Spark might choose to behave this way. What about adding a rule that all Dates or Calendars in SparkSQL are converted to UTC before being stored internally, or something like that? Sounds like a performance killer, but don't Java/Scala devs have this same loss-of-information issue?

@davies
Contributor Author

davies commented May 19, 2015

@airhorns Timezones are too slow and complicated to support in a database or analytics system; most such systems use local time. This will not change in the near term (perhaps not even in the long term).

Given that there is no way to get the timezone back, local time is better than UTC. Spark SQL is not a storage system, so you can easily change its timezone (this is the benefit of using local time).

Scala/Java have the same issue of losing the timezone.

@SparkQA

SparkQA commented May 19, 2015

Test build #33038 timed out for PR 6250 at commit 6a29aa4 after a configured wait of 150m.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33038/
Test FAILed.

@SparkQA

SparkQA commented May 19, 2015

Test build #828 has started for PR 6250 at commit 6a29aa4.

@SparkQA

SparkQA commented May 19, 2015

Test build #828 has finished for PR 6250 at commit 6a29aa4.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented May 19, 2015

@davies By "local time", do you mean timezone unaware? Is the following correct?

After a round trip:

  1. timezone unaware datetime -> timezone unaware datetime (with the same value)
  2. timezone aware datetime -> timezone unaware datetime (with UTC value)

@rxin
Contributor

rxin commented May 19, 2015

@airhorns Maybe long term, but unlikely in the short term since it's super complicated to support them. Of course, if somebody has time to look into scoping this (what's needed), it would be an easier discussion.

@airhorns

I think the ideal case would be supporting timezone-aware objects inside SparkSQL, but I understand that is expensive and challenging. See https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/Date-Time/TIMESTAMP.htm for a good description of how Vertica handles timestamps with zones: it stores them internally as UTC, but converts back to the timezone specified in the schema (if there is one) at query return time. Even if Spark doesn't store and re-convert to the local timezone specified in the schema, can we at least make a rule that everything is stored internally as UTC, or something equally consistent and unambiguous (see the sketch below)? That way, users can form expectations about what comes out of SparkSQL and preserve local timezone information if they care.

Also, what happens to timezone-aware Calendar objects (or the like) in Java/Scala land? Are they converted to local time as well?
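
For illustration, the normalize-to-UTC rule suggested above could look like the following in Python; normalize_to_utc is a hypothetical helper, not anything Spark provides.

from datetime import timezone

def normalize_to_utc(dt):
    # Reject naive datetimes: without a timezone the instant is ambiguous.
    if dt.tzinfo is None:
        raise ValueError("naive datetime has no timezone to normalize")
    # Convert to UTC, then drop the offset so stored values are uniform.
    return dt.astimezone(timezone.utc).replace(tzinfo=None)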

@rxin
Contributor

rxin commented May 19, 2015

All times are stored in UTC as far as I know, so this should match your expectation. Is there a case where you found it is not?

@airhorns

@davies seemed to suggest above that they are converted to server-local time, not UTC. Otherwise, no: datetimes from Python land are the only case I know of, which this PR addresses.

@davies
Contributor Author

davies commented May 19, 2015

@rxin It seems not; java.sql.Timestamp uses the local timezone to interpret seconds:

>>> time.gmtime(1432066818)
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=19, tm_hour=20, tm_min=20, tm_sec=18, tm_wday=1, tm_yday=139, tm_isdst=0)
>>> time.localtime(1432066818)
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=19, tm_hour=13, tm_min=20, tm_sec=18, tm_wday=1, tm_yday=139, tm_isdst=1)

scala> new java.sql.Timestamp(1432066818000L).toString()
res8: String = 2015-05-19 13:20:18.0

BTW, should we keep precision below a millisecond? java.sql.Timestamp does support that.

@davies
Contributor Author

davies commented May 19, 2015

@mengxr After this patch: 1) yes; 2) no, it's local time. So you can get the timezone you care about by changing the timezone of the machines (see the sketch below).

@marmbrus Any thoughts on this?
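
A rough illustration of those round-trip semantics (Python 3; round_trip is a hypothetical stand-in for writing a value to Spark SQL and reading it back, not a PySpark API):

import time
from datetime import datetime, timezone

def round_trip(dt):
    # Store as a POSIX timestamp; any tzinfo is consumed at this point.
    if dt.tzinfo is not None:
        ts = dt.timestamp()                 # aware: the exact UTC instant
    else:
        ts = time.mktime(dt.timetuple())    # naive: read as local time
    # Read back as a naive datetime in the machine's local timezone.
    return datetime.fromtimestamp(ts)

naive = datetime(2015, 5, 19, 13, 20, 18)
aware = datetime(2015, 5, 19, 20, 20, 18, tzinfo=timezone.utc)
print(round_trip(naive))  # 1) the same value back (barring DST ambiguity)
print(round_trip(aware))  # 2) naive, rendered in the local timezone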

@SparkQA

SparkQA commented May 21, 2015

Test build #844 has started for PR 6250 at commit 6a29aa4.

@SparkQA

SparkQA commented May 21, 2015

Test build #844 has finished for PR 6250 at commit 6a29aa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@airhorns

ping @davies would be great to get this in

Davies Liu added 2 commits June 10, 2015 22:05
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34666 has started for PR 6250 at commit 99d9d9c.

@davies
Contributor Author

davies commented Jun 11, 2015

@airhorns Master already supports datetime with timezone (via #6733); this PR has been updated to add a test for it.

@rxin
Contributor

rxin commented Jun 11, 2015

@davies can you update the pull request description?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@davies davies changed the title [SPARK-6411] [SQL] [PySpark] support datetime with timezone in Python [SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python Jun 11, 2015
@davies
Contributor Author

davies commented Jun 11, 2015

@rxin Updated; it now also supports date with timezone.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34671 has started for PR 6250 at commit 44d8497.

@rxin
Contributor

rxin commented Jun 11, 2015

LGTM.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34666 has finished for PR 6250 at commit 99d9d9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Log2(child: Expression)
    • case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes

@AmplabJenkins

Merged build finished. Test PASSed.

@asfgit asfgit closed this in 424b007 Jun 11, 2015
@SparkQA

SparkQA commented Jun 11, 2015

Test build #34671 has finished for PR 6250 at commit 44d8497.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeafMathExpression(c: Double, name: String)
    • case class EulerNumber() extends LeafMathExpression(math.E, "E")
    • case class Pi() extends LeafMathExpression(math.Pi, "PI")
    • case class Log2(child: Expression)
    • case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes

@AmplabJenkins

Merged build finished. Test PASSed.

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
[SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python

Author: Davies Liu <[email protected]>

Closes apache#6250 from davies/tzone and squashes the following commits:

44d8497 [Davies Liu] add timezone support for DateType
99d9d9c [Davies Liu] use int for timestamp
10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone
6a29aa4 [Davies Liu] support datetime with timezone