
[SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python #6250

Closed
davies wants to merge 4 commits into apache:master from davies:tzone

Conversation

@davies
Contributor

davies commented May 19, 2015

Spark SQL does not support timezones, and Pyrolite does not handle timezones well either. This patch converts datetimes into POSIX timestamps (avoiding timezone confusion), which is what SQL uses internally. If a datetime object has no timezone, it is treated as local time.

The timezone of a datetime in an RDD is lost after one round trip: all datetimes coming back from SQL are in local time.

Because of Pyrolite, datetimes from SQL only have millisecond precision.

This PR also drops the timezone from dates, converting them to the number of days since the epoch (as used in SQL).
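
A minimal sketch of that conversion in plain Python (the helper names datetime_to_posix and date_to_days are illustrative, not Spark's actual internals): aware datetimes are normalized to UTC, naive ones are read as local time, and dates become day counts.

import calendar
import time
from datetime import date

EPOCH = date(1970, 1, 1)

def datetime_to_posix(dt):
    if dt.tzinfo is not None:
        # Aware datetime: normalize to UTC, then count seconds since the epoch.
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        # Naive datetime: interpret it in the machine's local timezone.
        seconds = time.mktime(dt.timetuple())
    return seconds + dt.microsecond / 1e6

def date_to_days(d):
    # Dates carry no time or timezone; they become days since 1970-01-01.
    return (d - EPOCH).days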

@davies
Contributor Author

davies commented May 19, 2015

cc @mengxr

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 19, 2015

Test build #33038 has started for PR 6250 at commit 6a29aa4.

@airhorns

@davies do you think SparkSQL will ever support timezones? This patch means that round-tripping any timezone-aware datetime through SparkSQL will both lose the original timezone information and return the value timezone-naive, which I think isn't really any better than the current state. I imagine it will be very surprising and frustrating for developers who aren't intimately familiar with why Spark might choose to behave this way. What about adding a rule that all Dates or Calendars in SparkSQL are converted to UTC before being stored internally, or something like that? Sounds like a performance killer, but don't Java/Scala devs have this same loss-of-information issue?

@davies
Contributor Author

davies commented May 19, 2015

@airhorns Timezones are too slow and complicated to support in a database or analytics system; most such systems use local time. This will not change in the near term (perhaps not even in the long term).

Given that there is no way to get the timezone back, local time is better than UTC. Spark SQL is not a storage system, so you can easily change its timezone (this is the benefit of using local time).

Scala/Java have the same issue of losing the timezone.

@SparkQA

SparkQA commented May 19, 2015

Test build #33038 timed out for PR 6250 at commit 6a29aa4 after a configured wait of 150m.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33038/
Test FAILed.

@SparkQA

SparkQA commented May 19, 2015

Test build #828 has started for PR 6250 at commit 6a29aa4.

@SparkQA

SparkQA commented May 19, 2015

Test build #828 has finished for PR 6250 at commit 6a29aa4.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented May 19, 2015

@davies By "local time", do you mean timezone unaware? Is the following correct?

After a round trip:

  1. timezone unaware datetime -> timezone unaware datetime (with the same value)
  2. timezone aware datetime -> timezone unaware datetime (with UTC value)

@rxin
Contributor

rxin commented May 19, 2015

@airhorns Maybe long term, but unlikely in the short term since it's super complicated to support them. Of course, if somebody has time to look into scoping this (what's needed), it would be an easier discussion.

@airhorns

I think the ideal case would be supporting timezone-aware objects inside SparkSQL, but I understand that is expensive and challenging. See https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/Date-Time/TIMESTAMP.htm for a good description of how Vertica handles timestamps with zones: it stores them internally as UTC, but converts back to the timezone specified in the schema (if there is one) at query return time. Even if Spark doesn't store and re-convert to the local timezone specified in the schema, can we at least make a rule that everything is stored internally as UTC, or something equally consistent and unambiguous (see the sketch below)? That way, users can form expectations about what comes out of SparkSQL and preserve local timezone information if they care.

Also, what happens to timezone-aware Calendar objects (or the like) in Java/Scala land? Are they converted to local time as well?
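
For illustration, the normalize-to-UTC rule suggested above could look like the following in Python; normalize_to_utc is a hypothetical helper, not anything Spark provides.

from datetime import timezone

def normalize_to_utc(dt):
    # Reject naive datetimes: without a timezone the instant is ambiguous.
    if dt.tzinfo is None:
        raise ValueError("naive datetime has no timezone to normalize")
    # Convert to UTC, then drop the offset so stored values are uniform.
    return dt.astimezone(timezone.utc).replace(tzinfo=None)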

@rxin
Contributor

rxin commented May 19, 2015

All times are stored in UTC as far as I know, so this should match your expectation. Is there a case where you found it is not?

@airhorns

@davies seemed to suggest above that they are converted to server-local time, not UTC. Otherwise, no: datetimes from Python land are the only case I know of, which this PR addresses.

@davies
Contributor Author

davies commented May 19, 2015

@rxin It seems not; java.sql.Timestamp uses the local timezone to interpret seconds:

>>> time.gmtime(1432066818)
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=19, tm_hour=20, tm_min=20, tm_sec=18, tm_wday=1, tm_yday=139, tm_isdst=0)
>>> time.localtime(1432066818)
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=19, tm_hour=13, tm_min=20, tm_sec=18, tm_wday=1, tm_yday=139, tm_isdst=1)

scala> new java.sql.Timestamp(1432066818000L).toString()
res8: String = 2015-05-19 13:20:18.0

BTW, should we keep precision below a millisecond? java.sql.Timestamp does support that.

@davies
Contributor Author

davies commented May 19, 2015

@mengxr After this patch: 1) yes; 2) no, it's local time. So you can get the timezone you care about by changing the timezone of the machines (see the sketch below).

@marmbrus Any thoughts on this?
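
A rough illustration of those round-trip semantics (Python 3; round_trip is a hypothetical stand-in for writing a value to Spark SQL and reading it back, not a PySpark API):

import time
from datetime import datetime, timezone

def round_trip(dt):
    # Store as a POSIX timestamp; any tzinfo is consumed at this point.
    if dt.tzinfo is not None:
        ts = dt.timestamp()                 # aware: the exact UTC instant
    else:
        ts = time.mktime(dt.timetuple())    # naive: read as local time
    # Read back as a naive datetime in the machine's local timezone.
    return datetime.fromtimestamp(ts)

naive = datetime(2015, 5, 19, 13, 20, 18)
aware = datetime(2015, 5, 19, 20, 20, 18, tzinfo=timezone.utc)
print(round_trip(naive))  # 1) the same value back (barring DST ambiguity)
print(round_trip(aware))  # 2) naive, rendered in the local timezone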

@SparkQA

SparkQA commented May 21, 2015

Test build #844 has started for PR 6250 at commit 6a29aa4.

@SparkQA

SparkQA commented May 21, 2015

Test build #844 has finished for PR 6250 at commit 6a29aa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@airhorns

ping @davies would be great to get this in

Davies Liu added 2 commits June 10, 2015 22:05
Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34666 has started for PR 6250 at commit 99d9d9c.

@davies
Contributor Author

davies commented Jun 11, 2015

@airhorns Master already supports datetime with timezone (via #6733); this PR has been updated to add a test for it.

@rxin
Contributor

rxin commented Jun 11, 2015

@davies can you update the pull request description?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@davies davies changed the title [SPARK-6411] [SQL] [PySpark] support datetime with timezone in Python [SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python Jun 11, 2015
@davies
Contributor Author

davies commented Jun 11, 2015

@rxin Updated; it now also supports date with timezone.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34671 has started for PR 6250 at commit 44d8497.

@rxin
Contributor

rxin commented Jun 11, 2015

LGTM.

@SparkQA

SparkQA commented Jun 11, 2015

Test build #34666 has finished for PR 6250 at commit 99d9d9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Log2(child: Expression)
    • case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes

@AmplabJenkins

Merged build finished. Test PASSed.

@asfgit asfgit closed this in 424b007 Jun 11, 2015
@SparkQA

SparkQA commented Jun 11, 2015

Test build #34671 has finished for PR 6250 at commit 44d8497.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LeafMathExpression(c: Double, name: String)
    • case class EulerNumber() extends LeafMathExpression(math.E, "E")
    • case class Pi() extends LeafMathExpression(math.Pi, "PI")
    • case class Log2(child: Expression)
    • case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes

@AmplabJenkins

Merged build finished. Test PASSed.

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
[SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python

Author: Davies Liu <[email protected]>

Closes apache#6250 from davies/tzone and squashes the following commits:

44d8497 [Davies Liu] add timezone support for DateType
99d9d9c [Davies Liu] use int for timestamp
10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone
6a29aa4 [Davies Liu] support datetime with timezone