[SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python #6250
Conversation
cc @mengxr |
Merged build triggered. |
Merged build started. |
Test build #33038 has started for PR 6250 at commit |
@davies do you think SparkSQL will ever support timezones? This patch means that round-tripping any timezone-aware datetime through SparkSQL will both lose the original timezone information and return the value timezone-naive, which I think isn't really any better than the current state. I imagine it will be very surprising and frustrating for developers who aren't intimately familiar with why Spark might choose to behave this way. What about adding a new rule that all Dates or Calendars in SparkSQL are converted to UTC before being stored internally, or something like that? It sounds like a performance killer, but don't Java/Scala devs have this same loss-of-information issue? |
@airhorns Timezones are too slow/complicated to support in a database or analytics system; most of them use the local timezone. This will not change in the near term (or even the long term). Given that there is no way to get the timezone back, local time is better than UTC. Spark SQL is not a storage system, so you can easily change its timezone (this is the benefit of using the local timezone). Scala/Java have the same issue of losing the timezone. |
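For illustration, a minimal sketch of the round trip described above (not code from this PR; it assumes a local PySpark session and uses the modern SparkSession entry point rather than the SQLContext of this PR's era):

```python
from datetime import datetime, timedelta, timezone
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A timezone-aware datetime, two hours ahead of UTC.
aware = datetime(2015, 5, 18, 12, 0, tzinfo=timezone(timedelta(hours=2)))

df = spark.createDataFrame([(aware,)], ["ts"])
result = df.collect()[0].ts

# The instant in time is preserved, but the value comes back
# timezone-naive, expressed in the driver's local timezone.
print(result.tzinfo)  # None
```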
Test build #33038 timed out for PR 6250 at commit |
Merged build finished. Test FAILed. |
Test FAILed. |
Test build #828 has started for PR 6250 at commit |
Test build #828 has finished for PR 6250 at commit |
@davies By "local time", do you mean timezone unaware? Is the following correct? After a round trip:
|
@airhorns Maybe long term, but unlikely in the short term since it's super complicated to support them. Of course, if somebody has time to look into scoping this (what's needed), it would be an easier discussion. |
I think the ideal case would be supporting timezone-aware objects inside SparkSQL, but I understand that is expensive and challenging. See https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/SQLReferenceManual/DataTypes/Date-Time/TIMESTAMP.htm for a good description of how Vertica handles timestamps with zones: it stores them internally as UTC, but converts back to the timezone specified in the schema (if there is one) at query return time. Even if Spark doesn't store and re-convert to the local timezone specified in the schema, can we at least make a rule that all timestamps are stored internally as UTC, or something equally consistent and unambiguous? That way, users can form expectations about what is coming out of SparkSQL and preserve local timezone information if they care. Also, what happens to timezone-aware Calendar objects on the Java/Scala side? Are they converted to local time as well? |
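Regardless of what Spark does internally, an application can keep the semantics unambiguous by normalizing on the way in; a minimal sketch, assuming Python 3.6+ (the helper name is hypothetical):

```python
from datetime import datetime, timezone

def to_utc(dt: datetime) -> datetime:
    """Normalize a datetime to UTC before handing it to Spark SQL."""
    if dt.tzinfo is None:
        # Naive datetimes are assumed to be local time, matching how
        # this patch interprets them; attach the local timezone first.
        dt = dt.astimezone()
    return dt.astimezone(timezone.utc)
```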
All times are stored in UTC as far as I know, so this should match your expectation. Is there a case where you found it is not? |
@davies seemed to suggest above that they are converted to server local time, not UTC. And otherwise, no, |
@rxin It seems not; java.sql.Timestamp uses the local timezone to interpret seconds:
BTW, should we keep the precision below a millisecond? java.sql.Timestamp does support that. |
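The local-timezone interpretation mentioned above can also be seen from Python; a small sketch (not the snippet elided from the comment above):

```python
import calendar
import time
from datetime import datetime

naive = datetime(2015, 5, 18, 12, 0)

# mktime() reads the time tuple in the *local* timezone...
local_seconds = time.mktime(naive.timetuple())
# ...while timegm() reads the same tuple as UTC.
utc_seconds = calendar.timegm(naive.timetuple())

# The two differ by exactly the local UTC offset, in seconds.
print(local_seconds - utc_seconds)
```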
Test build #844 has started for PR 6250 at commit |
Test build #844 has finished for PR 6250 at commit |
ping @davies would be great to get this in |
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala
Merged build triggered. |
Merged build started. |
Test build #34666 has started for PR 6250 at commit |
@davies can you update the pull request description? |
Merged build triggered. |
Merged build started. |
@rxin Updated; it also supports date with a timezone in it. |
Test build #34671 has started for PR 6250 at commit |
LGTM. |
Test build #34666 has finished for PR 6250 at commit |
Merged build finished. Test PASSed. |
Test build #34671 has finished for PR 6250 at commit |
Merged build finished. Test PASSed. |
…ython Spark SQL does not support timezones, and Pyrolite does not handle them well. This patch converts datetimes into POSIX timestamps (avoiding timezone confusion), which is what SQL uses. If a datetime object has no timezone, it is treated as local time. The timezone in an RDD will be lost after one round trip; all datetimes coming back from SQL will be in local time. Because of Pyrolite, datetimes from SQL only have millisecond precision. This PR also drops the timezone from dates, converting them to the number of days since the epoch (as used in SQL). Author: Davies Liu <[email protected]> Closes apache#6250 from davies/tzone and squashes the following commits: 44d8497 [Davies Liu] add timezone support for DateType 99d9d9c [Davies Liu] use int for timestamp 10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone 6a29aa4 [Davies Liu] support datetime with timezone
Spark SQL does not support timezones, and Pyrolite does not handle them well. This patch converts datetimes into POSIX timestamps (avoiding timezone confusion), which is what SQL uses. If a datetime object has no timezone, it is treated as local time.
The timezone in an RDD will be lost after one round trip: all datetimes coming back from SQL will be in local time.
Because of Pyrolite, datetimes from SQL only have millisecond precision.
This PR also drops the timezone from dates, converting them to the number of days since the epoch (as used in SQL).
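For reference, the conversions described above can be sketched in plain Python; this is an approximation of the approach, not the PR's actual code (function names are hypothetical):

```python
import calendar
import time
from datetime import date, datetime

EPOCH_ORDINAL = date(1970, 1, 1).toordinal()

def datetime_to_millis(dt: datetime) -> int:
    """Convert a datetime to a POSIX timestamp in milliseconds."""
    if dt.tzinfo is not None:
        # Aware datetimes carry their own offset; read them as UTC.
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        # Naive datetimes are treated as local time.
        seconds = time.mktime(dt.timetuple())
    # Pyrolite limits the round trip to millisecond precision.
    return int(seconds) * 1000 + dt.microsecond // 1000

def date_to_days(d: date) -> int:
    """Days since the Unix epoch; any timezone is simply dropped."""
    return d.toordinal() - EPOCH_ORDINAL
```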