POC Support date_bin on timestamps with timezone #9
"2020-09-08T00:00:00+05", | ||
"2020-09-08T00:00:00+05", | ||
"2020-09-08T00:00:00+05", | ||
"2020-09-08T00:00:00+05", | ||
"2020-09-08T00:00:00+05", | ||
"2020-09-07T19:00:00+05:00", | ||
"2020-09-07T19:00:00+05:00", | ||
"2020-09-07T19:00:00+05:00", | ||
"2020-09-07T19:00:00+05:00", | ||
"2020-09-07T19:00:00+05:00", |
This is not right. This is a change in behavior.
With the implementation in 54f1bdc, there is a change in behavior #9 (comment), which is not right. Then I looked into Postgres to see how it works. This is what I found:
postgres =# SELECT
date_bin('1 d', '2021-10-31T01:00:00', '1970-01-01T00:00:00'),
date_bin('1 d', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00'),
date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
date_bin('1 d', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'Europe/Brussels',
date_bin('1 d', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels'
;
date_bin | date_bin | date_bin | date_bin | timezone | timezone
------------------------+---------------------+---------------------+------------------------+---------------------+------------------------
2021-10-31 00:00:00+00 | 2021-10-31 00:00:00 | 2021-10-31 00:00:00 | 2021-10-30 00:00:00+00 | 2021-10-31 02:00:00 | 2021-10-30 22:00:00+00
(1 row)
This is very different from what DataFusion is currently doing, with cargo build from the current main branch:
> SELECT
date_bin(interval '1 day', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AS date_bin_1,
date_bin(interval '1 day', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00') AS date_bin_2,
date_bin(interval '1 day', '2021-10-31T01:00:00' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00') AS date_bin_3,
date_bin(interval '1 day', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00') AS date_bin_4,
date_bin(interval '1 day', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'Europe/Brussels' AS date_bin_5,
date_bin(interval '1 day', '2021-10-31T01:00:00', '1970-01-01T00:00:00') AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels' AS date_bin_6
;
+---------------------+---------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| date_bin_1 | date_bin_2 | date_bin_3 | date_bin_4 | date_bin_5 | date_bin_6 |
+---------------------+---------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| 2021-10-31T00:00:00 | 2021-10-31T00:00:00 | 2021-10-30T02:00:00+02:00 | 2021-10-31T02:00:00+02:00 | 2021-10-31T00:00:00+02:00 | 2021-10-31T02:00:00+02:00 |
+---------------------+---------------------+---------------------------+---------------------------+---------------------------+---------------------------+
1 row(s) fetched.
Elapsed 0.031 seconds.
postgres =# SELECT
pg_typeof('2024-03-31T00:30:00') AS type_unknown,
pg_typeof('2024-03-31T00:30:00' AT TIME ZONE 'Europe/Brussels') AS type_wo_tz, -- 1 casting
pg_typeof('2024-03-31T00:30:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') AS type_w_tz, -- 2 casting
pg_typeof('2024-03-31T00:30:00'::timestamp AT TIME ZONE 'UTC') AS type_w_tz, -- 2 casting
pg_typeof('2024-03-31T00:30:00'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') AS type_wo_tz -- 3 casting
;
type_unknown | type_wo_tz | type_w_tz | type_w_tz | type_wo_tz
--------------+-----------------------------+--------------------------+--------------------------+-----------------------------
unknown | timestamp without time zone | timestamp with time zone | timestamp with time zone | timestamp without time zone
(1 row)
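The pg_typeof output above shows AT TIME ZONE flipping the type on every application: a timestamp without time zone becomes one with time zone, and the reverse. A minimal Python sketch of that toggle follows. The helper `at_time_zone` is a hypothetical name, and a fixed +01:00 offset stands in for 'Europe/Brussels' at that instant (an assumption that avoids needing a tz database); it is not how Postgres or DataFusion implement this.

```python
from datetime import datetime, timedelta, timezone


def at_time_zone(ts, zone):
    """Hypothetical model of the AT TIME ZONE type toggle."""
    if ts.tzinfo is None:
        # timestamp without time zone: interpret the wall clock in `zone`,
        # producing a timestamp with time zone.
        return ts.replace(tzinfo=zone)
    # timestamp with time zone: convert to `zone`, then drop the offset,
    # producing a timestamp without time zone.
    return ts.astimezone(zone).replace(tzinfo=None)


UTC = timezone.utc
# Assumption: Brussels modeled as a fixed +01:00 offset (CET).
BRUSSELS_CET = timezone(timedelta(hours=1))

naive = datetime(2024, 3, 31, 0, 30)               # like '...'::timestamp
with_tz = at_time_zone(naive, UTC)                 # 2 castings: with time zone
without_tz = at_time_zone(with_tz, BRUSSELS_CET)   # 3 castings: without time zone

print(with_tz.isoformat())
print(without_tz.isoformat())
```

Each extra AT TIME ZONE changes the type again, which is why the one-cast and three-cast columns above share a type.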
My proposal is:
This makes sense to me
If we made the change proposed in apache/arrow-rs#5827, would that "solve" all these issues? If so, how would someone implement timezone aware date binning?
I'm not sure. I think we need more discussion to define the behavior. This is how Postgres behaves:
For example, in Postgres:
postgres =# select
pg_typeof('2021-10-31T01:00:00'::timestamp) as input_datatype,
date_bin('1 d', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00'),
pg_typeof(date_bin('1 d', '2021-10-31T01:00:00'::timestamp, '1970-01-01T00:00:00')) as result_datetype;
input_datatype | date_bin | result_datetype
-----------------------------+---------------------+-----------------------------
timestamp without time zone | 2021-10-31 00:00:00 | timestamp without time zone
(1 row)
postgres=# show timezone;
TimeZone
-----------------
America/Chicago
(1 row)
postgres=# select
pg_typeof('2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') as input_datatype,
date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
pg_typeof(date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00')) as result_datetype;
input_datatype | date_bin | result_datetype
--------------------------+------------------------+--------------------------
timestamp with time zone | 2021-10-30 01:00:00-05 | timestamp with time zone
-- -05 is the system timezone; the result is 01:00 instead of 00:00 because the origin time is '1970-01-01T00:00:00'
-- if the origin is cast to the 'Europe/Brussels' timezone, the result will be at 00:00
postgres=# select date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels');
date_bin
------------------------
2021-10-31 00:00:00-05
(1 row)
postgres=# set time zone 'UTC';
SET
postgres=# show timezone;
TimeZone
----------
UTC
(1 row)
postgres=# select
pg_typeof('2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels') as input_datatype,
date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00'),
pg_typeof(date_bin('1 d', '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels', '1970-01-01T00:00:00')) as result_datetype;
input_datatype | date_bin | result_datetype
--------------------------+------------------------+--------------------------
timestamp with time zone | 2021-10-30 00:00:00+00 | timestamp with time zone
-- +00 is the system timezone, not the input timezone
(1 row)
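The Postgres results above can be checked by hand: date_bin floors the underlying instant to a stride boundary counted from the origin, and the session time zone only affects how the origin literal is read and how the result is displayed. A minimal Python sketch of that arithmetic follows. The `date_bin` helper is a hypothetical stand-in, the instants were worked out by hand from the queries above, and fixed offsets replace the named zones (all assumptions).

```python
from datetime import datetime, timedelta, timezone


def date_bin(stride, source, origin):
    """Floor `source` to the start of its `stride`-wide bin, counted from `origin`."""
    whole_bins = (source - origin) // stride  # timedelta // timedelta floors to an int
    return origin + whole_bins * stride


ONE_DAY = timedelta(days=1)
# Assumption: fixed offsets stand in for America/Chicago (-06:00 in January 1970,
# -05:00 on 2021-10-31).
CHICAGO_1970 = timezone(timedelta(hours=-6))
CHICAGO_2021 = timezone(timedelta(hours=-5))

# '2021-10-31T01:00:00' AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels' under a
# Chicago session resolves to this instant (worked out by hand, an assumption).
source = datetime(2021, 10, 31, 5, 0, tzinfo=timezone.utc)

# Origin '1970-01-01T00:00:00' read in the session time zone.
origin_session = datetime(1970, 1, 1, 0, 0, tzinfo=CHICAGO_1970)
print(date_bin(ONE_DAY, source, origin_session).astimezone(CHICAGO_2021).isoformat())

# Origin pushed through the same AT TIME ZONE chain as the source.
origin_shifted = datetime(1970, 1, 1, 5, 0, tzinfo=timezone.utc)
print(date_bin(ONE_DAY, source, origin_shifted).astimezone(CHICAGO_2021).isoformat())
```

Under these assumptions the first call lands at 01:00 at offset -05 and the second at 00:00, matching the two Chicago-session outputs shown above.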
Here is the short-term fix PR: apache#11347
For the long-term fix, discussion is still ongoing in apache/arrow-rs#5827, which is the first step of the long-term fix. At this point, I consider the POC done, so I am closing this draft PR.
This PR is a POC to support date_bin on timestamps with timezone. There are two things covered in this PR:
- to_local_time(): converts Timestamp(..., *) to Timestamp(..., None). Combining to_local_time() with date_bin() will look like:
- date_bin on timestamps with timezone: the original date_bin works correctly. The confusion comes from how a timestamp with a timezone is displayed. The underlying UTC time is correct, but when a timezone is present the display automatically adjusts the time, so the displayed time looks wrong. Adjusting the underlying time makes the final displayed time match the expected result.
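To illustrate the display issue described above, here is a minimal Python sketch that bins the same instant twice: once on the underlying UTC time and once on the local wall clock. The helpers `date_bin` and `to_local_time` are hypothetical stand-ins named after the SQL functions, not the DataFusion implementation, and a fixed +02:00 offset is assumed in place of 'Europe/Brussels'.

```python
from datetime import datetime, timedelta, timezone


def date_bin(stride, source, origin):
    """Floor `source` to the start of its `stride`-wide bin, counted from `origin`."""
    return origin + ((source - origin) // stride) * stride


def to_local_time(ts):
    """Hypothetical stand-in: keep the local wall clock, drop the offset."""
    return ts.replace(tzinfo=None)


ONE_DAY = timedelta(days=1)
# Assumption: fixed +02:00 offset instead of the named zone 'Europe/Brussels'.
CEST = timezone(timedelta(hours=2))
ts = datetime(2021, 10, 31, 1, 0, tzinfo=CEST)

# Binning the underlying UTC instant, then displaying in the timestamp's zone.
utc_origin = datetime(1970, 1, 1, tzinfo=timezone.utc)
utc_binned = date_bin(ONE_DAY, ts, utc_origin).astimezone(CEST)

# Binning the local wall clock, then reattaching the offset for display.
local_origin = datetime(1970, 1, 1)
local_binned = date_bin(ONE_DAY, to_local_time(ts), local_origin).replace(tzinfo=CEST)

print(utc_binned.isoformat())
print(local_binned.isoformat())
```

The first result is a correct UTC day boundary that displays at 02:00 local time, while the second sits at local midnight. Reattaching a fixed offset is a simplification: with a real named zone, the offset at the bin start can differ from the offset of the input near a DST change.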