-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Arrow 45.0.0 And Datum Arithmetic, change Decimal Division semantics #6832
Conversation
cba0451
to
5d54984
Compare
5d54984
to
68a89b3
Compare
@@ -430,15 +430,11 @@ select '1 month'::interval + '1980-01-01T12:00:00'::timestamp; | |||
---- | |||
1980-02-01T12:00:00 | |||
|
|||
query D | |||
query error DataFusion error: Error during planning: Cannot coerce arithmetic expression Interval\(MonthDayNano\) \- Date32 to valid types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query was only supported following #6778 and was done so accidentally - #6778 (comment)
I tested some other systems, duckdb explicitly doesn't support this, and I couldn't work out how to get mysql to express this, so I think this is fine
"+-----+---------+", | ||
"| val | ts_diff |", | ||
"+-----+---------+", | ||
"| 3 | PT30S |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timestamp arithmetic now returns the appropriate duration type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember the PT30S
format is some sort of standard but I think it is a pretty terrible UX and very uncommon.
I know there is a technical difference between duration and interval in the arrow spec but as a user, I find this very confusing:
❯ select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S |
+----------------------------+
On the other hand the original formatting is not consistent with Postgres either.
postgres=# select interval '5 minutes 2 days';
interval
-----------------
2 days 00:05:00
(1 row)
postgres=# select interval '5 minutes';
interval
----------
00:05:00
(1 row)
I played around with this locally and we can still get an interval
with an explicit cast
❯ select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S |
+----------------------------+
1 row in set. Query took 0.002 seconds.
❯ select (now() - '2023-01-01'::timestamp)::interval;
+------------------------------------------------------------+
| now() - Utf8("2023-01-01") |
+------------------------------------------------------------+
| 0 years 0 mons 0 days 4772 hours 15 mins 35.571791000 secs |
+------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.
So we can perhaps add the appropriate coercion rules if necessary
Or perhaps we can change the formatting of Durations in arrow-rs to be more interval like
@waitingkuo do you have any thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be happy to change the formatting upstream if there is consensus on what it should be. We are currently just doing what chrono does, which appears to be ISO 8601
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update here is that @tustvold and I spoke, and the next action is I will file a ticket explaining the behavior change and trying to build some consensus on what to do here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the coercion rules might not work well.
❯ select ('2023-01-02'::timestamp - '2023-01-01'::timestamp)::interval;
+--------------------------------------------------------+
| Utf8("2023-01-02") - Utf8("2023-01-01") |
+--------------------------------------------------------+
| 0 years 0 mons 0 days 24 hours 0 mins 0.000000000 secs |
+--------------------------------------------------------+
1 row in set. Query took 0.002 seconds.
it's converted to (0 month, 0 day, 86400000000000 nanos) instead of (0 month, 1 day, 0 nanos)
which isn't consist with
❯ select ('2023-01-02'::timestamp - '2023-01-01'::timestamp);
+-----------------------------------------+
| Utf8("2023-01-02") - Utf8("2023-01-01") |
+-----------------------------------------+
| P1D |
+-----------------------------------------+
1 row in set. Query took 0.002 seconds.
both formats are fine so we. in case we decide to align with interval format, we should discuss whether it outputs
0 years 0 mons 1 days 0 hours 0 mins 0 secs
or
0 years 0 mons 0 days 24 hours 0 mins 0 secs
in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably change duration output to be a number of the duration unit, i.e. 5000 seconds
, or 2500 milliseconds
, etc... to avoid potential ambiguity over what the duration represents. It might be confusing if both intervals and durations report some number of days, but arithmetic involving them has different semantics across DST boundaries. I will file an upstream ticket and work on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have filed two tickets of interest:
- Use
Duration
as the result oftimestamp - timestamp
rather thanInterval
#7068 which tries to describes the change in semantics this PR does - Special case Duration display in datafusion-cli and sqllogictest #7070 that proposes special casing the display of Duration, based on @waitingkuo 's feedback above
@@ -5912,285 +5859,6 @@ mod tests { | |||
check_array(array); | |||
} | |||
|
|||
fn get_timestamp_test_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is removed as timestamp arithmetic no longer returns intervals (which avoids this whole mess of edge cases)
3 0 years 0 mons 15952 days 23 hours 22 mins 12.667123455 secs | ||
2 0 years 0 mons 8406 days 1 hours 1 mins 54.877123455 secs | ||
1 0 years 0 mons 53 days 16 hours 0 mins 20.000000024 secs | ||
3 P15952DT84132.667123455S |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does highlight another fun quirk of intervals - they sort strangely
07e83f7
to
5180300
Compare
@@ -236,9 +179,6 @@ mod tests { | |||
test_array_negative_op!(Int64, 23456i64, 12345i64); | |||
test_array_negative_op!(Float32, 2345.0f32, 1234.0f32); | |||
test_array_negative_op!(Float64, 23456.0f64, 12345.0f64); | |||
test_array_negative_op_intervals!(YearMonth, 2345i32, 1234i32); | |||
test_array_negative_op_intervals!(DayTime, 23456i64, 12345i64); | |||
test_array_negative_op_intervals!(MonthDayNano, 234567i128, 123456i128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved into sqllogic, unfortunately I couldn't find a way to get a MonthDayNano type, but this is better than nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do something like this
select arrow_cast(interval '30 minutes', 'Interval(MonthDayNano)');
"+-----+---------------------------------------------------+", | ||
"| 3 | 0 years 0 mons 0 days 10 hours 0 mins 30.000 secs |", | ||
"| 1 | 0 years 0 mons 0 days 10 hours 0 mins 20.000 secs |", | ||
"| 2 | 0 years 0 mons 0 days 10 hours 0 mins 10.000 secs |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually wrong before, the logic in make_timestamp_tz_sub_table initializes the arrays as follows
let timestamps1 = vec![
1_678_892_420_000_000_000i64 / divisor, //2023-03-15T15:00:20.000_000_000
1_678_892_410_000_000_000i64 / divisor, //2023-03-15T15:00:10.000_000_000
1_678_892_430_000_000_000i64 / divisor, //2023-03-15T15:00:30.000_000_000
];
let timestamps2 = vec![
1_678_892_400_000_000_000i64 / divisor, //2023-03-15T15:00:00.000_000_000
1_678_892_400_000_000_000i64 / divisor, //2023-03-15T15:00:00.000_000_000
1_678_892_400_000_000_000i64 / divisor, //2023-03-15T15:00:00.000_000_000
];
let array1 =
PrimitiveArray::<A>::from_iter_values(timestamps1).with_timezone_opt(tz1);
let array2 =
PrimitiveArray::<A>::from_iter_values(timestamps2).with_timezone_opt(tz2);
Values stored in timestamp arrays are always with respect to the UTC epoch, therefore when subtracting timestamps the timezone should not matter
@viirya / @waitingkuo / @liukun4515 I wonder if you could verify this PR is acceptable (specifically the Decimal display differences as well as the change in decimal division semantics) |
Does this also fix #6794 (see #5646 (comment)) |
It should do, but I haven't verified this |
I have posted an announcement about this PR to the mailing list: https://lists.apache.org/thread/837ghhjh9gd1kvg0qqnmxqs0rm62x0xt I will plan to merge it in Monday Aug 7 2023 unless I hear otherwise |
1 similar comment
I have posted an announcement about this PR to the mailing list: https://lists.apache.org/thread/837ghhjh9gd1kvg0qqnmxqs0rm62x0xt I will plan to merge it in Monday Aug 7 2023 unless I hear otherwise |
apache/arrow-rs#4640 proposes a further tweak to the decimal arithmetic rules. In light of #6828 where decimal division currently silently returns nonsensical answers, I'm not sure this should be a blocker to getting this in, but defer to the consensus. Worst case we could do a patch release for arrow-rs, but I would rather avoid this if possible |
I am going to update this PR and resolve conflicts |
I believe we have resolved all concerns on this PR and it represents a significant step forward in terms of correctness I plan to merge it in once CI has passed. |
I am running this update against the IOx tests as one final check: https://github.com/influxdata/influxdb_iox/pull/8452 |
Our tests passed. 🚀 Onward |
I only concern the change of decimal division semantics could be an impact to some customers potentially if high precision decimal results are important to them. Ideally I'd like to get same precision results as Spark but under current design I think it might be tricky to have two different precision rules for decimal operations. It could possibly make DataFusion complicated on decimal operation too. So it may not be a good idea. Among existing systems, I also don't know if any provides alternative option regarding decimal precision rule. |
apache/arrow-rs#4664 tracks further upstream work related to decimal arithmetic. After that we will have some more options in this space, and can potentially consider a different approach to decimal division inline with Spark. I am currently working on this, but until then the use of the Spark rule is just a recipe for overflow errors, as we have seen |
@tustvold @alamb @viirya |
apache/arrow-rs#4664 (comment) summarises my thoughts on the issue having spent significant time on this. TLDR is we already support a higher precision than most commercial systems, and given this and the fact I personally have no use-cases for decimals, I find it hard to justify spending more of my time on this. For people looking to emulate spark which only supports precision up to 38, casting to Decimal256 and then truncating down to Decimal128 will be equivalent, and is what a precision loss arithmetic kernel would do |
I filed #7301 to track possibly adding a spark decimal compatibility mode if anyone needs it |
I know it. But so what? You mean that we ask Spark users to disable it so Spark can be consistent with DataFusion?? |
Which issue does this PR close?
Closes #6828
Closes #6933
Closes #7068
Rationale for this change
Use new kernels added in apache/arrow-rs#4465
Make timestamp arithmetic kernels more precse
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?