fix: decimal conversion looses value on lower precision #6836
Conversation
@andygrove @viirya @alamb please take a look.
arrow-cast/src/cast/decimal.rs
@@ -112,8 +112,19 @@ where
};

Ok(match cast_options.safe {
It seems strange to match on a boolean rather than just using an if statement. I know this is how the existing code was, but perhaps we could improve this while we are here.
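A stand-alone illustration of the style point only (not the arrow-cast code; the flag name and strings here are made up):

```rust
// Illustrative only: a `match` on a bool has exactly two arms and reads more
// directly as an `if`/`else`.
fn overflow_behaviour(safe: bool) -> &'static str {
    // The match-based form the comment refers to would be:
    //   match safe { true => "replace overflow with null", false => "error on overflow" }
    if safe {
        "replace overflow with null"
    } else {
        "error on overflow"
    }
}

fn main() {
    assert_eq!(overflow_behaviour(true), "replace overflow with null");
    assert_eq!(overflow_behaviour(false), "error on overflow");
}
```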
LGTM. The logic now matches the logic in cast_floating_point_to_decimal128. Thanks @himadripal
arrow-cast/src/cast/mod.rs
let result = cast_with_options(&array, &output_type, &options);

assert_eq!(result.unwrap_err().to_string(),
"InvalidArgumentError(123456789 is too large to store in a Decimal128 of precision 6. Max is 999999)");
The error message format needs updating:
left: "Invalid argument error: 123456789 is too large to store in a Decimal256 of precision 6. Max is 999999"
right: "InvalidArgumentError(123456789 is too large to store in a Decimal256 of precision 6. Max is 999999)"
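For reference, a self-contained sketch of the behaviour this assertion exercises, written against the public arrow APIs (crate paths and the exact error wording are assumptions and may differ by arrow-rs version):

```rust
use arrow_array::{Array, Decimal128Array};
use arrow_cast::{cast_with_options, CastOptions};
use arrow_schema::DataType;

fn main() {
    // 123456789 at scale 3 is 123456.789, which cannot fit in Decimal128(6, 3).
    let array = Decimal128Array::from(vec![123456789_i128])
        .with_precision_and_scale(9, 3)
        .unwrap();
    let output_type = DataType::Decimal128(6, 3);

    // safe = false: precision overflow must surface as an error.
    let strict = CastOptions { safe: false, ..Default::default() };
    let err = cast_with_options(&array, &output_type, &strict).unwrap_err();
    assert!(err.to_string().contains("too large to store"));

    // safe = true (the default): the overflowing value becomes null instead.
    let lenient = CastOptions::default();
    let result = cast_with_options(&array, &output_type, &lenient).unwrap();
    assert!(result.is_null(0));
}
```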
There is a regression in:
let array = vec![Some(i128::MAX), Some(i128::MIN)];
let input_decimal_array = create_decimal_array(array, 10, 3).unwrap();
I would expect this to fail, so we probably need to add the same validation there.
On second thoughts, it seems it was an intentional design decision not to validate this on array creation. Instead, an explicit validation is expected where it matters.
Have we run any benchmarks (I'm not sure if any actually exist) to confirm this doesn't significantly regress performance?
It seems unfortunate to always perform overflow checks when in many cases it should be possible to prove that precision overflow can't occur and need not be checked for.
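As a concrete statement of that observation (a sketch of the arithmetic, not code from this PR): casting Decimal(p1, s1) to Decimal(p2, s2) with s2 >= s1 multiplies every value by 10^(s2 - s1), so overflow is impossible whenever the number of integer digits does not shrink.

```rust
/// Sketch only: returns true when a decimal cast from (p1, s1) to (p2, s2)
/// provably cannot overflow, so per-value precision checks could be skipped.
fn decimal_cast_is_infallible(p1: u8, s1: i8, p2: u8, s2: i8) -> bool {
    // Widening the scale multiplies by 10^(s2 - s1); the result fits as long as
    // the integer-digit budget does not shrink: p2 - s2 >= p1 - s1.
    s2 >= s1 && (i32::from(p2) - i32::from(s2)) >= (i32::from(p1) - i32::from(s1))
}

fn main() {
    assert!(decimal_cast_is_infallible(5, 3, 6, 3));   // more precision, same scale
    assert!(decimal_cast_is_infallible(5, 3, 7, 5));   // precision and scale both +2
    assert!(!decimal_cast_is_infallible(10, 3, 6, 3)); // fewer integer digits: must check
}
```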
arrow-cast/src/cast/decimal.rs
@@ -156,19 +179,9 @@ where
T::Native: DecimalCast + ArrowNativeTypeOp,
{
let array: PrimitiveArray<T> = match input_scale.cmp(&output_scale) {
This could be changed to <=
arrow-cast/src/cast/decimal.rs
array.try_unary(|x| {
f(x).ok_or_else(|| error(x))
.and_then(|v| {
O::validate_decimal_precision(v, output_precision).map(|_| v)
Why not do this when computing the value (the code above), and only when this can fail?
(for example widening precision cannot fail, so no need to spend cycles on validating the produced values)
I thought about it, but there are other examples where it is done this way, so I kept it there. We can also do this as part of the first error checkpoint inside this method:
let error = cast_decimal_to_decimal_error::<I, O>(output_precision, output_scale);
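For illustration, here is the validated-conversion pattern from the diff above written out with concrete types (a sketch: the helper name, error text, and fixed multiplier are assumptions, not the arrow-cast internals):

```rust
use arrow_array::types::{Decimal128Type, DecimalType};
use arrow_array::{Decimal128Array, PrimitiveArray};
use arrow_schema::ArrowError;

/// Rescale by a fixed power of ten and reject anything that no longer fits
/// the requested output precision.
fn rescale_checked(
    array: &Decimal128Array,
    multiplier: i128,
    output_precision: u8,
) -> Result<PrimitiveArray<Decimal128Type>, ArrowError> {
    array.try_unary(|x| {
        x.checked_mul(multiplier)
            .ok_or_else(|| ArrowError::CastError(format!("{x} overflows i128 during rescale")))
            .and_then(|v| {
                // the extra check added by this PR: does the value fit the target precision?
                Decimal128Type::validate_decimal_precision(v, output_precision).map(|_| v)
            })
    })
}

fn main() -> Result<(), ArrowError> {
    let array = Decimal128Array::from(vec![12345_i128]).with_precision_and_scale(5, 3)?;
    // 12.345 rescaled by 100 is 1234500, which needs precision >= 7; precision 6 must fail.
    assert!(rescale_checked(&array, 100, 7).is_ok());
    assert!(rescale_checked(&array, 100, 6).is_err());
    Ok(())
}
```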
I'll create a separate PR (probably tomorrow) to add some criterion benchmarks.
Changed the test to pass.
// input_scale < output_scale
let array: PrimitiveArray<T> = if input_scale <= output_scale {
// input_scale <= output_scale
// the scale doesn't change, but precision may change and cause overflow
Why don't we also check precision and skip convert_to_bigger_or_equal_scale_decimal if it won't overflow?
Won't it still have to go through convert_to_bigger_or_equal_scale_decimal to add the zero at the end for the bigger-precision conversion? IMO, we can only skip the call if the precision is smaller and the scales are equal. Please correct me here.
We could do a check at the beginning like this:
if (array.validate_decimal_precision().is_err())
but we need to return null for safe=true and throw an error for safe=false. Is that what you are suggesting?
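For reference, the array-level version of that check could look roughly like this (a sketch assuming the public arrow-array helpers null_if_overflow_precision and validate_decimal_precision; it only covers the precision check, not any scale change):

```rust
use arrow_array::Decimal128Array;
use arrow_schema::ArrowError;

/// Sketch of an up-front, whole-array precision check: null out overflowing
/// values when `safe` is true, or fail on the first overflow when it is false.
fn check_precision(
    array: &Decimal128Array,
    output_precision: u8,
    safe: bool,
) -> Result<Decimal128Array, ArrowError> {
    if safe {
        Ok(array.null_if_overflow_precision(output_precision))
    } else {
        array.validate_decimal_precision(output_precision)?;
        Ok(array.clone())
    }
}
```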
For example, the original code simply returns the original array for the same scale. If the new precision is bigger, I think we can still do that, no?
Let's assume we are converting 12.345 from (5, 3) to (6, 3); then it needs to be 123450 for it to still be 12.345. If we do array.clone() as per the original code, I'm wondering how it will add the 0 at the end.
If this is correct, then we can return the original array if the precision is bigger and the scale is equal.
select arrow_typeof(cast(cast(1.23 as decimal(10,3)) as decimal(12,3))),
cast(cast(1.23 as decimal(10,3)) as decimal(12,3));
----
Decimal128(12, 3) 1.23
12345 in (5, 3) is 12.345. When casting to (6, 3), it is still 12345, so why does it need to be 123450? 123450 in (6, 3) is 123.45, if I understand it correctly.
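A quick self-contained check of that point, using the public cast API (crate paths assumed): the raw value should stay 12345, with only the data type changing.

```rust
use arrow_array::cast::AsArray;
use arrow_array::types::Decimal128Type;
use arrow_array::{Array, Decimal128Array};
use arrow_cast::cast;
use arrow_schema::DataType;

fn main() {
    let array = Decimal128Array::from(vec![12345_i128])
        .with_precision_and_scale(5, 3)
        .unwrap();
    // Widening the precision with the same scale should not touch the values.
    let result = cast(&array, &DataType::Decimal128(6, 3)).unwrap();
    let result = result.as_primitive::<Decimal128Type>();
    assert_eq!(result.value(0), 12345); // still 12.345, not 123450
    assert_eq!(result.data_type(), &DataType::Decimal128(6, 3));
}
```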
Yes, you are right. The fn signature only has output_precision; I can get the input_precision from the calling fn (it's available there) and change the signature to include input_precision. Would that be fine?
Yea
Done. @viirya please check.
Can anyone please let the build run? Workflows are waiting for approval.
I created a simple benchmark for decimal casting in #6850. Unsurprisingly, validating that the results are correct is slower than not validating the results (before/after numbers from that benchmark were posted for this PR).
We currently have the CastOptions::safe config option for this.
So, yes, it is a performance regression, but the previous behavior was incorrect. This PR now makes this work as advertised. @tustvold Is there a use case we need to support for faster casts without validating results per the CastOptions?
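For context, a minimal criterion benchmark of this cast might look like the sketch below (an assumed setup, not the benchmark actually added in #6850):

```rust
use arrow_array::Decimal128Array;
use arrow_cast::{cast_with_options, CastOptions};
use arrow_schema::DataType;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_decimal_cast(c: &mut Criterion) {
    // 8192 small values: none overflow, so this measures the cost of the checks themselves.
    let array = Decimal128Array::from((0..8192_i128).collect::<Vec<_>>())
        .with_precision_and_scale(10, 3)
        .unwrap();
    let to_type = DataType::Decimal128(6, 3);
    let options = CastOptions::default();

    c.bench_function("cast decimal128(10, 3) -> decimal128(6, 3)", |b| {
        b.iter(|| cast_with_options(&array, &to_type, &options).unwrap())
    });
}

criterion_group!(benches, bench_decimal_cast);
criterion_main!(benches);
```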
No, the cast should be checked; apologies if that wasn't clear. My concern was that the PR as originally formulated blindly performed the checked conversion regardless of the input type, even when the cast was increasing the precision. Given that the whole purpose of tracking precision is to avoid overflow checks, it seemed a little off. I'll try to find some time to take another look at this PR, as from a quick scan this looks to have been addressed.
This is still overly pessimistic: for example, if I increase both the precision and scale by the same amount, I shouldn't need to perform any checks.
However, the kernel is already fallible (try_unary/unary_opt vs unary) and so this kernel already won't be vectorizing properly, and so it isn't like we're regressing a highly optimised kernel here.
I've therefore instead opted to file #6877 and if people care they can action it.
* decimal conversion looses value on lower precision, throws error now on overflow.
* fix review comments and fix formatting.
* for simple case of equal scale and bigger precision, no conversion needed. revert whitespace changes formatting check

Co-authored-by: Himadri Pal <[email protected]> Co-authored-by: himadripal <[email protected]>
Which issue does this PR close?
Closes #6833
Rationale for this change
Decimal128 to Decimal128 with smaller precision produces incorrect results in some cases.
What changes are included in this PR?
It adds decimal validation after conversion to check whether the converted result can fit into the specified precision and scale.
Are there any user-facing changes?