Allow precision loss on multiplying decimal arrays #3690
Conversation
This might be a dumb question, but can we not do something with precision statically instead of per-value. i.e. if multiplying two decimal numbers with precision of 10 and scale 5, the output should have precision of 20 and scale 10. The plan can then intentionally truncate inputs using division if necessary?
This might be another dumb thought, but if you want truncating arithmetic, why not just cast to a floating point representation? It'll be faster and better supported? I guess I had thought the only reason to still bother with decimal precision these days was if you couldn't tolerate approximation?
arrow-arith/src/arithmetic.rs (outdated diff)

    let precision = left.precision();
    let mut product_scale = left.scale() + right.scale();

    math_checked_op(left, right, |a, b| {
This might be a stupid question, I know very little about decimal arithmetic, but if you are just going to truncate the result why do you need to perform multiplication as the larger type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of multiplication between the smaller type (i.e. i128) could overflow, but it could still be representable as i128 after we truncate it below. If we don't perform the multiplication as the larger type, I'm not sure we can get the (overflowed) result and truncate it.
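To illustrate the point with plain integers (a minimal sketch using i64/i128 in place of the kernel's i128/i256; the function name is hypothetical):

    // Multiply two scaled decimal values whose product would overflow the
    // narrow type, by widening first and then dividing the extra scale away.
    fn widen_multiply_then_truncate(a: i64, b: i64, scale_reduction: u32) -> Option<i64> {
        // Widening: the product of two i64 values always fits in i128,
        // just as the product of two i128 values always fits in i256.
        let wide = (a as i128) * (b as i128);
        // Drop the extra fractional digits so the result can fit back into
        // the narrow type; the real kernel rounds rather than truncates.
        let narrowed = wide / 10_i128.pow(scale_reduction);
        i64::try_from(narrowed).ok()
    }

Computing the product in the narrow type first would already have overflowed, which is why the widening has to happen before the truncation.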
Makes sense, I hadn't got to the truncation logic when I wrote this 😄
arrow-arith/src/arithmetic.rs (outdated diff)

    .map(|mut a| {
        // Round the value if an exact representation is not possible.
        // ref: java.math.BigDecimal#doRound
        let mut digits = a.to_string().len() as i8;
log10 perhaps?
log10 on i256?
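For what it's worth, the digit count can be done without the string allocation by repeated division; a minimal sketch on i128 (i256 has no built-in log10, but the same loop shape would apply to any type that supports division by ten):

    // Count the number of decimal digits of a value, ignoring the sign so a
    // leading '-' is not mistakenly counted as a digit.
    fn decimal_digits(v: i128) -> u32 {
        let mut v = v.unsigned_abs();
        let mut digits = 1;
        while v >= 10 {
            v /= 10;
            digits += 1;
        }
        digits
    }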
arrow-arith/src/arithmetic.rs (outdated diff)

    while diff > 0 {
        let divisor = i256::from_i128(10).pow_wrapping(diff as u32);
        a = divide_and_round::<Decimal256Type>(a, divisor);
        product_scale -= diff;
I'm not sure this is correct, as it won't "recompute" the values that have already been evaluated?
Do you mean a? It is divided and rounded.
If say the 5th value of the array triggers a reduction in precision, we will apply the truncation to all values that follow, but the first four values will have been computed with the original product_scale?
Aah, I think you're correct. I forgot that on the JVM each value is a decimal object which keeps its own precision/scale, so each value is allowed to have a different precision/scale after the computation. I need to rethink this for a fix.
As the behavior we're trying to match is row-based and object-based, it differs in many respects from what an array kernel does. I need to think more about a proper fix. 🤔
Hmm, I think one reason is that we don't know whether truncation will actually be needed before multiplying the values. If it isn't, truncating the inputs early could lose precision that could have been avoided?
A major reason for this work is to provide compatibility with existing queries. If our users happen to use decimal for whatever reason and rely on this truncation behavior, they may not tolerate a different behavior such as their queries failing. Besides, since truncation doesn't always happen but is just a possibility, they might tolerate the approximation in the rare cases where it occurs, but not for all values (this is my own thinking, not feedback from real users, so the major reason is still compatibility).
I want to review this PR when I have time tomorrow. @viirya
FWIW @jackwener has identified that this PR will fix apache/datafusion#5396 downstream in DataFusion.
I have to confess to not really being sure how to move forward with this, as I have a number of concerns:
My understanding is this is trying to emulate the behaviour of Spark decimals, which store precision and scale per value. However, I'm not sure this PR actually achieves this: in the event of overflow it will truncate the precision of all values in the array, potentially truncating values that wouldn't have been truncated by Spark. I'm therefore not really sure what this PR gains us; it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?
I think another concern is that this trade comes with a significant performance cost for the non-overflow case. What if we changed the DataFusion type coercion code so that it was able to handle the large literal cases? Like special-casing it so that it could handle a literal that was too large for its declared type?
In Spark, a decimal arithmetic expression is followed by an expression that rounds the decimals. The Spark analyzer does type coercion for decimal arithmetic operations and inserts such an expression with the required precision/scale. To emulate this, we have a similar expression so that it produces decimals with the same precision/scale as Spark.

In Spark, precision/scale is stored in the decimal objects. Once truncation happens, it only truncates what needs to be truncated, and the follow-up expression adjusts to the required precision/scale if possible. But as we just have i128 values, we cannot truncate a single value there, since that results in inconsistent precision/scale. So it turns out to be the current approach, which truncates all values to the same precision/scale at the end of this kernel. We have a similar expression, as mentioned above, that adjusts the decimals to the required precision/scale for the arithmetic operation.

An alternative, I think, is to produce an internal struct of (i128, precision, scale) during the operation, so we don't need to truncate the other values. The kernel would also take the required precision/scale as parameters and adjust the decimal values accordingly.
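A minimal sketch of that alternative, with hypothetical names (this is not an existing arrow-rs API): each product keeps its own precision/scale until a final pass unifies everything to the required precision/scale.

    /// Hypothetical per-value intermediate: the value plus the precision/scale
    /// it actually ended up with after multiplication and any truncation.
    struct ScaledValue {
        value: i128,
        precision: u8,
        scale: i8,
    }

    /// Adjust one value to the required precision/scale, returning None if it
    /// cannot be represented (mirroring Spark's null-on-overflow behaviour).
    fn adjust(v: ScaledValue, required_precision: u8, required_scale: i8) -> Option<i128> {
        let mut value = v.value;
        let mut scale = v.scale;
        // Drop extra fractional digits (a real kernel would round, not truncate).
        while scale > required_scale {
            value /= 10;
            scale -= 1;
        }
        // Pad with trailing zeros when the value carries fewer fractional digits.
        while scale < required_scale {
            value = value.checked_mul(10)?;
            scale += 1;
        }
        // Reject values that no longer fit into the required precision.
        let limit = 10_i128.checked_pow(required_precision as u32)?;
        (value.unsigned_abs() < limit as u128).then_some(value)
    }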
So if I am following correctly, take the example of an array containing …. In Spark this will result in …, whereas with this PR it will result in a new array with scale 4 and precision 3 containing ….
I don't follow how this is comparable behaviour?
We still ultimately need to convert to the arrow representation, which does not have per-value precision and scale. I don't see how you can avoid this; or, to put it another way, I'm not sure compatibility with Spark's decimal representation is tractable...
I guess some of the values are incorrect.
Spark will insert a follow-up expression to produce a final precision/scale for the decimal, so the final result will be adjusted to the same precision/scale again.
But I thought we just established that the values don't fit into the original precision/scale? I'm sorry but I'm really confused...
Not sure I'm understanding you correctly, but the adjustment is to make sure all values have the same precision/scale if possible. The precision/scale can be inconsistent due to per-value truncation. The follow-up expression will adjust them to the same precision/scale (depending on the arithmetic op, as decided by the Spark type coercion rule). If a value cannot fit into it, the result is a null or a runtime exception, depending on config.
Could you expand on this: how does it decide the output precision/scale to unify to? Is it the same as the input? Is it data dependent? Unless I am mistaken, in arrow the precision and scale are an immutable part of the column's schema that must be established at plan time. I wonder if the issue is that DataFusion is not inspecting the precision in order to work out whether there is a potential for overflow, and scaling the input as appropriate based on this information? This is what I had understood the whole purpose of the precision argument to be for?
The Spark type coercion rule decides the output precision/scale, e.g. https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala#L126-L136. It depends on the input precision/scale. Yea, it is established at plan (query analysis) time.
I think it is not possible to detect potential overflow from the input alone, because whether a value overflows doesn't depend only on precision/scale but on the actual input data.
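For reference, the multiplication case of the linked rule derives the result type from the input types alone; a rough sketch (my paraphrase of the Spark rule, not its actual code):

    // Plan-time result type for decimal multiplication, Spark-style:
    // precision p1 + p2 + 1 is enough to hold any product, and the scales add.
    // The real rule then caps the result at the maximum precision, possibly
    // reducing the scale (see the adjustment sketch further below).
    fn multiply_result_type(p1: u8, s1: i8, p2: u8, s2: i8) -> (u8, i8) {
        (p1 + p2 + 1, s1 + s2)
    }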
Ok, thank you for your patience with me. Would the following therefore be workable and consistent with Spark?
We can then extend this so that if the output precision exceeds the maximum precision of the decimal type, and an allowDecimalPrecisionLoss config option is true, instead of returning a plan error DataFusion can reduce the precision of the result. Looking at the Spark logic for this, it only does so when it will leave at least 6 digits of fractional precision in the result. To facilitate this we can add a …. I think this would address my concerns with this PR, as it would be explicitly opt-in, deterministic, and have adequate performance. It would also have potential applications outside of just decimals. What do you think?
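For reference, a sketch of the Spark-style adjustment being described, where precision loss is only allowed if at least 6 fractional digits remain (the constants and names are my own paraphrase of Spark's logic, not DataFusion code):

    const MAX_PRECISION: u8 = 38;
    const MINIMUM_ADJUSTED_SCALE: i8 = 6;

    // If the derived result type does not fit the decimal type, prefer keeping
    // integral digits by shrinking the scale, but never below six fractional
    // digits (or the derived scale, if that is already smaller).
    fn adjust_precision_scale(precision: u8, scale: i8) -> (u8, i8) {
        if precision <= MAX_PRECISION {
            (precision, scale)
        } else {
            let int_digits = precision as i32 - scale as i32;
            let min_scale = (scale as i32).min(MINIMUM_ADJUSTED_SCALE as i32);
            let adjusted_scale = (MAX_PRECISION as i32 - int_digits).max(min_scale);
            (MAX_PRECISION, adjusted_scale as i8)
        }
    }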
I remember DataFusion already has a type coercion rule for the decimal type that copies the behavior from Spark's (the code link posted above).
This is just the constraint on how a DecimalType is constructed. Obviously it isn't allowed to create a DecimalType over the largest precision. It doesn't actually affect the decimal arithmetic op.
So this is not actually correct. Spark has an expression to check the result of a decimal arithmetic op for overflow. It checks whether the value can fit into the required precision/scale; if not, you get a null or an exception, depending on config.
Let me figure it out. So this sounds similar to this PR, but takes a scale parameter: after multiplication, truncate the result according to the required scale instead of the minimum scale?
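If I've understood the suggestion, a sketch of that shape on plain integers (a hypothetical helper, not the kernel's final API): multiply in a wider type, then rescale once by a fixed, plan-time scale difference rather than by a data-dependent amount.

    /// Fixed-scale multiply of one pair of decimal values with scales s1 and
    /// s2, rescaled to `required_scale`. i64/i128 stand in for the i128/i256
    /// pair the real Decimal128 kernel would use.
    fn mul_fixed_scale(a: i64, b: i64, s1: i8, s2: i8, required_scale: i8) -> Option<i64> {
        let product_scale = s1 + s2;
        // The requested scale cannot exceed what the product naturally carries.
        if required_scale > product_scale {
            return None;
        }
        let wide = (a as i128) * (b as i128);
        // Divide away the fixed number of extra fractional digits, rounding
        // half away from zero.
        let divisor = 10_i128.pow((product_scale - required_scale) as u32);
        let rounded = (wide + wide.signum() * divisor / 2) / divisor;
        // Fail (rather than silently truncate) if the result does not fit.
        i64::try_from(rounded).ok()
    }

Because the rescaling factor is the same for every element, all values in the output share the same precision/scale, which avoids the per-value inconsistency discussed above.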
Looking good, need to give this a review with fresh eyes, but left some initial comments
    assert_eq!(&expected, &result);

    // Rounded the value "1385459527.234567901207133052876543209876543210".
👍
Thanks for the review. I will add …
* Add multiply_decimal.
* Fix scale for multiple value
* Fix
* Add doc
* Update
* Rename to mul_fixed_point_checked
* For review
* More
* Update arrow-arith/src/arithmetic.rs (applied review suggestions, co-authored by Raphael Taylor-Davies <[email protected]>)
* Fix

Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #3689.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?