
Allow precision loss on multiplying decimal arrays #3690

Merged: 18 commits merged into apache:master from fix_decimal_arithmetic3 on Mar 16, 2023

Conversation

@viirya (Member) commented Feb 10, 2023

Which issue does this PR close?

Closes #3689.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 10, 2023
@viirya viirya force-pushed the fix_decimal_arithmetic3 branch from 09acd95 to 7990356 on February 10, 2023 22:47
@tustvold (Contributor) left a comment:

This might be a dumb question, but can we not do something with precision statically instead of per-value. i.e. if multiplying two decimal numbers with precision of 10 and scale 5, the output should have precision of 20 and scale 10. The plan can then intentionally truncate inputs using division if necessary?

This might be another dumb thought, but if you want truncating arithmetic, why not just cast to a floating point representation? It'll be faster and better supported? I guess I had thought the only reason to still bother with decimal precision these days was if you couldn't tolerate approximation?

// The raw product's scale is the sum of the input scales.
let precision = left.precision();
let mut product_scale = left.scale() + right.scale();

math_checked_op(left, right, |a, b| {
Contributor:

This might be a stupid question (I know very little about decimal arithmetic), but if you are just going to truncate the result, why do you need to perform the multiplication as the larger type?

Member Author:

The result of multiplying the values as the smaller type (i.e. i128) could overflow, yet it could still be representable as i128 after we truncate it below. If we don't perform the multiplication as the larger type, I'm not sure we can obtain the (overflowing) intermediate result in order to truncate it?
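
To illustrate with a minimal, self-contained sketch (using i64 widened to i128 in place of arrow's i128/i256; the function and names here are only illustrative):

```rust
/// Multiply two scaled integers in a wider type, then divide the product back
/// down (with rounding) so it fits the narrow type again. Illustrative only:
/// the actual kernel widens i128 values to arrow's i256.
fn mul_then_truncate(a: i64, b: i64, scale_reduction: u32) -> Option<i64> {
    // Widen first, so the intermediate product cannot overflow the narrow type.
    let wide = (a as i128) * (b as i128);

    // Divide by 10^scale_reduction, rounding half away from zero.
    let divisor = 10_i128.checked_pow(scale_reduction)?;
    let (quotient, remainder) = (wide / divisor, wide % divisor);
    let adjust = if remainder.abs() * 2 >= divisor { wide.signum() } else { 0 };

    // Only now check whether the truncated value fits back into the narrow type.
    i64::try_from(quotient + adjust).ok()
}

fn main() {
    // 9_000_000_000 * 9_000_000_000 overflows i64, but after dropping ten
    // digits of scale the result fits again.
    assert_eq!(
        mul_then_truncate(9_000_000_000, 9_000_000_000, 10),
        Some(8_100_000_000)
    );
}
```

Without the widening step the overflowing intermediate product is never observable, so there would be nothing left to truncate.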

Contributor:

Makes sense, I hadn't got to the truncation logic when I wrote this 😄

(review thread on arrow-arith/src/arithmetic.rs, outdated and resolved)
.map(|mut a| {
// Round the value if an exact representation is not possible.
// ref: java.math.BigDecimal#doRound
let mut digits = a.to_string().len() as i8;
Contributor:

log10 perhaps?

Member Author:

log10 on i256?
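
For reference, the two digit-counting approaches compared on plain i128 (whether arrow's i256 exposes an ilog10-style helper is exactly the open question here; the function names are illustrative):

```rust
/// Count decimal digits via string formatting, as the PR code does.
/// Note that the '-' sign is included in the length for negative values.
fn digits_via_string(v: i128) -> usize {
    v.to_string().len()
}

/// Count decimal digits arithmetically, avoiding a heap allocation.
/// Integer ilog10 requires Rust 1.67+.
fn digits_via_ilog10(v: i128) -> u32 {
    if v == 0 {
        1
    } else {
        v.unsigned_abs().ilog10() + 1
    }
}

fn main() {
    assert_eq!(digits_via_string(12345), 5);
    assert_eq!(digits_via_ilog10(12345), 5);
    assert_eq!(digits_via_ilog10(-12345), 5);
    // The string-based count includes the sign:
    assert_eq!(digits_via_string(-12345), 6);
}
```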

// Truncate by a power of ten and reduce the result scale accordingly.
while diff > 0 {
    let divisor = i256::from_i128(10).pow_wrapping(diff as u32);
    a = divide_and_round::<Decimal256Type>(a, divisor);
    product_scale -= diff;
Contributor:

I'm not sure this is correct, as it won't "recompute" the values that have already been evaluated?

Member Author:

Do you mean `a`? It is divided and rounded.

Contributor:

If say the 5th value of the array triggers a reduction in precision, we will apply the truncation to all values that follow, but the first four values will have been computed with the original product_scale?

Member Author:

Aah, I think you're correct. I forgot that in the JVM each value is a decimal object that keeps its own precision/scale, so each value can have a different precision/scale after the computation. I need to rethink this for a fix.

Member Author:

As the behavior we're trying to match is row-based and object-based, it differs in many respects from what an array kernel can do. I need to think more about a proper fix. 🤔

@viirya viirya marked this pull request as draft February 12, 2023 19:38
@viirya (Member Author) commented Feb 12, 2023

> This might be a dumb question, but can we not do something with precision statically instead of per-value. i.e. if multiplying two decimal numbers with precision of 10 and scale 5, the output should have precision of 20 and scale 10. The plan can then intentionally truncate inputs using division if necessary?

Hmm, I think one reason is that we cannot know whether truncation will actually be needed before multiplying the values. If it isn't needed, truncating the inputs early could lose precision that would otherwise be avoidable.

> This might be another dumb thought, but if you want truncating arithmetic, why not just cast to a floating point representation? It'll be faster and better supported? I guess I had thought the only reason to still bother with decimal precision these days was if you couldn't tolerate approximation?

A major reason for this work is compatibility with existing queries. If our users happen to use decimals for whatever reason and rely on this truncation behavior, they may not tolerate a different behavior such as having their queries fail. Besides, since truncation is not guaranteed to happen but is only a possibility, they might tolerate approximation in the rare cases where it occurs, but not for all values (this is my own reasoning, not feedback from real users, so the main reason remains compatibility).

@viirya viirya marked this pull request as ready for review February 14, 2023 03:01
@liukun4515 (Contributor) commented:

I want to review this PR when I have time tomorrow. @viirya

@alamb (Contributor) commented Mar 2, 2023

FWIW @jackwener has identified that this PR will fix apache/datafusion#5396 downstream in DataFusion

@tustvold (Contributor) commented Mar 6, 2023

I have to confess to not really being sure how to move forward with this, as I have a number of concerns:

  • The performance will be catastrophic: it formats to strings and performs multiple memory allocations per operation
  • The data type of the output depends on the input data, something that most query engines, including DataFusion, aren't set up to handle
  • The precision loss bleeds across rows, and therefore depends on the batching, which typically isn't guaranteed in systems like DataFusion

My understanding is this is trying to emulate the behaviour of Spark decimals, which store precision and scale per value. However, I'm not sure this PR actually achieves that: in the event of overflow it will truncate the precision of all values in the array, potentially truncating values that wouldn't have been truncated by Spark. I'm therefore not really sure what this PR gains us; it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?

@alamb (Contributor) commented Mar 6, 2023

> it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?

I think another concern is that this trade comes with a significant performance cost for the non-overflow case.

What if we changed the DataFusion type coercion code so that it was able to handle the large literal cases? For example, special-casing
https://github.com/apache/arrow-datafusion/blob/99ef989312292b04efdf06b178617e9008b46c84/datafusion/optimizer/src/type_coercion.rs#L589-L598
so that it could handle a literal that is too large for its declared type?

@viirya (Member Author) commented Mar 6, 2023

> My understanding is this is trying to emulate the behaviour of Spark decimals, which store precision and scale per value. However, I'm not sure this PR actually achieves that: in the event of overflow it will truncate the precision of all values in the array, potentially truncating values that wouldn't have been truncated by Spark. I'm therefore not really sure what this PR gains us; it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?

In Spark, a decimal expression is followed by an expression that rounds the decimals. The Spark analyzer performs type coercion for decimal arithmetic operations and inserts such an expression with the required precision/scale. To emulate this too, we have a similar expression so that it produces decimals with the same precision/scale as Spark.

In Spark, precision/scale is stored in each decimal object. When truncation happens, only the values that need it are truncated. The follow-up expression then adjusts them to the required precision/scale if possible.

But as we only have i128 values, we cannot truncate a single value there, as that would result in inconsistent precision/scale across the array. Hence the current approach, which truncates all values to the same precision/scale at the end of this kernel. As mentioned above, we have a similar expression that adjusts decimals to the precision/scale required for the arithmetic operation.

An alternative, I think, is to produce an internal struct of (i128, precision, scale) during the operation, so that we don't need to truncate the other values. The kernel would also take the required precision/scale as parameters and adjust the decimal values accordingly (see the sketch below).
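
Purely as an illustration of that alternative (hypothetical names and types, not what this PR implements), the per-value intermediate could look something like this, with a final per-value adjustment to the required precision/scale:

```rust
/// Hypothetical per-value intermediate, mirroring how Spark's Decimal objects
/// carry their own precision/scale (illustrative only).
struct ScaledValue {
    value: i128,
    precision: u8,
    scale: i8,
}

impl ScaledValue {
    /// Rescale to the precision/scale required by the arithmetic operation,
    /// rounding half away from zero; None means the value does not fit.
    fn adjust_to(&self, target_precision: u8, target_scale: i8) -> Option<i128> {
        // This sketch only handles scaling down (dropping fractional digits).
        let diff = i32::from(self.scale) - i32::from(target_scale);
        if diff < 0 {
            return None;
        }
        let divisor = 10_i128.checked_pow(diff as u32)?;
        let (q, r) = (self.value / divisor, self.value % divisor);
        let adjusted = q + if r.abs() * 2 >= divisor { self.value.signum() } else { 0 };
        // Reject values that do not fit the target precision.
        let limit = 10_i128.checked_pow(u32::from(target_precision))?;
        (adjusted.abs() < limit).then_some(adjusted)
    }
}
```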

@tustvold (Contributor) commented Mar 6, 2023

So if I am following correctly, take the example of an array containing [1, 10, 100] multiplied by itself, with scale 2 and precision 3

In spark this will result in

1: {value: 1, scale: 2, precision: 3},
2: {value: 100, scale: 2, precision: 3},
3: {value: 100, scale: 0, precision: 3},

With this PR this will result in a new array with scale 4 and precision 3 containing

1: 0,
2: 1,
3: 100

I don't follow how this is comparable behaviour?

> is to produce an internal struct of (i128, precision, scale) during the operation

We still ultimately need to convert to the arrow representation, which does not have per-value precision and scale. I don't see how you can avoid this, or to put it another way, I'm not sure compatibility with Spark's decimal representation is tractable...

@viirya (Member Author) commented Mar 6, 2023

> So if I am following correctly, take the example of an array containing [1, 10, 100] multiplied by itself, with scale 2 and precision 3
>
> In spark this will result in
>
> 1: {value: 1, scale: 2, precision: 3},
> 2: {value: 100, scale: 2, precision: 3},
> 3: {value: 100, scale: 0, precision: 3},

I think some of the values are incorrect. {value: 100, scale: 0, precision: 3} is 100.0, but I think the correct one should be 1.0? I guess you mean {value: 1, scale: 0, precision: 3}? I think the result (if the max precision is 3) is:

1: {value: 0, scale: 0, precision: 3},
2: {value: 1, scale: 2, precision: 3},
3: {value: 100, scale: 2, precision: 3},

Spark will insert a follow-up expression that sets a final precision/scale for the decimal, so the final result will be adjusted to the same precision/scale again.

@tustvold (Contributor) commented Mar 6, 2023

> So the final result will be adjusted to the same precision/scale again

But I thought we just established that the values don't fit into the original precision/scale? I'm sorry but I'm really confused...

@viirya (Member Author) commented Mar 6, 2023

> So the final result will be adjusted to the same precision/scale again
>
> But I thought we just established that the values don't fit into the original precision/scale? I'm sorry but I'm really confused...

I'm not sure I follow, but the adjustment is to make sure all values have the same precision/scale if possible.

The precision/scale can become inconsistent because truncation happens per value. The follow-up expression then adjusts them all to the same precision/scale (which depends on the arithmetic op and is decided by Spark's type coercion rule). If a value cannot fit into it, the result is either null or a runtime exception, depending on configuration.

@tustvold (Contributor) commented Mar 6, 2023

> The follow-up expression will adjust them to the same precision/scale

Could you expand on this: how does it decide the output precision/scale to unify to? Is it the same as the input? Is it data-dependent? Unless I am mistaken, in arrow the precision and scale are an immutable part of the column's schema that must be established at plan time.

I wonder if the issue is that DataFusion is not inspecting the precision in order to work out if there is a potential for overflow, and scaling the input as appropriate based on this information? This is what I had understood the whole purpose of the precision argument to be for?

@viirya (Member Author) commented Mar 7, 2023

> Could you expand on this: how does it decide the output precision/scale to unify to? Is it the same as the input? Is it data-dependent? Unless I am mistaken, in arrow the precision and scale are an immutable part of the column's schema that must be established at plan time.

Spark's type coercion rule decides the output precision/scale, e.g. https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala#L126-L136. It depends on the input precision/scale, not on the data. And yes, it is established at plan (query analysis) time.

> I wonder if the issue is that DataFusion is not inspecting the precision in order to work out if there is a potential for overflow, and scaling the input as appropriate based on this information? This is what I had understood the whole purpose of the precision argument to be for?

I don't think it is possible to detect potential overflow by inspecting the inputs' types alone, because whether overflow happens depends not only on the precision/scale but on the actual input data.

@tustvold (Contributor) commented Mar 7, 2023

Ok, thank you for your patience with me. Would the following therefore be workable and consistent with Spark?

  • DataFusion is updated to follow Spark's type inference for arithmetic on decimals (if it doesn't already)
  • In particular, multiplication of decimals with precision and scale (p1, s1) and (p2, s2) yields (p1 + p2 + 1, s1 + s2) (a worked example follows this list)
  • By default, an error is returned if the output precision exceeds the maximum precision of the decimal type - as per Spark
  • Arithmetic can proceed as normal using the regular unchecked arithmetic kernels, as the precision ensures overflow cannot occur
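
As a quick worked instance of that multiplication rule (my own numbers): Decimal(10, 5) times Decimal(10, 5) yields Decimal(10 + 10 + 1, 5 + 5) = Decimal(21, 10), which still fits the Decimal128 maximum precision of 38, so no precision loss is needed; two Decimal(20, 10) inputs would instead request Decimal(41, 20) and hit the precision-loss handling described below.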

We can then extend this so that if the output precision exceeds the maximum precision of the decimal type, and an allowDecimalPrecisionLoss config option is true, instead of returning a plan error DataFusion can reduce the precision of the result. Looking at the Spark logic for this, it only does this when it will leave at least 6 digits of fractional precision in the result.

To facilitate this we can add a mul_fixed_point and mul_fixed_point_checked to arrow-rs, along with potentially scalar variants, that in addition to their regular inputs, also take a scale. I suspect there is a more efficient way to do this, but at least initially this could just perform an unchecked/checked widening multiplication, divide the result, and then perform an unchecked/checked truncation. Initially we could just support i128, but I think supporting all integral types eventually would be cool.
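
As a rough sketch of what such a kernel might look like (the name, signature, and error handling are illustrative assumptions, using i64 storage widened to i128 as a stand-in for arrow's i128/i256 pair; this is not the API that eventually landed):

```rust
/// Multiply two fixed-point columns, producing results at `required_scale`.
/// The checked behaviour errors when a result does not fit the storage type.
fn mul_fixed_point_checked_sketch(
    left: &[i64],
    right: &[i64],
    input_scale_sum: u32, // s1 + s2, the natural scale of the raw product
    required_scale: u32,  // the scale requested for the result
) -> Result<Vec<i64>, String> {
    let digits_to_drop = input_scale_sum
        .checked_sub(required_scale)
        .ok_or("required scale larger than the product scale")?;
    let divisor = 10_i128
        .checked_pow(digits_to_drop)
        .ok_or("scale reduction too large")?;

    left.iter()
        .zip(right)
        .map(|(&a, &b)| {
            // Widen first so the intermediate product cannot overflow.
            let wide = (a as i128) * (b as i128);
            // Divide down to the requested scale, rounding half away from zero.
            let (q, r) = (wide / divisor, wide % divisor);
            let rounded = q + if r.abs() * 2 >= divisor { wide.signum() } else { 0 };
            // Checked narrowing back to the storage type: this is where the
            // checked variant errors while an unchecked one could wrap.
            i64::try_from(rounded)
                .map_err(|_| format!("{a} * {b} does not fit at scale {required_scale}"))
        })
        .collect()
}
```

For example, with input_scale_sum = 4 and required_scale = 2, multiplying 1.50 by 2.05 (stored as 150 and 205) gives the raw product 30750 at scale 4, which the sketch rounds to 308, i.e. 3.08.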

I think this would address my concerns with this PR as it would be explicitly opt-in, deterministic, and have adequate performance. It would also have potential applications outside of just decimals. What do you think?

@viirya (Member Author) commented Mar 7, 2023

> • DataFusion is updated to follow Spark's type inference for arithmetic on decimals (if it doesn't already)
> • In particular, multiplication of decimals with precision and scale (p1, s1) and (p2, s2) yields (p1 + p2 + 1, s1 + s2)

I remember DataFusion already has a type coercion rule for the decimal type that copies the behavior of Spark's (the code link posted above).

> • By default, an error is returned if the output precision exceeds the maximum precision of the decimal type - as per Spark

This is just a constraint on how a DecimalType is constructed. Obviously it isn't allowed to create a DecimalType beyond the largest precision, but that doesn't actually affect the decimal arithmetic op.

> • Arithmetic can proceed as normal using the regular unchecked arithmetic kernels, as the precision ensures overflow cannot occur

This is actually not correct. Spark has an expression that checks the result of a decimal arithmetic op for overflow: it checks whether the value fits into the required precision/scale, and if not it yields null or an exception, depending on configuration.

> We can then extend this so that if the output precision exceeds the maximum precision of the decimal type, and an allowDecimalPrecisionLoss config option is true, instead of returning a plan error DataFusion can reduce the precision of the result. Looking at the Spark logic for this, it only does this when it will leave at least 6 digits of fractional precision in the result.

adjustPrecisionScale is only used by the type coercion rule to adjust the precision/scale for a decimal arithmetic op. I think DataFusion should already have it, as it copied Spark's decimal type coercion rule. In other words, these items are essentially analysis-phase rules that determine the decimal type of the arithmetic op's result; they don't guide how the op actually executes. After the op, another expression checks for overflow. Although Spark tweaks this process slightly in version 3.4, it remains essentially the same.

> To facilitate this we can add a mul_fixed_point and mul_fixed_point_checked to arrow-rs, along with potentially scalar variants, that in addition to their regular inputs, also take a scale. I suspect there is a more efficient way to do this, but at least initially this could just perform an unchecked/checked widening multiplication, divide the result, and then perform an unchecked/checked truncation. Initially we could just support i128, but I think supporting all integral types eventually would be cool.

Let me make sure I follow. This sounds similar to this PR, but it takes a scale parameter, and after the multiplication the result is truncated according to the required scale instead of the minimum scale?

@tustvold tustvold marked this pull request as draft March 10, 2023 15:01
@viirya viirya force-pushed the fix_decimal_arithmetic3 branch from df9e047 to c9d61ba on March 14, 2023 06:41
@viirya viirya marked this pull request as ready for review March 14, 2023 19:03
@tustvold (Contributor) left a comment:

Looking good, need to give this a review with fresh eyes, but left some initial comments

(13 review threads on arrow-arith/src/arithmetic.rs and arrow-arith/src/arity.rs, all resolved)

assert_eq!(&expected, &result);

// Rounded the value "1385459527.234567901207133052876543209876543210".
Contributor:

👍

(review thread on arrow-arith/src/arithmetic.rs, resolved)
viirya and others added 10 commits March 15, 2023 11:32
Co-authored-by: Raphael Taylor-Davies <[email protected]>
@viirya (Member Author) commented Mar 16, 2023

Thanks for the review. I will add mul_fixed_point and scalar variants in other PRs.

@viirya viirya merged commit f4ac4e4 into apache:master Mar 16, 2023
metesynnada pushed a commit to synnada-ai/arrow-rs that referenced this pull request Mar 16, 2023
* Add multiply_decimal.

* Fix scale for multiple value

* Fix

* Add doc

* Update

* Rename to mul_fixed_point_checked

* For review

* More

* Update arrow-arith/src/arithmetic.rs (nine such commits)

* Fix

Co-authored-by: Raphael Taylor-Davies <[email protected]>
Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues: Allow precision loss on multiplying decimal arrays

4 participants