
Allow precision loss on multiplying decimal arrays #3690

Merged: 18 commits merged into apache:master from fix_decimal_arithmetic3 on Mar 16, 2023

Conversation

@viirya (Member) commented Feb 10, 2023

Which issue does this PR close?

Closes #3689.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 10, 2023
@viirya viirya force-pushed the fix_decimal_arithmetic3 branch from 09acd95 to 7990356 on February 10, 2023 22:47
@tustvold (Contributor) left a comment:

This might be a dumb question, but can we not do something with precision statically instead of per-value. i.e. if multiplying two decimal numbers with precision of 10 and scale 5, the output should have precision of 20 and scale 10. The plan can then intentionally truncate inputs using division if necessary?

This might be another dumb thought, but if you want truncating arithmetic, why not just cast to a floating point representation? It'll be faster and better supported? I guess I had thought the only reason to still bother with decimal precision these days was if you couldn't tolerate approximation?

// The raw product's scale is the sum of the input scales.
let precision = left.precision();
let mut product_scale = left.scale() + right.scale();

math_checked_op(left, right, |a, b| {
Contributor:

This might be a stupid question (I know very little about decimal arithmetic), but if you are just going to truncate the result, why do you need to perform the multiplication as the larger type?

Member Author:

The result of multiplying the values as the smaller type (i.e. i128) could overflow, yet it could still be representable as i128 after we truncate it below. If we don't perform the multiplication as the larger type, I'm not sure we can obtain the (overflowing) intermediate result in order to truncate it?
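
To illustrate with a minimal, self-contained sketch (using i64 widened to i128 in place of arrow's i128/i256; the function and names here are only illustrative):

```rust
/// Multiply two scaled integers in a wider type, then divide the product back
/// down (with rounding) so it fits the narrow type again. Illustrative only:
/// the actual kernel widens i128 values to arrow's i256.
fn mul_then_truncate(a: i64, b: i64, scale_reduction: u32) -> Option<i64> {
    // Widen first, so the intermediate product cannot overflow the narrow type.
    let wide = (a as i128) * (b as i128);

    // Divide by 10^scale_reduction, rounding half away from zero.
    let divisor = 10_i128.checked_pow(scale_reduction)?;
    let (quotient, remainder) = (wide / divisor, wide % divisor);
    let adjust = if remainder.abs() * 2 >= divisor { wide.signum() } else { 0 };

    // Only now check whether the truncated value fits back into the narrow type.
    i64::try_from(quotient + adjust).ok()
}

fn main() {
    // 9_000_000_000 * 9_000_000_000 overflows i64, but after dropping ten
    // digits of scale the result fits again.
    assert_eq!(
        mul_then_truncate(9_000_000_000, 9_000_000_000, 10),
        Some(8_100_000_000)
    );
}
```

Without the widening step the overflowing intermediate product is never observable, so there would be nothing left to truncate.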

Contributor:

Makes sense, I hadn't got to the truncation logic when I wrote this 😄

(review thread on arrow-arith/src/arithmetic.rs, outdated and resolved)
.map(|mut a| {
// Round the value if an exact representation is not possible.
// ref: java.math.BigDecimal#doRound
let mut digits = a.to_string().len() as i8;
Contributor:

log10 perhaps?

Member Author:

log10 on i256?
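
For reference, the two digit-counting approaches compared on plain i128 (whether arrow's i256 exposes an ilog10-style helper is exactly the open question here; the function names are illustrative):

```rust
/// Count decimal digits via string formatting, as the PR code does.
/// Note that the '-' sign is included in the length for negative values.
fn digits_via_string(v: i128) -> usize {
    v.to_string().len()
}

/// Count decimal digits arithmetically, avoiding a heap allocation.
/// Integer ilog10 requires Rust 1.67+.
fn digits_via_ilog10(v: i128) -> u32 {
    if v == 0 {
        1
    } else {
        v.unsigned_abs().ilog10() + 1
    }
}

fn main() {
    assert_eq!(digits_via_string(12345), 5);
    assert_eq!(digits_via_ilog10(12345), 5);
    assert_eq!(digits_via_ilog10(-12345), 5);
    // The string-based count includes the sign:
    assert_eq!(digits_via_string(-12345), 6);
}
```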

// Truncate by a power of ten and reduce the result scale accordingly.
while diff > 0 {
    let divisor = i256::from_i128(10).pow_wrapping(diff as u32);
    a = divide_and_round::<Decimal256Type>(a, divisor);
    product_scale -= diff;
Contributor:

I'm not sure this is correct, as it won't "recompute" the values that have already been evaluated?

Member Author:

Do you mean `a`? It is divided and rounded.

Contributor:

If say the 5th value of the array triggers a reduction in precision, we will apply the truncation to all values that follow, but the first four values will have been computed with the original product_scale?

Member Author:

Aah, I think you're correct. I forgot that in the JVM each value is a decimal object that keeps its own precision/scale, so each value can have a different precision/scale after the computation. I need to rethink this for a fix.

Member Author:

As the behavior we're trying to match is row-based and object-based, it differs in many respects from what an array kernel can do. I need to think more about a proper fix. 🤔

@viirya viirya marked this pull request as draft February 12, 2023 19:38
@viirya (Member Author) commented Feb 12, 2023

> This might be a dumb question, but can we not do something with precision statically instead of per-value. i.e. if multiplying two decimal numbers with precision of 10 and scale 5, the output should have precision of 20 and scale 10. The plan can then intentionally truncate inputs using division if necessary?

Hmm, I think one reason is that we cannot know whether truncation will actually be needed before multiplying the values. If it isn't needed, truncating the inputs early could lose precision that would otherwise be avoidable.

> This might be another dumb thought, but if you want truncating arithmetic, why not just cast to a floating point representation? It'll be faster and better supported? I guess I had thought the only reason to still bother with decimal precision these days was if you couldn't tolerate approximation?

A major reason for this work is compatibility with existing queries. If our users happen to use decimals for whatever reason and rely on this truncation behavior, they may not tolerate a different behavior such as having their queries fail. Besides, since truncation is not guaranteed to happen but is only a possibility, they might tolerate approximation in the rare cases where it occurs, but not for all values (this is my own reasoning, not feedback from real users, so the main reason remains compatibility).

@viirya viirya marked this pull request as ready for review February 14, 2023 03:01
@liukun4515 (Contributor) commented:

I want to review this PR when I have time tomorrow. @viirya

@alamb (Contributor) commented Mar 2, 2023

FWIW @jackwener has identified that this PR will fix apache/datafusion#5396 downstream in DataFusion

@tustvold (Contributor) commented Mar 6, 2023

I have to confess to not really being sure how to move forward with this, as I have a number of concerns:

  • The performance will be catastrophic: it formats to strings and performs multiple memory allocations per operation
  • The data type of the output depends on the input data, something that most query engines, including DataFusion, aren't set up to handle
  • The precision loss bleeds across rows, and therefore depends on the batching, which typically isn't guaranteed in systems like DataFusion

My understanding is this is trying to emulate the behaviour of Spark decimals, which store precision and scale per value. However, I'm not sure this PR actually achieves that: in the event of overflow it will truncate the precision of all values in the array, potentially truncating values that wouldn't have been truncated by Spark. I'm therefore not really sure what this PR gains us; it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?

@alamb (Contributor) commented Mar 6, 2023

> it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?

I think another concern is that this trade comes with a significant performance cost for the non-overflow case.

What if we changed the DataFusion type coercion code so that it was able to handle the large literal cases? For example, special-casing
https://github.com/apache/arrow-datafusion/blob/99ef989312292b04efdf06b178617e9008b46c84/datafusion/optimizer/src/type_coercion.rs#L589-L598
so that it could handle a literal that is too large for its declared type?

@viirya (Member Author) commented Mar 6, 2023

> My understanding is this is trying to emulate the behaviour of Spark decimals, which store precision and scale per value. However, I'm not sure this PR actually achieves that: in the event of overflow it will truncate the precision of all values in the array, potentially truncating values that wouldn't have been truncated by Spark. I'm therefore not really sure what this PR gains us; it trades a behaviour change that results in an error for a behaviour change that results in silent truncation? Am I missing something here?

In Spark, a decimal expression is followed by an expression that rounds the decimals. The Spark analyzer performs type coercion for decimal arithmetic operations and inserts such an expression with the required precision/scale. To emulate this too, we have a similar expression so that it produces decimals with the same precision/scale as Spark.

In Spark, precision/scale is stored in each decimal object. When truncation happens, only the values that need it are truncated. The follow-up expression then adjusts them to the required precision/scale if possible.

But as we only have i128 values, we cannot truncate a single value there, as that would result in inconsistent precision/scale across the array. Hence the current approach, which truncates all values to the same precision/scale at the end of this kernel. As mentioned above, we have a similar expression that adjusts decimals to the precision/scale required for the arithmetic operation.

An alternative, I think, is to produce an internal struct of (i128, precision, scale) during the operation, so that we don't need to truncate the other values. The kernel would also take the required precision/scale as parameters and adjust the decimal values accordingly (see the sketch below).
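
Purely as an illustration of that alternative (hypothetical names and types, not what this PR implements), the per-value intermediate could look something like this, with a final per-value adjustment to the required precision/scale:

```rust
/// Hypothetical per-value intermediate, mirroring how Spark's Decimal objects
/// carry their own precision/scale (illustrative only).
struct ScaledValue {
    value: i128,
    precision: u8,
    scale: i8,
}

impl ScaledValue {
    /// Rescale to the precision/scale required by the arithmetic operation,
    /// rounding half away from zero; None means the value does not fit.
    fn adjust_to(&self, target_precision: u8, target_scale: i8) -> Option<i128> {
        // This sketch only handles scaling down (dropping fractional digits).
        let diff = i32::from(self.scale) - i32::from(target_scale);
        if diff < 0 {
            return None;
        }
        let divisor = 10_i128.checked_pow(diff as u32)?;
        let (q, r) = (self.value / divisor, self.value % divisor);
        let adjusted = q + if r.abs() * 2 >= divisor { self.value.signum() } else { 0 };
        // Reject values that do not fit the target precision.
        let limit = 10_i128.checked_pow(u32::from(target_precision))?;
        (adjusted.abs() < limit).then_some(adjusted)
    }
}
```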

@tustvold (Contributor) commented Mar 6, 2023

So if I am following correctly, take the example of an array containing [1, 10, 100] multiplied by itself, with scale 2 and precision 3

In spark this will result in

1: {value: 1, scale: 2, precision: 3},
2: {value: 100, scale: 2, precision: 3},
3: {value: 100, scale: 0, precision: 3},

With this PR this will result in a new array with scale 4 and precision 3 containing

1: 0,
2: 1,
3: 100

I don't follow how this is comparable behaviour?

> is to produce an internal struct of (i128, precision, scale) during the operation

We still ultimately need to convert to the arrow representation, which does not have per-value precision and scale. I don't see how you can avoid this, or to put it another way, I'm not sure compatibility with Spark's decimal representation is tractable...

@viirya (Member Author) commented Mar 6, 2023

> So if I am following correctly, take the example of an array containing [1, 10, 100] multiplied by itself, with scale 2 and precision 3
>
> In spark this will result in
>
> 1: {value: 1, scale: 2, precision: 3},
> 2: {value: 100, scale: 2, precision: 3},
> 3: {value: 100, scale: 0, precision: 3},

I think some of the values are incorrect. {value: 100, scale: 0, precision: 3} is 100.0, but I think the correct one should be 1.0? I guess you mean {value: 1, scale: 0, precision: 3}? I think the result (if the max precision is 3) is:

1: {value: 0, scale: 0, precision: 3},
2: {value: 1, scale: 2, precision: 3},
3: {value: 100, scale: 2, precision: 3},

Spark will insert a follow-up expression that sets a final precision/scale for the decimal, so the final result will be adjusted to the same precision/scale again.

@tustvold (Contributor) commented Mar 6, 2023

> So the final result will be adjusted to the same precision/scale again

But I thought we just established that the values don't fit into the original precision/scale? I'm sorry but I'm really confused...

@viirya (Member Author) commented Mar 6, 2023

> So the final result will be adjusted to the same precision/scale again
>
> But I thought we just established that the values don't fit into the original precision/scale? I'm sorry but I'm really confused...

I'm not sure I follow, but the adjustment is to make sure all values have the same precision/scale if possible.

The precision/scale can become inconsistent because truncation happens per value. The follow-up expression then adjusts them all to the same precision/scale (which depends on the arithmetic op and is decided by Spark's type coercion rule). If a value cannot fit into it, the result is either null or a runtime exception, depending on configuration.

@tustvold (Contributor) commented Mar 6, 2023

> The follow-up expression will adjust them to the same precision/scale

Could you expand on this: how does it decide the output precision/scale to unify to? Is it the same as the input? Is it data-dependent? Unless I am mistaken, in arrow the precision and scale are an immutable part of the column's schema that must be established at plan time.

I wonder if the issue is that DataFusion is not inspecting the precision in order to work out if there is a potential for overflow, and scaling the input as appropriate based on this information? This is what I had understood the whole purpose of the precision argument to be for?

@viirya (Member Author) commented Mar 7, 2023

> Could you expand on this: how does it decide the output precision/scale to unify to? Is it the same as the input? Is it data-dependent? Unless I am mistaken, in arrow the precision and scale are an immutable part of the column's schema that must be established at plan time.

Spark's type coercion rule decides the output precision/scale, e.g. https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala#L126-L136. It depends on the input precision/scale, not on the data. And yes, it is established at plan (query analysis) time.

> I wonder if the issue is that DataFusion is not inspecting the precision in order to work out if there is a potential for overflow, and scaling the input as appropriate based on this information? This is what I had understood the whole purpose of the precision argument to be for?

I don't think it is possible to detect potential overflow by inspecting the inputs' types alone, because whether overflow happens depends not only on the precision/scale but on the actual input data.

@tustvold (Contributor) commented Mar 7, 2023

Ok, thank you for your patience with me. Would the following therefore be workable and consistent with Spark?

  • DataFusion is updated to follow Spark's type inference for arithmetic on decimals (if it doesn't already)
  • In particular, multiplication of decimals with precision and scale (p1, s1) and (p2, s2) yields (p1 + p2 + 1, s1 + s2) (a worked example follows this list)
  • By default, an error is returned if the output precision exceeds the maximum precision of the decimal type - as per Spark
  • Arithmetic can proceed as normal using the regular unchecked arithmetic kernels, as the precision ensures overflow cannot occur
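
As a quick worked instance of that multiplication rule (my own numbers): Decimal(10, 5) times Decimal(10, 5) yields Decimal(10 + 10 + 1, 5 + 5) = Decimal(21, 10), which still fits the Decimal128 maximum precision of 38, so no precision loss is needed; two Decimal(20, 10) inputs would instead request Decimal(41, 20) and hit the precision-loss handling described below.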

We can then extend this so that if the output precision exceeds the maximum precision of the decimal type, and an allowDecimalPrecisionLoss config option is true, instead of returning a plan error DataFusion can reduce the precision of the result. Looking at the Spark logic for this, it only does this when it will leave at least 6 digits of fractional precision in the result.

To facilitate this we can add a mul_fixed_point and mul_fixed_point_checked to arrow-rs, along with potentially scalar variants, that in addition to their regular inputs, also take a scale. I suspect there is a more efficient way to do this, but at least initially this could just perform an unchecked/checked widening multiplication, divide the result, and then perform an unchecked/checked truncation. Initially we could just support i128, but I think supporting all integral types eventually would be cool.
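
As a rough sketch of what such a kernel might look like (the name, signature, and error handling are illustrative assumptions, using i64 storage widened to i128 as a stand-in for arrow's i128/i256 pair; this is not the API that eventually landed):

```rust
/// Multiply two fixed-point columns, producing results at `required_scale`.
/// The checked behaviour errors when a result does not fit the storage type.
fn mul_fixed_point_checked_sketch(
    left: &[i64],
    right: &[i64],
    input_scale_sum: u32, // s1 + s2, the natural scale of the raw product
    required_scale: u32,  // the scale requested for the result
) -> Result<Vec<i64>, String> {
    let digits_to_drop = input_scale_sum
        .checked_sub(required_scale)
        .ok_or("required scale larger than the product scale")?;
    let divisor = 10_i128
        .checked_pow(digits_to_drop)
        .ok_or("scale reduction too large")?;

    left.iter()
        .zip(right)
        .map(|(&a, &b)| {
            // Widen first so the intermediate product cannot overflow.
            let wide = (a as i128) * (b as i128);
            // Divide down to the requested scale, rounding half away from zero.
            let (q, r) = (wide / divisor, wide % divisor);
            let rounded = q + if r.abs() * 2 >= divisor { wide.signum() } else { 0 };
            // Checked narrowing back to the storage type: this is where the
            // checked variant errors while an unchecked one could wrap.
            i64::try_from(rounded)
                .map_err(|_| format!("{a} * {b} does not fit at scale {required_scale}"))
        })
        .collect()
}
```

For example, with input_scale_sum = 4 and required_scale = 2, multiplying 1.50 by 2.05 (stored as 150 and 205) gives the raw product 30750 at scale 4, which the sketch rounds to 308, i.e. 3.08.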

I think this would address my concerns with this PR as it would be explicitly opt-in, deterministic, and have adequate performance. It would also have potential applications outside of just decimals. What do you think?

@viirya (Member Author) commented Mar 7, 2023

> • DataFusion is updated to follow Spark's type inference for arithmetic on decimals (if it doesn't already)
> • In particular, multiplication of decimals with precision and scale (p1, s1) and (p2, s2) yields (p1 + p2 + 1, s1 + s2)

I remember DataFusion already has a type coercion rule for the decimal type that copies the behavior of Spark's (the code link posted above).

> • By default, an error is returned if the output precision exceeds the maximum precision of the decimal type - as per Spark

This is just a constraint on how a DecimalType is constructed. Obviously it isn't allowed to create a DecimalType beyond the largest precision, but that doesn't actually affect the decimal arithmetic op.

> • Arithmetic can proceed as normal using the regular unchecked arithmetic kernels, as the precision ensures overflow cannot occur

This is actually not correct. Spark has an expression that checks the result of a decimal arithmetic op for overflow: it checks whether the value fits into the required precision/scale, and if not it yields null or an exception, depending on configuration.

> We can then extend this so that if the output precision exceeds the maximum precision of the decimal type, and an allowDecimalPrecisionLoss config option is true, instead of returning a plan error DataFusion can reduce the precision of the result. Looking at the Spark logic for this, it only does this when it will leave at least 6 digits of fractional precision in the result.

adjustPrecisionScale is only used by the type coercion rule to adjust the precision/scale for a decimal arithmetic op. I think DataFusion should already have it, as it copied Spark's decimal type coercion rule. In other words, these items are essentially analysis-phase rules that determine the decimal type of the arithmetic op's result; they don't guide how the op actually executes. After the op, another expression checks for overflow. Although Spark tweaks this process slightly in version 3.4, it remains essentially the same.

> To facilitate this we can add a mul_fixed_point and mul_fixed_point_checked to arrow-rs, along with potentially scalar variants, that in addition to their regular inputs, also take a scale. I suspect there is a more efficient way to do this, but at least initially this could just perform an unchecked/checked widening multiplication, divide the result, and then perform an unchecked/checked truncation. Initially we could just support i128, but I think supporting all integral types eventually would be cool.

Let me make sure I follow. This sounds similar to this PR, but it takes a scale parameter, and after the multiplication the result is truncated according to the required scale instead of the minimum scale?

@tustvold tustvold marked this pull request as draft March 10, 2023 15:01
@viirya viirya force-pushed the fix_decimal_arithmetic3 branch from df9e047 to c9d61ba on March 14, 2023 06:41
@viirya viirya marked this pull request as ready for review March 14, 2023 19:03
@tustvold (Contributor) left a comment:

Looking good, need to give this a review with fresh eyes, but left some initial comments

(13 review threads on arrow-arith/src/arithmetic.rs and arrow-arith/src/arity.rs, all resolved)

assert_eq!(&expected, &result);

// Rounded the value "1385459527.234567901207133052876543209876543210".
Contributor:

👍

(review thread on arrow-arith/src/arithmetic.rs, resolved)
viirya and others added 10 commits March 15, 2023 11:32
Co-authored-by: Raphael Taylor-Davies <[email protected]>
@viirya (Member Author) commented Mar 16, 2023

Thanks for the review. I will add mul_fixed_point and scalar variants in other PRs.

@viirya viirya merged commit f4ac4e4 into apache:master Mar 16, 2023
metesynnada pushed a commit to synnada-ai/arrow-rs that referenced this pull request Mar 16, 2023
* Add multiply_decimal.

* Fix scale for multiple value

* Fix

* Add doc

* Update

* Rename to mul_fixed_point_checked

* For review

* More

* Update arrow-arith/src/arithmetic.rs (nine such commits)

* Fix

Co-authored-by: Raphael Taylor-Davies <[email protected]>
Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues: Allow precision loss on multiplying decimal arrays

4 participants