Optimize the validation of `Decimal256` #2320

liukun4515 · 2022-08-04T11:53:52Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When arrow construct a decimal array, we need to verify the data of decimal with the precision.

From current implemention

arrow-rs/arrow/src/datatypes/datatype.rs

Line 456 in d87f6a4

    
           pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Result<i128> {

and

arrow-rs/arrow/src/datatypes/datatype.rs

Line 485 in d87f6a4

pub(crate) fn validate_decimal256_precision(

We compare the decimal using i128 or string, it may be inefficient.

optimize the validation using the byte array instead of i128 or string.

Describe the solution you'd like

For exmaple, precision 3, The max value is i128.to_le_byte(999), and the min value if i128.to_le_byte(-999).
Comparing the decimal128(value,precision,scale) just need to compare the value ([u8;16]) with the min and max.

Describe alternatives you've considered

Additional context

liukun4515 · 2022-08-04T11:54:47Z

This also can speed up the construction of decimal arrays like #2313

liukun4515 · 2022-08-04T11:55:27Z

FYI @viirya @alamb

I will file a pr tomorrow to do this.

liukun4515 · 2022-08-04T12:03:26Z

Maybe we should add iter<Option<[u8]>> for decimal array and reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

liukun4515 · 2022-08-04T12:27:47Z

Maybe we should add iter<Option<[u8]>> for decimal array and reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

Also just iterate the array like value_unchecked to get the raw_data

alamb · 2022-08-04T20:01:05Z

I think it is important to:

ensure that some sort of validate is called for every insertion of a decimal value into a Decimal128Array or Decimal256Array
ensure that the validation is only done once per value
Allow some way to skip this validation if the user wishes (e.g. ArrayData::new_unchecked https://docs.rs/arrow/19.0.0/arrow/array/struct.ArrayData.html#method.new_unchecked)

Performance wise I haven't checked if we are validating things twice

liukun4515 · 2022-08-05T02:10:12Z

also add the benchmark result of this function.

tustvold · 2022-08-06T09:58:18Z

I find it unexpected that a variable length byte array outperforms a fixed-length i128?

reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

I think it would help to take a step back here, as I at least am very confused as to what is being validated when. For most array types we have an invariant that ArrayData only contains valid data, this means we can elide validation on access. This does not appear to be the case for DecimalArray, which I personally think is a bug.

For example, BasicDecimal::new is not unsafe, BasicDecimalArray::from_fixed_size_binary_array doesn't perform validation, etc... If we can make there an invariant of valid data, and consistently enforce this, we can then look to elide validation checks on access

liukun4515 · 2022-08-06T11:38:41Z

I find it unexpected that a variable length byte array outperforms a fixed-length i128?

reduce construct the decimal128 and decimal258 which contain the useless precision/scale for validation.

@tustvold
It's not variable length byte array.
I think should be a fixed length byte array, for example decimal128 is [u8;16], decimal256 is [u8;32].

Have you benched these two implementation?

In current logic, we should get the byte array and convert the byte array to i128 or bigint/string first, and then compare with the bound min/max value.

I guess we can just compare the fixed length byte array with the bound min/max with the format of fixed length byte array.

liukun4515 · 2022-08-06T12:24:43Z

I think it would help to take a step back here, as I at least am very confused as to what is being validated when. For most array types we have an invariant that ArrayData only contains valid data, this means we can elide validation on access. This does not appear to be the case for DecimalArray, which I personally think is a bug.

We can get the decimal array from decimal array builder.
When appending a decimal value to the builder, we need to verify it.

But in the arrow-rs, I find some case in which we can skip the validation. For example, reading decimal value from parquet file(INT32,INT64, OR byte array), taking data from decimal array. We can refile these cases.

viirya · 2022-08-06T17:37:48Z

Currently we only do decimal validation where:

Changing precision/scale by with_precision_scale. It is also selective since it does validation only new precision is smaller.
Appending value to decimal builders
Performing full validation on ArrayData
Reading CSV file

I think it is somehow selective right now. Actually we do decimal validation less than necessary, as mentioned above from_fixed_size_binary_array doesn't validate input values. We just assume that the inputs are all valid if users use such API do build decimal array.

liukun4515 · 2022-08-08T03:11:49Z

Currently we only do decimal validation where:

Changing precision/scale by with_precision_scale. It is also selective since it does validation only new precision is smaller.

When we create the decimal array from iter<Option>, and need to do validation. Because the default precision is the max precision(38).

Appending value to decimal builders

Performing full validation on ArrayData

Reading CSV file

I think it is somehow selective right now. Actually we do decimal validation less than necessary, as mentioned above from_fixed_size_binary_array doesn't validate input values. We just assume that the inputs are all valid if users use such API do build decimal array.

HaoYang670 · 2022-08-12T03:52:52Z

Maybe we'd better update the title if we don't update the decimal128 validation.

alamb · 2022-08-12T17:04:16Z

Title is updated

liukun4515 added the enhancement Any new improvement worthy of a entry in the changelog label Aug 4, 2022

liukun4515 self-assigned this Aug 4, 2022

liukun4515 mentioned this issue Aug 6, 2022

Do we need to public the MAX_DECIMAL_FOR_LARGER_PRECISION and MIN_DECIMAL_FOR_EACH_PRECISION #2343

Closed

liukun4515 mentioned this issue Aug 8, 2022

Speed up Decimal256 validation based on bytes comparison and add benchmark test #2360

Merged

liukun4515 closed this as completed in #2360 Aug 12, 2022

alamb changed the title ~~optimize the validation of decimal: compare data using the byte array instead of i128 or string~~ optimize the validation of Decimal256: compare data using the byte array instead of i128 or string Aug 12, 2022

alamb changed the title ~~optimize the validation of Decimal256: compare data using the byte array instead of i128 or string~~ optimize the validation of Decimal256: compare data using the byte array instead of string Aug 12, 2022

alamb changed the title ~~optimize the validation of Decimal256: compare data using the byte array instead of string~~ Optimize the validation of Decimal256: compare data using the byte array instead of string Aug 17, 2022

alamb added the arrow Changes to the arrow crate label Aug 17, 2022

alamb changed the title ~~Optimize the validation of Decimal256: compare data using the byte array instead of string~~ Optimize the validation of Decimal256 Aug 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize the validation of `Decimal256` #2320

Optimize the validation of `Decimal256` #2320

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

alamb commented Aug 4, 2022

liukun4515 commented Aug 5, 2022

tustvold commented Aug 6, 2022 •

edited

Loading

liukun4515 commented Aug 6, 2022 •

edited

Loading

liukun4515 commented Aug 6, 2022 •

edited

Loading

viirya commented Aug 6, 2022

liukun4515 commented Aug 8, 2022

HaoYang670 commented Aug 12, 2022

alamb commented Aug 12, 2022

Optimize the validation of Decimal256 #2320

Optimize the validation of Decimal256 #2320

Comments

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

liukun4515 commented Aug 4, 2022

alamb commented Aug 4, 2022

liukun4515 commented Aug 5, 2022

tustvold commented Aug 6, 2022 • edited Loading

liukun4515 commented Aug 6, 2022 • edited Loading

liukun4515 commented Aug 6, 2022 • edited Loading

viirya commented Aug 6, 2022

liukun4515 commented Aug 8, 2022

HaoYang670 commented Aug 12, 2022

alamb commented Aug 12, 2022

Optimize the validation of `Decimal256` #2320

Optimize the validation of `Decimal256` #2320

tustvold commented Aug 6, 2022 •

edited

Loading

liukun4515 commented Aug 6, 2022 •

edited

Loading

liukun4515 commented Aug 6, 2022 •

edited

Loading