Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve DecimalArray API ergonomics: add iter(), FromIterator, with_precision_and_scale #1223

Merged
merged 11 commits into from
Feb 8, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 23, 2022

Which issue does this PR close?

Closes #1009

Closes #1083

Rationale for this change

It is somewhat awkward to create DecimalArray as well as iterate through their values.

This leads to (repeated) code like create_decimal_array https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+create_decimal_array&patternType=literal

    fn create_decimal_array(
        array: &[Option<i128>],
        precision: usize,
        scale: usize,
    ) -> DecimalArray {
        let mut decimal_builder = DecimalBuilder::new(array.len(), precision, scale);
        for value in array {
            match value {
                None => {
                    decimal_builder.append_null().unwrap();
                }
                Some(v) => {
                    decimal_builder.append_value(*v).unwrap();
                }
            }
        }
        decimal_builder.finish()
    }

(there is similar repetition in datafusion)

There are also more bounds checks than necessary (given DecimalArray::value() checks on each access

What changes are included in this PR?

  1. Add DecimalArray::from and DecimalArray::from_iter_values (mirroring PrimitiveArray)
  2. Add DecimalArray::into_iter() and DecimalArray::iter() for iterating through values
  3. Add DecimalArray::with_precision_and_scale() for changing the relevant precision and scale, and validate precision
  4. Add documentation
  5. Refactor some existing code to show how the new APIs can be used

Are there any user-facing changes?

  1. Nicer APIs for creating and working with DecimalArrays
  2. Constants for working with Decimal values are moved to datatype

Follow on PRs:

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 23, 2022
Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @liukun4515 I coded this one up to try and improve the code for working with DecimalArrays (and I wanted to write some code this weekend rather than just reviewing code!)

arrow/src/array/array_binary.rs Outdated Show resolved Hide resolved
/// builder.append_null().unwrap();
/// builder.append_value(-8_887_000_000).unwrap();
/// let decimal_array: DecimalArray = builder.finish();
/// // Create a DecimalArray with the default precision and scale
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an example of how the new API workrs -- I think it is easier to understand and use

}

#[test]
fn test_decimal_iter() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also included a basic iter() and into_iter() functions -- while there is room for performance improvement I think the API is solid

@codecov-commenter
Copy link

codecov-commenter commented Jan 23, 2022

Codecov Report

Merging #1223 (cd83d94) into master (c7e36ea) will increase coverage by 0.01%.
The diff coverage is 91.78%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1223      +/-   ##
==========================================
+ Coverage   83.02%   83.04%   +0.01%     
==========================================
  Files         180      180              
  Lines       52289    52418     +129     
==========================================
+ Hits        43413    43529     +116     
- Misses       8876     8889      +13     
Impacted Files Coverage Δ
arrow/src/array/mod.rs 100.00% <ø> (ø)
arrow/src/datatypes/datatype.rs 66.40% <66.66%> (+0.01%) ⬆️
arrow/src/array/array_binary.rs 93.53% <93.16%> (-0.12%) ⬇️
arrow/src/array/builder.rs 86.73% <100.00%> (-0.04%) ⬇️
arrow/src/array/data.rs 83.13% <100.00%> (+0.10%) ⬆️
arrow/src/compute/kernels/cast.rs 95.21% <100.00%> (+<0.01%) ⬆️
arrow/src/csv/reader.rs 88.12% <100.00%> (ø)
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7e36ea...cd83d94. Read the comment docs.

/// The default precision and scale used when not specified
pub fn default_type() -> DataType {
// Keep maximum precision
DataType::Decimal(38, 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can replace 38 to DECIMAL_MAX_PRECISION, and 10 to OTHER_DEFAULT_VALUE.

@@ -189,6 +194,12 @@ impl fmt::Display for DataType {
}
}

/// The maximum precision for [DataType::Decimal] values
pub const DECIMAL_MAX_PRECISION: usize = 38;
Copy link
Contributor

@liukun4515 liukun4515 Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the datafusion, we use
https://github.com/apache/arrow-datafusion/blob/e92225d6f4660363eec3b1e767a188fccebb7ed9/datafusion/src/scalar.rs#L39 .
Maybe we can replace the const in the datafusion.
If we have the plan to support the decimal256, we may need to use some labels to identify diff decimal type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should remove the duplication in DataFusion -- when this PR is merged, I can file a ticket in DataFusion to upgrade to the new DecimalAPI and remove the duplication

@liukun4515
Copy link
Contributor

liukun4515 commented Jan 24, 2022

Do you want to refactor some codes like

fn create_decimal_array(
in this pull request?

If you have the plan to refactor this code, I will review this later.

LGTM, expect the #1223 (comment)

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2022

Do you want to refactor some codes like ... in this pull request?

I was planning to do that in a follow on PR to keep this PR small and focused on the API changes.

So I will plan to make the changes you suggest in this PR, and then prepare a new one that refactors the rest of the code.

Thank you for the review.

@alamb
Copy link
Contributor Author

alamb commented Jan 26, 2022

Update: I have not forgotten about this PR but I have been busy with other tasks this week. If I don't get to it later this week I'll work on it over the weekend

@alamb
Copy link
Contributor Author

alamb commented Jan 29, 2022

🤔 while working to port some of the code over, it turns out that DecimalBuilder also does value validation on each value (so it is certain that the decimal values are within correct range. I need to think about this 🤔

@alamb alamb marked this pull request as draft January 29, 2022 20:10
@alamb
Copy link
Contributor Author

alamb commented Jan 29, 2022

back to draft while I think about this a bit

@alamb alamb marked this pull request as ready for review January 30, 2022 12:41
@alamb
Copy link
Contributor Author

alamb commented Jan 30, 2022

This PR is now ready for (re) review @liukun4515

cc @sweb as you contributed the initial DecimalArray implementation (thanks again!)

You can see examples of how the API is used in #1247 and #1249

@@ -1153,87 +1153,6 @@ pub struct FixedSizeBinaryBuilder {
builder: FixedSizeListBuilder<UInt8Builder>,
}

pub const MAX_DECIMAL_FOR_EACH_PRECISION: [i128; 38] = [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to datatype

)));
}
Ok(result)
validate_decimal_precision(result, precision)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consolidated the bounds checking (so that I could also use it in with_precision_and_scale)

@alamb
Copy link
Contributor Author

alamb commented Feb 3, 2022

FYI @liukun4515 this is ready for review. I know you away for a few days so no worries.

I plan to wait for a review prior to merging this PR; If it gets reviewed by tomorrow's cutoff for arrow 9.0.0 I'll include it there, otherwise this can wait for 2 weeks and perhaps be included in arrow 10.0.0

@liukun4515
Copy link
Contributor

FYI @liukun4515 this is ready for review. I know you away for a few days so no worries.

I plan to wait for a review prior to merging this PR; If it gets reviewed by tomorrow's cutoff for arrow 9.0.0 I'll include it there, otherwise this can wait for 2 weeks and perhaps be included in arrow 10.0.0

I will review it later today.

@@ -1153,87 +1153,6 @@ pub struct FixedSizeBinaryBuilder {
builder: FixedSizeListBuilder<UInt8Builder>,
}

pub const MAX_DECIMAL_FOR_EACH_PRECISION: [i128; 38] = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is an API breaks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 'api-change' label

@@ -2297,7 +2298,7 @@ mod tests {
let array = Arc::new(array) as ArrayRef;
let casted_array = cast(&array, &DataType::Decimal(3, 1));
assert!(casted_array.is_err());
assert_eq!("Invalid argument error: The value of 1000 i128 is not compatible with Decimal(3,1)", casted_array.unwrap_err().to_string());
assert_eq!("Invalid argument error: 1000 is too large to store in a Decimal of precision 3. Max is 999", casted_array.unwrap_err().to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice error message.

assert_eq!(actual, expected);
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good tests.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Sorry for the delayed review.

@alamb
Copy link
Contributor Author

alamb commented Feb 8, 2022

No problem -- thanks for the review @liukun4515

@alamb alamb merged commit 35e16be into apache:master Feb 8, 2022
@alamb alamb added the api-change Changes to the arrow API label Feb 8, 2022
@alamb alamb deleted the alamb/decimal_creation branch February 8, 2022 20:10
@alamb
Copy link
Contributor Author

alamb commented Feb 8, 2022

Thanks again for the help @liukun4515

@alamb alamb removed the api-change Changes to the arrow API label Feb 16, 2022
@alamb alamb changed the title DecimalArray API ergonomics: add iter(), create from iter(), change precision / scale Improve DecimalArray API ergonomics: add iter(), FromIterator, change precision / scale Feb 16, 2022
@alamb alamb changed the title Improve DecimalArray API ergonomics: add iter(), FromIterator, change precision / scale Improve DecimalArray API ergonomics: add iter(), FromIterator, with_precision_and_scale Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add into iter for decimal array Add creator from Iterator of i128 to get the decimalarray
3 participants