Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Decimal32/Decimal64 support #683

Merged
merged 22 commits into from
Nov 20, 2024
Merged

feat: add Decimal32/Decimal64 support #683

merged 22 commits into from
Nov 20, 2024

Conversation

zeroshade
Copy link
Member

Initial implementation of Decimal32/Decimal64 support in nanoarrow.

Tests will be added in a bit, but I figured I'd get a draft up so people can take a look in the meantime.

@zeroshade zeroshade marked this pull request as ready for review November 19, 2024 17:03
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking this on! Just one preliminary note about avoiding n_words == 0 and a suggestions about the tests 🙂

(I think you've seen it, but a heads up that a PR just merged making Arrow C++ optional!)

@@ -916,13 +922,14 @@ static inline void ArrowDecimalInit(struct ArrowDecimal* decimal, int32_t bitwid
memset(decimal->words, 0, sizeof(decimal->words));
decimal->precision = precision;
decimal->scale = scale;
// n_words will be 0 for bitwidth == 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to avoid the n_words == 0 situation? (Or could it be explained in a comment somewhere how it works?).

Alternatively, could we special-case the bit in ArrowArrayViewGetDecimalUnsafe() to populate the ArrowDecimal as if it were a 64-bit decimal by copying the appropriate bytes? Or update the words to be smaller (seems hard)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely add an explanation in a comment for it, but I don't know how to avoid the n_words == 0 situation without updating words to be smaller, personally. I can do that if you want, but I agree it would be hard given how often it is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow C++ still operates against 64 bit words with decimals right? Guessing the 32 bit implementation upstream still needs to be worked out?

https://github.com/apache/arrow/blob/cd3321b28b0c9703e5d7105d6146c1270bbadd7f/cpp/src/arrow/util/decimal.cc#L527

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow C++ utilizes templates and other techniques to represent Decimal32 as a 32-bit integer while using 64-bit words for everything else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!

Can you add a note to the documentation for n_words explaining that a value of 0 is special-cased for 32-bit decimals?

/// \brief An array of 64-bit integers of n_words length defined in native-endian order
uint64_t words[4];

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note for the documentation of both n_words and words to indicate the behavior for 32-bit decimal values.

@@ -3821,6 +3901,80 @@ TEST(ArrayViewTest, ArrayViewTestGetIntervalMonthDayNano) {
ArrowArrayRelease(&array);
}

TEST(ArrayViewTest, ArrayViewTestGetDecimal32) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we can avoid some of the copy/paste here by something like:

void TestArrayViewGetDecimal(ArrowType type, int precision, int scale) { ... }
void TestArrayViewDecimalArrowRoundtrip(ArrowType type, int precision, int scale, BuilderT builder) { ... }

TEST(ArrayViewTest, ArrayViewTestGetDecimal32) {
  TestArrayViewGetDecimal(...);
#if defined(BUILD_TESTS_WITH_ARROW)
  TestArrayViewDecimalArrowRoundtrip();
#endif
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll do this in a follow-up (not your fault I didn't know how to use GTest properly when I wrote the first two!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I can try to figure out something now if you prefer. But i would definitely prefer it as a follow-up lol

@zeroshade
Copy link
Member Author

@paleolimbot Any idea why the R build is failing?

@paleolimbot
Copy link
Member

My guess is that you inserted the type ids in the middle of the list, which broke:

arrow-nanoarrow/r/R/type.R

Lines 433 to 479 in 2930787

# These values aren't guaranteed to stay stable between nanoarrow versions,
# so we keep them internal but use them in these functions to simplify the
# number of C functions we need to build all the types.
NANOARROW_TYPE <- list(
UNINITIALIZED = 0,
"NA" = 1L,
BOOL = 2L,
UINT8 = 3L,
INT8 = 4L,
UINT16 = 5L,
INT16 = 6L,
UINT32 = 7L,
INT32 = 8L,
UINT64 = 9L,
INT64 = 10L,
HALF_FLOAT = 11L,
FLOAT = 12L,
DOUBLE = 13L,
STRING = 14L,
BINARY = 15L,
FIXED_SIZE_BINARY = 16L,
DATE32 = 17L,
DATE64 = 18L,
TIMESTAMP = 19L,
TIME32 = 20L,
TIME64 = 21L,
INTERVAL_MONTHS = 22L,
INTERVAL_DAY_TIME = 23L,
DECIMAL128 = 24L,
DECIMAL256 = 25L,
LIST = 26L,
STRUCT = 27L,
SPARSE_UNION = 28L,
DENSE_UNION = 29L,
DICTIONARY = 30L,
MAP = 31L,
EXTENSION = 32L,
FIXED_SIZE_LIST = 33L,
DURATION = 34L,
LARGE_STRING = 35L,
LARGE_BINARY = 36L,
LARGE_LIST = 37L,
INTERVAL_MONTH_DAY_NANO = 38L,
RUN_END_ENCODED = 39L,
BINARY_VIEW = 40L,
STRING_VIEW = 41L
)

=While we don't guarantee the stability of those values (i.e., we could just update the values in the .R file), it's probably safer to add new types to the end of the enum.

@paleolimbot
Copy link
Member

Also a note that you should be able to check big endian locally by running:

export NANOAROW_ARCH=s390x
docker compose run verify

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Just one note about the documentation for ArrowDecimal::n_words if you don't mind updating that before merging 🙂

@@ -3821,6 +3901,80 @@ TEST(ArrayViewTest, ArrayViewTestGetIntervalMonthDayNano) {
ArrowArrayRelease(&array);
}

TEST(ArrayViewTest, ArrayViewTestGetDecimal32) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll do this in a follow-up (not your fault I didn't know how to use GTest properly when I wrote the first two!)

@@ -916,13 +922,14 @@ static inline void ArrowDecimalInit(struct ArrowDecimal* decimal, int32_t bitwid
memset(decimal->words, 0, sizeof(decimal->words));
decimal->precision = precision;
decimal->scale = scale;
// n_words will be 0 for bitwidth == 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!

Can you add a note to the documentation for n_words explaining that a value of 0 is special-cased for 32-bit decimals?

/// \brief An array of 64-bit integers of n_words length defined in native-endian order
uint64_t words[4];

@paleolimbot paleolimbot merged commit e54b7df into main Nov 20, 2024
58 checks passed
@zeroshade zeroshade deleted the decimal32-decimal64 branch November 20, 2024 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants