
Support decimal int32/64 for writer #3431

Merged
merged 11 commits into apache:master on Jan 11, 2023

Conversation

liukun4515
Contributor

Which issue does this PR close?

Closes #205

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 3, 2023
@liukun4515
Contributor Author

cc @alamb

Comment on lines 165 to 167
.iter()
.map(|v| v.map(|v| v as i32))
.collect::<Int32Array>();

Suggested change
.iter()
.map(|v| v.map(|v| v as i32))
.collect::<Int32Array>();
.unary::<_, Int32Array>(|v| v as i32);

Comment on lines 175 to 177
.iter()
.map(|v| v.map(|v| v as i64))
.collect::<Int64Array>();

Suggested change
.iter()
.map(|v| v.map(|v| v as i64))
.collect::<Int64Array>();
.unary::<_, Int64Array>(|v| v as i64);

@tustvold tustvold added the api-change Changes to the arrow API label Jan 3, 2023
@tustvold
Contributor

tustvold commented Jan 3, 2023

What is the ecosystem support for this like? Do all arrow implementations understand how to convert to a decimal128 from i32 or i64? Just wondering if we need to put this behind an optional flag?

@liukun4515
Contributor Author

What is the ecosystem support for this like? Do all arrow implementations understand how to convert to a decimal128 from i32 or i64? Just wondering if we need to put this behind an optional flag?

This implementation only covers the writer path for decimal data in parquet files, and it
follows the definition in the parquet format: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

DECIMAL can be used to annotate the following types:

int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
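The precision limits quoted above can be sketched as a small chooser function (illustrative only — the enum and function names are hypothetical, not this PR's code):

```rust
// Hypothetical helper mirroring the parquet spec's DECIMAL annotation limits.
#[derive(Debug, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
    FixedLenByteArray(usize), // length n in bytes
}

fn decimal_physical_type(precision: u32) -> PhysicalType {
    match precision {
        1..=9 => PhysicalType::Int32,   // int32: 1 <= precision <= 9
        10..=18 => PhysicalType::Int64, // int64: 1 <= precision <= 18
        p => {
            // fixed_len_byte_array: n bytes hold floor(log_10(2^(8n - 1) - 1))
            // base-10 digits; find the smallest n that fits p digits.
            let mut n = 1usize;
            while (((8 * n - 1) as f64) * 2f64.log10()).floor() < p as f64 {
                n += 1;
            }
            PhysicalType::FixedLenByteArray(n)
        }
    }
}
```

For example, precision 38 (the Decimal128 maximum) resolves to a 16-byte fixed_len_byte_array, which matches what the C++ writer always emits.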

Do all arrow implementations understand how to convert to a decimal128 from i32 or i64?

I think data in the arrow ecosystem is exchanged between different languages via the IPC format, e.g. rust -> java or c++ -> rust.

cc @tustvold

@liukun4515
Contributor Author

I also found the schema mapping in the Java parquet-mr project: https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java#L227, and it supports mapping the arrow decimal type to the INT32/INT64 parquet physical types.

But the C++ arrow implementation (https://arrow.apache.org/docs/cpp/parquet.html#logical-types) has some notes about arrow decimals:



Logical type | Physical type(s) | Arrow type | Notes
DECIMAL | INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY | Decimal128 / Decimal256 | (2)

(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.

@tustvold
Contributor

tustvold commented Jan 4, 2023

I think the data in the arrow ecosystem is exchanged by IPC format

Sometimes, but an important property is that data written by one implementation to CSV, Parquet, or whatever can be read by another

To phrase my concern differently, decimals are a relatively esoteric type, with most arrow implementations having limited support. I worry with this PR we will now write decimal data in such a way arrow implementations that used to understand it, now won't.

Can you confirm pyarrow at least can correctly read the data written by this PR?

@liukun4515
Contributor Author

I think the data in the arrow ecosystem is exchanged by IPC format

Sometimes, but an important property is that data written by one implementation to CSV, Parquet, or whatever can be read by another

Why is this related to other file formats?
This change only enhances writing for the parquet file format; it does not impact CSV or other file formats.

To phrase my concern differently, decimals are a relatively esoteric type, with most arrow implementations having limited support. I worry with this PR we will now write decimal data in such a way arrow implementations that used to understand it, now won't.

Can you confirm pyarrow at least can correctly read the data written by this PR?

From https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc#L1227, the C++ implementation supports reading decimal data from INT32/INT64, but it does not support writing decimals using the INT32/INT64 parquet physical types (https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L811). This is consistent with the notes in the C++ arrow parquet documentation:

DECIMAL | INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY | Decimal128 / Decimal256 | (2)

(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.

The writing path in Go is the same as in C++:
go: https://github.com/apache/arrow/blob/master/go/parquet/pqarrow/schema.go#L303
But I can't find the writing path for pyarrow. @tustvold

All languages do support reading decimals from the INT32/INT64/FIXED_LEN_BYTE_ARRAY/BYTE_ARRAY physical types in parquet files.

@liukun4515
Contributor Author

If we need to follow the other arrow implementations, we should close this PR.
Thanks @tustvold
But I want to know why we don't implement writing decimals as INT32 or INT64 in parquet files; maybe it is a matter of history. I will send an email about this to the dev mailing list.

@tustvold
Contributor

tustvold commented Jan 4, 2023

I will send an email about this to the dev mailing list.

I would be very interested in why arrow C++ doesn't write to Int32/Int64

Contributor

@alamb alamb left a comment

So what is this PR waiting on? Some demonstration that parquet files written with decimal and smaller field width can be read by some other parquet implementation?

@alamb
Contributor

alamb commented Jan 4, 2023

Looks like @nevi-me filed the original ticket -- I wonder if he has any additional context?

@liukun4515
Contributor Author

So what is this PR waiting on? Some demonstration that parquet files written with decimal and smaller field width can be read by some other parquet implementation?

The Go and C++ arrow implementations can read decimals from the INT32/INT64 parquet physical types.

I have sent an email to discuss the C++ behaviour and the notes in its documentation, but so far I have not received a response.

Contributor

@tustvold tustvold left a comment

I think let's move forward with this, and we can reassess if we hear anything back

{
match column.data_type() {
// if the arrow data type is decimal
ArrowDataType::Decimal128(_, _) => {
Contributor

@tustvold tustvold Jan 5, 2023

I think this logic should be moved to write_leaf, where we have other coercion logic; this will also be simpler.

@nevi-me
Contributor

nevi-me commented Jan 5, 2023

Looks like @nevi-me filed the original ticket -- I wonder if he has any additional context?

Hey @alamb @tustvold

My recollection is that the idea was that if the precision is low enough, Parquet can write the data to the i32 or i64 physical types, resulting in smaller files (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).

There are 2 scenarios:

  • When writing from Arrow > Parquet, should the behaviour be to use the smallest possible physical type based on precision?
  • When reading from Parquet > Arrow, I suppose Decimal128 makes sense.
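To illustrate the file-size motivation behind the first scenario: the rough per-value width in PLAIN encoding, assuming the writer picks the smallest physical type the precision allows (a sketch with a hypothetical helper, ignoring encodings and compression):

```rust
// Hypothetical: bytes per value in PLAIN encoding for each physical type a
// decimal could be annotated on, versus the 16-byte FIXED_LEN_BYTE_ARRAY a
// Decimal128 would otherwise occupy.
fn plain_width_bytes(precision: u32) -> usize {
    match precision {
        1..=9 => 4,   // INT32
        10..=18 => 8, // INT64
        _ => 16,      // FIXED_LEN_BYTE_ARRAY(16) for Decimal128
    }
}
```

So a precision-7 column stores 4 bytes per value instead of 16 — a 4x reduction before any encoding is applied.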

@liukun4515
Contributor Author

I think let's move forward with this, and we can reassess if we hear anything back

From the feedback on the email thread, C++ arrow has a plan to support this: apache/arrow#15239

Can we move forward with this PR?
@tustvold @alamb

.with_scale(*scale as i32)
.build()
}
DataType::Decimal256(precision, scale) => {
Contributor Author

For now we ignore Decimal256

@liukun4515
Contributor Author

PTAL @tustvold @alamb

Contributor

@tustvold tustvold left a comment

Just a minor nit, thank you for sticking with this

@@ -435,6 +444,15 @@ fn write_leaf(
let array: &[i64] = data.buffers()[0].typed_data();
write_primitive(typed, &array[offset..offset + data.len()], levels)?
}
ArrowDataType::Decimal128(_, _) => {
// use the int32 to represent the decimal with low precision
let array = column
Contributor

You could consider using as_primitive_array here
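The narrowing itself is lossless for these precisions — a 9-digit unscaled value has magnitude at most 999_999_999, which is below i32::MAX (2_147_483_647). A dependency-free sketch of the conversion (not the PR's exact code; the helper name is hypothetical):

```rust
// Sketch: narrow Decimal128 backing values (i128) to i32 for a column whose
// precision <= 9, where every unscaled value is guaranteed to fit.
fn narrow_decimal_to_i32(values: &[i128]) -> Vec<i32> {
    values.iter().map(|&v| v as i32).collect()
}
```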

@liukun4515 liukun4515 merged commit ccb80e8 into apache:master Jan 11, 2023
@ursabot

ursabot commented Jan 11, 2023

Benchmark runs are scheduled for baseline = a8276c0 and contender = ccb80e8. ccb80e8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alippai
Contributor

alippai commented Jan 11, 2023

Is this backwards compatible for datasets? E.g. if 2022.parquet was written the old way and 2023.parquet uses the new physical type, can I query/read both with the mixed physical types?

@tustvold
Contributor

Yes, this should be transparent to users
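For context on why this is transparent: both old and new files carry the same DECIMAL logical annotation, and readers widen whichever physical type they find back to the arrow decimal type. A sketch of the read-side widening (hypothetical helper, not arrow-rs's actual code):

```rust
// Sketch: values stored as INT32 widen losslessly to the i128 that backs a
// Decimal128 array; precision and scale come from the logical annotation,
// not from the physical type, so mixed files surface an identical schema.
fn widen_i32_to_i128(values: &[i32]) -> Vec<i128> {
    values.iter().map(|&v| v as i128).collect()
}
```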

@alippai
Contributor

alippai commented Jan 11, 2023

I was asking because of the hybrid schema. It’s good that it depends on the logical schema instead of the physical types 🎉
Thanks for the swift reply

Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Write lower precision Arrow decimal to int32/64
6 participants