
Support decimal int32/64 for writer #3431

Merged
merged 11 commits into apache:master on Jan 11, 2023

Conversation

liukun4515
Contributor

Which issue does this PR close?

Closes #205

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 3, 2023
@liukun4515
Contributor Author

cc @alamb

Comment on lines 165 to 167
.iter()
.map(|v| v.map(|v| v as i32))
.collect::<Int32Array>();

Suggested change
.iter()
.map(|v| v.map(|v| v as i32))
.collect::<Int32Array>();
.unary::<_, Int32Array>(|v| v as i32);

Comment on lines 175 to 177
.iter()
.map(|v| v.map(|v| v as i64))
.collect::<Int64Array>();

Suggested change
.iter()
.map(|v| v.map(|v| v as i64))
.collect::<Int64Array>();
.unary::<_, Int64Array>(|v| v as i64);

@tustvold tustvold added the api-change Changes to the arrow API label Jan 3, 2023
@tustvold
Contributor

tustvold commented Jan 3, 2023

What is the ecosystem support for this like? Do all arrow implementations understand how to convert to a decimal128 from i32 or i64? Just wondering if we need to put this behind an optional flag?

@liukun4515
Contributor Author

What is the ecosystem support for this like? Do all arrow implementations understand how to convert to a decimal128 from i32 or i64? Just wondering if we need to put this behind an optional flag?

This implementation only covers the writer path for decimal data in parquet files, and it
follows the definition in the parquet format: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

DECIMAL can be used to annotate the following types:

int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning
fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
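The precision limits quoted above can be sketched as a small chooser function (illustrative only — the enum and function names are hypothetical, not this PR's code):

```rust
// Hypothetical helper mirroring the parquet spec's DECIMAL annotation limits.
#[derive(Debug, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
    FixedLenByteArray(usize), // length n in bytes
}

fn decimal_physical_type(precision: u32) -> PhysicalType {
    match precision {
        1..=9 => PhysicalType::Int32,   // int32: 1 <= precision <= 9
        10..=18 => PhysicalType::Int64, // int64: 1 <= precision <= 18
        p => {
            // fixed_len_byte_array: n bytes hold floor(log_10(2^(8n - 1) - 1))
            // base-10 digits; find the smallest n that fits p digits.
            let mut n = 1usize;
            while (((8 * n - 1) as f64) * 2f64.log10()).floor() < p as f64 {
                n += 1;
            }
            PhysicalType::FixedLenByteArray(n)
        }
    }
}
```

For example, precision 38 (the Decimal128 maximum) resolves to a 16-byte fixed_len_byte_array, which matches what the C++ writer always emits.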

Do all arrow implementations understand how to convert to a decimal128 from i32 or i64?

I think data in the arrow ecosystem is exchanged between different languages via the IPC format, e.g. rust -> java or c++ -> rust.

cc @tustvold

@liukun4515
Contributor Author

I also found the schema mapping in the Java parquet-mr project: https://github.com/apache/parquet-mr/blob/433de8df33fcf31927f7b51456be9f53e64d48b9/parquet-arrow/src/main/java/org/apache/parquet/arrow/schema/SchemaConverter.java#L227, and it supports mapping the arrow decimal type to the INT32/INT64 parquet physical types.

But the C++ arrow implementation (https://arrow.apache.org/docs/cpp/parquet.html#logical-types) has some notes about arrow decimals:



Logical type | Physical type(s) | Arrow type | Notes
DECIMAL | INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY | Decimal128 / Decimal256 | (2)

(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.

@tustvold
Contributor

tustvold commented Jan 4, 2023

I think the data in the arrow ecosystem is exchanged by IPC format

Sometimes, but an important property is that data written by one implementation to CSV, Parquet, or whatever can be read by another

To phrase my concern differently, decimals are a relatively esoteric type, with most arrow implementations having limited support. I worry with this PR we will now write decimal data in such a way arrow implementations that used to understand it, now won't.

Can you confirm pyarrow at least can correctly read the data written by this PR?

@liukun4515
Contributor Author

I think the data in the arrow ecosystem is exchanged by IPC format

Sometimes, but an important property is that data written by one implementation to CSV, Parquet, or whatever can be read by another

Why is this related to other file formats?
This change only enhances writing for the parquet file format; it does not impact CSV or other file formats.

To phrase my concern differently, decimals are a relatively esoteric type, with most arrow implementations having limited support. I worry with this PR we will now write decimal data in such a way arrow implementations that used to understand it, now won't.

Can you confirm pyarrow at least can correctly read the data written by this PR?

From https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc#L1227, the C++ implementation supports reading decimal data from INT32/INT64, but it does not support writing decimals using the INT32/INT64 parquet physical types (https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L811). This is consistent with the notes in the C++ arrow parquet documentation:

DECIMAL | INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY | Decimal128 / Decimal256 | (2)

(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.

The writing path in Go is the same as in C++:
go: https://github.com/apache/arrow/blob/master/go/parquet/pqarrow/schema.go#L303
But I can't find the writing path for pyarrow. @tustvold

All languages do support reading decimals from the INT32/INT64/FIXED_LEN_BYTE_ARRAY/BYTE_ARRAY physical types in parquet files.

@liukun4515
Contributor Author

If we need to follow the other arrow implementations, we should close this PR.
Thanks @tustvold
But I want to know why we don't implement writing decimals as INT32 or INT64 in parquet files; maybe it is a matter of history. I will send an email about this to the dev mailing list.

@tustvold
Contributor

tustvold commented Jan 4, 2023

I will send an email about this to the dev mailing list.

I would be very interested in why arrow C++ doesn't write to Int32/Int64

Contributor

@alamb alamb left a comment

So what is this PR waiting on? Some demonstration that parquet files written with decimal and smaller field width can be read by some other parquet implementation?

@alamb
Contributor

alamb commented Jan 4, 2023

Looks like @nevi-me filed the original ticket -- I wonder if he has any additional context?

@liukun4515
Contributor Author

So what is this PR waiting on? Some demonstration that parquet files written with decimal and smaller field width can be read by some other parquet implementation?

The Go and C++ arrow implementations can read decimals from the INT32/INT64 parquet physical types.

I have sent an email to discuss the C++ behaviour and the notes in its documentation, but so far I have not received a response.

Contributor

@tustvold tustvold left a comment

I think let's move forward with this, and we can reassess if we hear anything back

{
match column.data_type() {
// if the arrow data type is decimal
ArrowDataType::Decimal128(_, _) => {
Contributor

@tustvold tustvold Jan 5, 2023

I think this logic should be moved to write_leaf, where we have other coercion logic; this will also be simpler.

@nevi-me
Contributor

nevi-me commented Jan 5, 2023

Looks like @nevi-me filed the original ticket -- I wonder if he has any additional context?

Hey @alamb @tustvold

My recollection is that the idea was that if the precision is low enough, Parquet can write the data to the i32 or i64 physical types, resulting in smaller files (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).

There are 2 scenarios:

  • When writing from Arrow > Parquet, should the behaviour be to use the smallest possible physical type based on precision?
  • When reading from Parquet > Arrow, I suppose Decimal128 makes sense.
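To illustrate the file-size motivation behind the first scenario: the rough per-value width in PLAIN encoding, assuming the writer picks the smallest physical type the precision allows (a sketch with a hypothetical helper, ignoring encodings and compression):

```rust
// Hypothetical: bytes per value in PLAIN encoding for each physical type a
// decimal could be annotated on, versus the 16-byte FIXED_LEN_BYTE_ARRAY a
// Decimal128 would otherwise occupy.
fn plain_width_bytes(precision: u32) -> usize {
    match precision {
        1..=9 => 4,   // INT32
        10..=18 => 8, // INT64
        _ => 16,      // FIXED_LEN_BYTE_ARRAY(16) for Decimal128
    }
}
```

So a precision-7 column stores 4 bytes per value instead of 16 — a 4x reduction before any encoding is applied.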

@liukun4515
Contributor Author

I think let's move forward with this, and we can reassess if we hear anything back

From the feedback on the email thread, C++ arrow has a plan to support this: apache/arrow#15239

Can we move forward with this PR?
@tustvold @alamb

.with_scale(*scale as i32)
.build()
}
DataType::Decimal256(precision, scale) => {
Contributor Author

For now we ignore Decimal256

@liukun4515
Contributor Author

PTAL @tustvold @alamb

Contributor

@tustvold tustvold left a comment

Just a minor nit, thank you for sticking with this

@@ -435,6 +444,15 @@ fn write_leaf(
let array: &[i64] = data.buffers()[0].typed_data();
write_primitive(typed, &array[offset..offset + data.len()], levels)?
}
ArrowDataType::Decimal128(_, _) => {
// use the int32 to represent the decimal with low precision
let array = column
Contributor

You could consider using as_primitive_array here
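The narrowing itself is lossless for these precisions — a 9-digit unscaled value has magnitude at most 999_999_999, which is below i32::MAX (2_147_483_647). A dependency-free sketch of the conversion (not the PR's exact code; the helper name is hypothetical):

```rust
// Sketch: narrow Decimal128 backing values (i128) to i32 for a column whose
// precision <= 9, where every unscaled value is guaranteed to fit.
fn narrow_decimal_to_i32(values: &[i128]) -> Vec<i32> {
    values.iter().map(|&v| v as i32).collect()
}
```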

@liukun4515 liukun4515 merged commit ccb80e8 into apache:master Jan 11, 2023
@ursabot

ursabot commented Jan 11, 2023

Benchmark runs are scheduled for baseline = a8276c0 and contender = ccb80e8. ccb80e8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alippai
Contributor

alippai commented Jan 11, 2023

Is this backwards compatible for datasets? E.g. if 2022.parquet was written the old way and 2023.parquet uses the new physical type, can I query/read both with the mixed physical types?

@tustvold
Contributor

Yes, this should be transparent to users
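For context on why this is transparent: both old and new files carry the same DECIMAL logical annotation, and readers widen whichever physical type they find back to the arrow decimal type. A sketch of the read-side widening (hypothetical helper, not arrow-rs's actual code):

```rust
// Sketch: values stored as INT32 widen losslessly to the i128 that backs a
// Decimal128 array; precision and scale come from the logical annotation,
// not from the physical type, so mixed files surface an identical schema.
fn widen_i32_to_i128(values: &[i32]) -> Vec<i128> {
    values.iter().map(|&v| v as i128).collect()
}
```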

@alippai
Contributor

alippai commented Jan 11, 2023

I was asking because of the hybrid schema. It’s good that it depends on the logical schema instead of the physical types 🎉
Thanks for the swift reply

Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Write lower precision Arrow decimal to int32/64
6 participants