-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support decimal int32/64 for writer #3431
Support decimal int32/64 for writer #3431
Conversation
cc @alamb |
.iter() | ||
.map(|v| v.map(|v| v as i32)) | ||
.collect::<Int32Array>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.iter() | |
.map(|v| v.map(|v| v as i32)) | |
.collect::<Int32Array>(); | |
.unary::<_, Int32Array>(|v| v as i32); |
.iter() | ||
.map(|v| v.map(|v| v as i64)) | ||
.collect::<Int64Array>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.iter() | |
.map(|v| v.map(|v| v as i64)) | |
.collect::<Int64Array>(); | |
.unary::<_, Int64Array>(|v| v as i64); |
What is the ecosystem support for this like? Do all arrow implementations understand how to convert to a decimal128 from i32 or i64? Just wondering if we need to put this behind an optional flag? |
This implementation just contains the writer path for parquet file for decimal data, and
I think the data in the arrow ecosystem is exchanged by IPC format for the different language like cc @tustvold |
I also find the schema mapping in java version of the But find the implementation of c++ version arrow: https://arrow.apache.org/docs/cpp/parquet.html#logical-types, and find some notes about the arrow decimal:
|
Sometimes, but an important property is that data written by one implementation to CSV, Parquet, or whatever can be read by another To phrase my concern differently, decimals are a relatively esoteric type, with most arrow implementations having limited support. I worry with this PR we will now write decimal data in such a way arrow implementations that used to understand it, now won't. Can you confirm pyarrow at least can correctly read the data written by this PR? |
why is it related to other file format?
From https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc#L1227, the c++ support reading the decimal data from INT32/INT64, but c++ does not support writing decimal using the INT32/INT64 parquet physical type https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L811, this is consistent with the comments for the arrow writing parquet.
The writing path of go is same with the c++. But all languages support reading the decimal from INT32/INT64/FIXED_BYTE_ARRAY/BYTE_ARRAY from parquet file. |
If we need to following other language of arrow, we should close this pr. |
I would be very interested in why arrow C++ doesn't write to Int32/Int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what is this PR waiting on? Some demonstration that parquet files written with decimal and smaller field width can be read by some other parquet implementation?
Looks like @nevi-me filed the original ticket -- I wonder if he has any additional context? |
go/c++ arrow can read decimal from INT32/INT64 parquet physical type. I have filed a email to talk about this in C++ version and the notes in the document, but until now I have not got the response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think let's move forward with this, and we can reassess if we hear anything back
{ | ||
match column.data_type() { | ||
// if the arrow data type is decimal | ||
ArrowDataType::Decimal128(_, _) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this logic should be moved to write_leaf where we have other coercion logic, this will also be simpler
My recollection is that the idea was that if the precision is low enough, Parquet can write the file to i32 or i64 physicla types to result in smaller files (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal). There's 2 scenarios:
|
From the feedback of the email, the arrow of c++ has the plan to support this. apache/arrow#15239 |
.with_scale(*scale as i32) | ||
.build() | ||
} | ||
DataType::Decimal256(precision, scale) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we ignore the decimal 256
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor nit, thank you for sticking with this
@@ -435,6 +444,15 @@ fn write_leaf( | |||
let array: &[i64] = data.buffers()[0].typed_data(); | |||
write_primitive(typed, &array[offset..offset + data.len()], levels)? | |||
} | |||
ArrowDataType::Decimal128(_, _) => { | |||
// use the int32 to represent the decimal with low precision | |||
let array = column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could consider using as_primitive_array here
Benchmark runs are scheduled for baseline = a8276c0 and contender = ccb80e8. ccb80e8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Is this backwards compatible for datasets? Eg if 2022.parquet was written the old way, 2023.parquet was using the new physical type, can I query/read both with the mixed physical types? |
Yes, this should be transparent to users |
I was asking because of the hybrid schema. It’s good it depends on the logical schema instead of the physical types 🎉 |
Which issue does this PR close?
Closes #205
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?