Build: Bump Apache Parquet 1.14.4 #11502
Conversation
…)" (apache#11462)" This reverts commit 7cc16fa.
Row booleanCol = Row.of(36L, 4L, 0L, null, false, true);
Row decimalCol = Row.of(91L, 4L, 1L, null, new BigDecimal("1.00"), new BigDecimal("2.00"));
Row doubleCol = Row.of(91L, 4L, 0L, 1L, 1.0D, 2.0D);
[optional] Should we refactor this to pick file_size from the DataFiles themselves, like we did in the JDK 17 upgrade PR #7391 (comment)?
Nevertheless, it looks like the size in bytes is increasing in this version. Is that because the sizes are more accurate now?
Hey @singhpk234, that's an excellent suggestion. I've copied your approach here as well. Parquet now also tracks how large the data is in memory after decompression (this is handy for strings, where you don't know that upfront), so you can allocate buffers of exactly the right size.
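To make the size discussion concrete, here is a small standalone sketch (not part of this PR) that prints the per-column sizes Parquet records in the file footer. getTotalSize() and getTotalUncompressedSize() are long-standing ColumnChunkMetaData accessors; the extra size statistics written by 1.14.x are a plausible reason the per-column sizes grew in the test.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ColumnSizeInspector {
  public static void main(String[] args) throws Exception {
    // Path to any Parquet file, e.g. one written with 1.14.x.
    Path path = new Path(args[0]);
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
      for (BlockMetaData block : reader.getFooter().getBlocks()) {
        for (ColumnChunkMetaData column : block.getColumns()) {
          // getTotalSize() is the compressed on-disk size; getTotalUncompressedSize()
          // is what a reader needs to hold the chunk in memory after decompression.
          System.out.printf(
              "%s: on-disk=%d bytes, uncompressed=%d bytes%n",
              column.getPath(), column.getTotalSize(), column.getTotalUncompressedSize());
        }
      }
    }
  }
}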
how large the data is in memory after decompression (this is handy for strings, where you don't know that upfront), so you can allocate buffers of exactly the right size.
This is precisely what we needed in Redshift as well; our CBO was falling behind with variable-length data types. I will give them a heads-up! Thank you @Fokko
The Parquet update looks good; I'm just wondering about the row size increase in the test. I would at least add a comment in the test to explain the reason.
LGTM, thanks @Fokko!
// size of the column to increase. For example, with Parquet 1.14.x the
// uncompressed size has been added to allow for better allocation of memory upfront.
// Therefore, we look the sizes up rather than hardcoding them.
DataFile dataFile = table.currentSnapshot().addedDataFiles(table.io()).iterator().next();
It seems that we're assuming only a single file, so we might as well use Iterables.getOnlyElement(table.currentSnapshot().addedDataFiles(table.io())).
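For reference, a minimal sketch of that suggestion, assuming the test's table is in scope and exactly one data file was added by the snapshot; fileSizeInBytes() and columnSizes() come from Iceberg's DataFile and let the test avoid hardcoded byte counts:

import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;

// getOnlyElement fails fast if the snapshot added more (or fewer) than one file,
// instead of silently picking the first of several.
DataFile dataFile =
    Iterables.getOnlyElement(table.currentSnapshot().addedDataFiles(table.io()));
long fileSize = dataFile.fileSizeInBytes();              // whole-file size in bytes
Map<Integer, Long> columnSizes = dataFile.columnSizes(); // field id -> on-disk bytes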
* Revert "Revert "Build: Bump parquet from 1.13.1 to 1.14.3 (apache#11264)" (apache#11462)" This reverts commit 7cc16fa. * Bump to Parquet 1.14.4 * Lookup sizes instead * Update build.gradle