Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build: Bump Apache Parquet 1.14.4 #11502

Merged
merged 6 commits into from
Nov 20, 2024
Merged

Build: Bump Apache Parquet 1.14.4 #11502

merged 6 commits into from
Nov 20, 2024

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Nov 8, 2024

No description provided.

@Fokko Fokko changed the title Test out Apache Parquet 1.14.4 Test out Apache Parquet 1.14.4 RC2 Nov 8, 2024
Comment on lines 226 to 228
Row booleanCol = Row.of(36L, 4L, 0L, null, false, true);
Row decimalCol = Row.of(91L, 4L, 1L, null, new BigDecimal("1.00"), new BigDecimal("2.00"));
Row doubleCol = Row.of(91L, 4L, 0L, 1L, 1.0D, 2.0D);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional] should we refactor this to pick file_size from the Datafiles themselve like we did we did in JDK 17 upgrade PR #7391 (comment)

Never the less looks like size in bytes is increasing in this version is it because they are more accurate now ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @singhpk234, that's an excellent suggestion. I've copied your approach here as well. Parquet now also tracks how large the data is in memory after compression (this is handy for strings where you don't know that upfront) so you can allocate buffers directly to the right size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how large the data is in memory after compression (this is handy for strings where you don't know that upfront) so you can allocate buffers directly to the right size.

This is precisely what we needed in Redshift as well, our CBO was falling behind with variable length data types, will give them HeadsUp ! Thankyou @Fokko

@jbonofre jbonofre self-requested a review November 9, 2024 09:38
Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Parquet update looks good, I'm just wondering about the row size increase in the test. I would add at least in the comment in the test to explain the reason.

@Fokko Fokko changed the title Test out Apache Parquet 1.14.4 RC2 Build: Bump Apache Parquet 1.14.4 Nov 12, 2024
@Fokko Fokko marked this pull request as ready for review November 12, 2024 12:45
@Fokko Fokko mentioned this pull request Nov 13, 2024
Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @Fokko !

@Fokko Fokko requested a review from nastra November 19, 2024 11:14
// size of the column to increase. For example, with Parquet 1.14.x the
// uncompressed size has been added to allow for better allocation of memory upfront.
// Therefore, we look the sizes up, rather than hardcoding them
DataFile dataFile = table.currentSnapshot().addedDataFiles(table.io()).iterator().next();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that we're assuming only a single file, so we might as well use Iterables.getOnlyElement(table.currentSnapshot().addedDataFiles(table.io()))

@Fokko Fokko merged commit 657fa86 into apache:main Nov 20, 2024
49 checks passed
@Fokko Fokko deleted the fd-parq branch November 20, 2024 08:35
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
* Revert "Revert "Build: Bump parquet from 1.13.1 to 1.14.3 (apache#11264)" (apache#11462)"

This reverts commit 7cc16fa.

* Bump to Parquet 1.14.4

* Lookup sizes instead

* Update build.gradle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants