-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1685: Truncate Min/Max for Statistics #696
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not add the truncate mechanism to the Statistics objects (see details in the code comments). ParquetMetadataConverter might be a better place.
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
Outdated
Show resolved
Hide resolved
Yes. Moved it to ParquetMetadataConverter for the write path only. |
@@ -611,6 +624,18 @@ public static Statistics toParquetStatistics( | |||
return formatStats; | |||
} | |||
|
|||
public static void setStatisticsTruncateLength(int statisticsTruncateLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct this way. Writer has a constructor argument so you shall not set here statically. I would suggest creating a new toParquetStatistics
method with this argument instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know setting it here is a little hacky. The reasons for that are: 1) ParquetMetadataConverter is created as static inside ParquetFileWriter(don't' understand why). ParquetFileWriter has no chance to pass the parameter that it gets from it's constructor. 2) 'toParquetStatistics()' is defined as public static, so the constructor parameter doesn't work for it. But you suggested rewriting my own toParquetStatistics(). The concern is code duplicate.
The #2 is less concern that I can refactor the code a bit to reduce the duplicate. Any suggestions for #1? Maybe change metadataConverter to non-static? Concerns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just created another iteration by changing the metadataConverter to non-static and create my own 'toParquetStatistics'. Please have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response. I didn't want to recommend refactoring the usage of MetadataConverter
for your change. I've just ment to have the truncateLength as a parameter of toParquetStatistics
. You may keep the original method header as well so you do not need to rewrite all the calls only the ones are required for this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks.
* @param pageWriteChecksumEnabled whether to write out page level checksums | ||
* @throws IOException if the file can not be created | ||
*/ | ||
public ParquetFileWriter(OutputFile file, MessageType schema, Mode mode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, check where the previous constructor is invoked and update those code parts as well.
In addition, if this change is targeted to the next release (1.11.0) and the previous constructor was not released yet, you may simply add the new argument to it. The key is to not break API compatibility between releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I found I forgot to change the new ParquetFileWriter() inside ParquetWriter to the new signature. I will add it soon.
Since the 1.11.0 tag has been released and the current version is 1.12.0-SNAPSHOT, I would assume this will be in 1.12.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because or faulty releasing process. 1.11.0 is not release yet. 1.11.0 tag will be moved to the finally released RC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Let's just add one parameter instead of creating a new signature.
18f938c
to
c33cd36
Compare
@Test | ||
public void testBinaryStatsWithTruncation() { | ||
int defaultTruncLen = ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH; | ||
int[] validLengths = {1, 2, 16, 64, defaultTruncLen - 1, defaultTruncLen - 1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you test defaultTruncLen -1
twice? I guess, defaultTrundLen
itself shall also be tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. That is a typo, I meant to test defaultTruncLen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 1
is still there. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know how it is slipped. Fix it now.
testBinaryStatsWithTruncation(len); | ||
} | ||
|
||
int[] invalidLengths = {-1, 0, defaultTruncLen + 1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaultTruncLen + 1
is invalid only because ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH
is Integer.MAX_VALUE
so it overflows. Meanwhile, semantically a larger value of the default one shall be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. I will make it as "-1, 0, Integer.MAX_VALUE + 1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know how it is slipped. Fix it now.
@@ -591,9 +609,17 @@ public static Statistics toParquetStatistics( | |||
if (!stats.isEmpty() && stats.isSmallerThan(MAX_STATS_SIZE)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we are not truncating if the size is too large? It should be the other way around: check the size after the truncate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
} | ||
} | ||
|
||
private void testBinaryStatsWithTruncation(int truncateLen) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add test data where we would reach the size limit of statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
@Test | ||
public void testBinaryStatsWithTruncation() { | ||
int defaultTruncLen = ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH; | ||
int[] validLengths = {1, 2, 16, 64, defaultTruncLen - 1, defaultTruncLen - 1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 1
is still there. :)
testBinaryStatsWithTruncation(len); | ||
} | ||
|
||
int[] invalidLengths = {-1, 0, defaultTruncLen + 1}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not fixed.
...t-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more finding.
The build/test fails because of PARQUET-1691. Please, wait with your last change for the PR #698 to get in so we can re-trigger the build and we will have all green lights.
for (int len : invalidLengths) { | ||
try { | ||
testBinaryStatsWithTruncation(len, 80, 20); | ||
} catch (IllegalArgumentException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would not fail if no exception is thrown.
} catch (IllegalArgumentException e) { | |
Assert.fail("..."); | |
} catch (IllegalArgumentException e) { |
Make sure you have checked all steps below.
Jira
PARQUET-1685: Truncate Min/Max for Statistics
Tests
Added unit test
Write a Spark application test and it passed
Commits
Documentation