Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1685: Truncate Min/Max for Statistics #696

Merged
merged 2 commits into from
Nov 13, 2019

Conversation

shangxinli
Copy link
Contributor

Make sure you have checked all steps below.

Jira

PARQUET-1685: Truncate Min/Max for Statistics

Tests

Added unit test
Write a Spark application test and it passed

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not add the truncate mechanism to the Statistics objects (see details in the code comments). ParquetMetadataConverter might be a better place.

@shangxinli
Copy link
Contributor Author

Yes. Moved it to ParquetMetadataConverter for the write path only.

@@ -611,6 +624,18 @@ public static Statistics toParquetStatistics(
return formatStats;
}

public static void setStatisticsTruncateLength(int statisticsTruncateLength) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct this way. Writer has a constructor argument so you shall not set here statically. I would suggest creating a new toParquetStatistics method with this argument instead.

Copy link
Contributor Author

@shangxinli shangxinli Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know setting it here is a little hacky. The reasons for that are: 1) ParquetMetadataConverter is created as static inside ParquetFileWriter(don't' understand why). ParquetFileWriter has no chance to pass the parameter that it gets from it's constructor. 2) 'toParquetStatistics()' is defined as public static, so the constructor parameter doesn't work for it. But you suggested rewriting my own toParquetStatistics(). The concern is code duplicate.

The #2 is less concern that I can refactor the code a bit to reduce the duplicate. Any suggestions for #1? Maybe change metadataConverter to non-static? Concerns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just created another iteration by changing the metadataConverter to non-static and create my own 'toParquetStatistics'. Please have a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. I didn't want to recommend refactoring the usage of MetadataConverter for your change. I've just ment to have the truncateLength as a parameter of toParquetStatistics. You may keep the original method header as well so you do not need to rewrite all the calls only the ones are required for this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.

* @param pageWriteChecksumEnabled whether to write out page level checksums
* @throws IOException if the file can not be created
*/
public ParquetFileWriter(OutputFile file, MessageType schema, Mode mode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, check where the previous constructor is invoked and update those code parts as well.
In addition, if this change is targeted to the next release (1.11.0) and the previous constructor was not released yet, you may simply add the new argument to it. The key is to not break API compatibility between releases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I found I forgot to change the new ParquetFileWriter() inside ParquetWriter to the new signature. I will add it soon.

Since the 1.11.0 tag has been released and the current version is 1.12.0-SNAPSHOT, I would assume this will be in 1.12.0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because or faulty releasing process. 1.11.0 is not release yet. 1.11.0 tag will be moved to the finally released RC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Let's just add one parameter instead of creating a new signature.

@shangxinli shangxinli force-pushed the master branch 5 times, most recently from 18f938c to c33cd36 Compare November 5, 2019 18:36
@Test
public void testBinaryStatsWithTruncation() {
int defaultTruncLen = ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH;
int[] validLengths = {1, 2, 16, 64, defaultTruncLen - 1, defaultTruncLen - 1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you test defaultTruncLen -1 twice? I guess, defaultTrundLen itself shall also be tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That is a typo, I meant to test defaultTruncLen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- 1 is still there. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know how it is slipped. Fix it now.

testBinaryStatsWithTruncation(len);
}

int[] invalidLengths = {-1, 0, defaultTruncLen + 1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultTruncLen + 1 is invalid only because ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH is Integer.MAX_VALUE so it overflows. Meanwhile, semantically a larger value of the default one shall be supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. I will make it as "-1, 0, Integer.MAX_VALUE + 1"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know how it is slipped. Fix it now.

@@ -591,9 +609,17 @@ public static Statistics toParquetStatistics(
if (!stats.isEmpty() && stats.isSmallerThan(MAX_STATS_SIZE)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we are not truncating if the size is too large? It should be the other way around: check the size after the truncate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
}

private void testBinaryStatsWithTruncation(int truncateLen) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test data where we would reach the size limit of statistics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@Test
public void testBinaryStatsWithTruncation() {
int defaultTruncLen = ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH;
int[] validLengths = {1, 2, 16, 64, defaultTruncLen - 1, defaultTruncLen - 1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- 1 is still there. :)

testBinaryStatsWithTruncation(len);
}

int[] invalidLengths = {-1, 0, defaultTruncLen + 1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not fixed.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more finding.
The build/test fails because of PARQUET-1691. Please, wait with your last change for the PR #698 to get in so we can re-trigger the build and we will have all green lights.

for (int len : invalidLengths) {
try {
testBinaryStatsWithTruncation(len, 80, 20);
} catch (IllegalArgumentException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would not fail if no exception is thrown.

Suggested change
} catch (IllegalArgumentException e) {
Assert.fail("...");
} catch (IllegalArgumentException e) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants