
PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding #910

Merged

2 commits merged into apache:master on May 26, 2021

Conversation

@sunchao (Member) commented May 21, 2021

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@sunchao (Member, Author) commented May 21, 2021

@gszadovszky @shangxinli @ggershinsky could you take a look? thanks!

The CI failed because a variable's type changed - I'm not sure why this should be disallowed.

@gszadovszky (Contributor) left a comment

I have only one note about the downcast. Please check.

About the failures: this is indeed an incompatible change, since it is made on a protected field in a public class. I can see two options to work around it.

  1. Keep the field as an int and check for overflow afterwards at every place where dictionaryByteSize is used. I don't really like this idea myself - it might be risky and is definitely not clean, but it would not trigger the compatibility checker.
  2. Exclude the related field from the compatibility checker. I don't have much experience with this, but there are examples in the plugin's documentation. If you choose this solution, proper comments for the exclusion (in the pom.xml) would be required. It would also be nice to mention that we should remove that exclusion after the next minor release.

@@ -173,7 +173,7 @@ public BytesInput getBytes() {
   BytesInput bytes = concat(BytesInput.from(bytesHeader), rleEncodedBytes);
   // remember size of dictionary when we last wrote a page
   lastUsedDictionarySize = getDictionarySize();
-  lastUsedDictionaryByteSize = dictionaryByteSize;
+  lastUsedDictionaryByteSize = (int) dictionaryByteSize;
@gszadovszky (Contributor) commented May 21, 2021

This method should not be called when shouldFallBack() returns true, but we should be on the safe side and check for the potential overflow.

@sunchao (Member, Author) replied

Okay. Changed to Math.toIntExact so it will throw an exception on overflow.
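For illustration only (a standalone sketch, not the parquet-mr code itself), here is the difference between a plain narrowing cast and Math.toIntExact when the long value no longer fits in an int:

public class ToIntExactDemo {
  public static void main(String[] args) {
    long dictionaryByteSize = Integer.MAX_VALUE + 1L; // one byte past the int range

    // A plain narrowing cast silently wraps to a negative value,
    // which is how a corrupt size can end up in the written file.
    System.out.println((int) dictionaryByteSize);             // -2147483648

    try {
      // Math.toIntExact fails fast instead of wrapping silently.
      System.out.println(Math.toIntExact(dictionaryByteSize));
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage()); // "integer overflow"
    }
  }
}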

A contributor commented

If the user continues writing such large strings, will an exception be thrown here? And will it block the writer?

@sunchao (Member, Author) replied

I don't think so. With the fix, it should have already fallen back to PLAIN encoding during writeBytes before getting here.
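As a rough sketch of why widening the counter matters for the fallback path (the class, field, and method names below are illustrative stand-ins, not the real DictionaryValuesWriter API):

// Illustrative sketch only - names and the threshold are hypothetical.
class DictionaryWriterSketch {
  private final long maxDictionaryByteSize = 1024L * 1024L; // assumed 1 MiB threshold
  private long dictionaryByteSize;                          // was an int before the fix

  void writeBytes(byte[] value) {
    // Count the value's bytes plus an assumed 4-byte length prefix per entry.
    dictionaryByteSize += value.length + 4L;
  }

  boolean shouldFallBack() {
    // With an int counter, a huge binary value can push the sum past
    // Integer.MAX_VALUE so it wraps negative; "negative > threshold" is then
    // false and the writer never falls back. A long counter keeps the
    // comparison correct, so the fallback triggers before getBytes() runs.
    return dictionaryByteSize > maxDictionaryByteSize;
  }
}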

@sunchao (Member, Author) commented May 22, 2021

@gszadovszky I took your second recommendation and disabled the compatibility check for the field. I also added a TODO there to remove it once the next minor release is done. Could you take another look? Thanks.

@advancedxy commented

@sunchao Hi, did you encounter this in production?

@sunchao (Member, Author) commented May 23, 2021

@advancedxy yes, one of our users wrote very large strings and ended up with corrupted Parquet files because of this issue.

@shangxinli (Contributor) commented

@gszadovszky Do you think we should fall back to non-dictionary encoding in the case of a large byte size?

@gszadovszky (Contributor) left a comment

Thanks, @sunchao for your efforts!

@shangxinli, this case of a large byte size is an overflow, meaning the dictionary size would be larger than can be represented as an int. Meanwhile, in the format the page header represents both compressed and uncompressed sizes as i32 values, so the current dictionary would be out of the format's limits anyway.
It is another question whether all the encoders used for writing data pages of binary values are prepared for similar situations. I would leave that question to @sunchao, since this seems to be a rare situation (I am not aware of similar Jiras, even though parquet-mr has worked like this since the beginning).
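To make that i32 ceiling concrete, a small back-of-the-envelope check (the example sizes are made up for illustration):

public class PageSizeLimit {
  public static void main(String[] args) {
    // The page header stores compressed_page_size and uncompressed_page_size as i32,
    // so Integer.MAX_VALUE bytes (~2 GiB) is a hard ceiling for any single page.
    long oneMiB = 1024L * 1024L;
    System.out.println(Integer.MAX_VALUE / oneMiB + " MiB"); // ~2047 MiB

    // For example, nine 256 MiB binary values already exceed what an i32 can hold.
    long dictionaryBytes = 9L * 256L * oneMiB;
    System.out.println(dictionaryBytes > Integer.MAX_VALUE);  // true
  }
}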

@sunchao (Member, Author) commented May 25, 2021

Thanks @gszadovszky and @shangxinli for taking a look.

It is another question whether all the encoders used for writing data pages of binary values are prepared for similar situations.

After the fallback from dictionary encoding, I think it will fail later while writing the data page. I think this is still better than generating a corrupted page.

@gszadovszky (Contributor) commented

I agree, @sunchao, that failing with a proper exception is always better than committing a corrupt file. Any issues that might happen after the fallback should be handled in a separate Jira.

@gszadovszky merged commit 819443b into apache:master on May 26, 2021
elikkatz added a commit to TheWeatherCompany/parquet-mr that referenced this pull request Jun 2, 2021
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Sep 9, 2021
shangxinli pushed a commit that referenced this pull request Sep 9, 2021