
PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding #910

Merged

2 commits merged into apache:master on May 26, 2021

Conversation

@sunchao (Member) commented May 21, 2021

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@sunchao (Member, Author) commented May 21, 2021

@gszadovszky @shangxinli @ggershinsky could you take a look? thanks!

The CI failed because a variable's type changed - I'm not sure why this should be disallowed.

@gszadovszky (Contributor) left a comment

I have only one note about the downcast. Please check.

About the failures: this is indeed an incompatible change, since it is made on a protected field in a public class. I can see two options to work around it.

  1. Keep the field as an int and check for overflow afterwards at every place where dictionaryByteSize is used. I don't really like this idea myself - it might be risky and is definitely not clean, but it would not trigger the compatibility checker.
  2. Exclude the related field from the compatibility checker. I don't have much experience with this, but there are examples in the plugin's documentation. If you choose this solution, proper comments for the exclusion (in the pom.xml) would be required. It would also be nice to mention that we should remove that exclusion after the next minor release.

@@ -173,7 +173,7 @@ public BytesInput getBytes() {
   BytesInput bytes = concat(BytesInput.from(bytesHeader), rleEncodedBytes);
   // remember size of dictionary when we last wrote a page
   lastUsedDictionarySize = getDictionarySize();
-  lastUsedDictionaryByteSize = dictionaryByteSize;
+  lastUsedDictionaryByteSize = (int) dictionaryByteSize;
@gszadovszky (Contributor) commented May 21, 2021

This method should not be called when shouldFallBack() returns true, but we should be on the safe side and check for the potential overflow.

@sunchao (Member, Author) replied

Okay. Changed to Math.toIntExact so it will throw an exception on overflow.
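For illustration only (a standalone sketch, not the parquet-mr code itself), here is the difference between a plain narrowing cast and Math.toIntExact when the long value no longer fits in an int:

public class ToIntExactDemo {
  public static void main(String[] args) {
    long dictionaryByteSize = Integer.MAX_VALUE + 1L; // one byte past the int range

    // A plain narrowing cast silently wraps to a negative value,
    // which is how a corrupt size can end up in the written file.
    System.out.println((int) dictionaryByteSize);             // -2147483648

    try {
      // Math.toIntExact fails fast instead of wrapping silently.
      System.out.println(Math.toIntExact(dictionaryByteSize));
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage()); // "integer overflow"
    }
  }
}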

A contributor commented

If the user continues writing such large strings, will an exception be thrown here? And will it block the writer?

@sunchao (Member, Author) replied

I don't think so. With the fix, it should have already fallen back to PLAIN encoding during writeBytes before getting here.
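As a rough sketch of why widening the counter matters for the fallback path (the class, field, and method names below are illustrative stand-ins, not the real DictionaryValuesWriter API):

// Illustrative sketch only - names and the threshold are hypothetical.
class DictionaryWriterSketch {
  private final long maxDictionaryByteSize = 1024L * 1024L; // assumed 1 MiB threshold
  private long dictionaryByteSize;                          // was an int before the fix

  void writeBytes(byte[] value) {
    // Count the value's bytes plus an assumed 4-byte length prefix per entry.
    dictionaryByteSize += value.length + 4L;
  }

  boolean shouldFallBack() {
    // With an int counter, a huge binary value can push the sum past
    // Integer.MAX_VALUE so it wraps negative; "negative > threshold" is then
    // false and the writer never falls back. A long counter keeps the
    // comparison correct, so the fallback triggers before getBytes() runs.
    return dictionaryByteSize > maxDictionaryByteSize;
  }
}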

@sunchao (Member, Author) commented May 22, 2021

@gszadovszky I took your second recommendation and disabled the compatibility check for the field. I also added a TODO there to remove it once the next minor release is done. Could you take another look? Thanks.

@advancedxy commented

@sunchao Hi, did you encounter this in production?

@sunchao (Member, Author) commented May 23, 2021

@advancedxy yes, one of our users wrote very large strings and ended up with corrupted Parquet files because of this issue.

@shangxinli (Contributor) commented

@gszadovszky Do you think we should fall back to non-dictionary encoding in the case of a large byte size?

@gszadovszky (Contributor) left a comment

Thanks, @sunchao for your efforts!

@shangxinli, this case of a large byte size is an overflow, meaning the dictionary size would be larger than can be represented as an int. Meanwhile, in the format the page header represents both compressed and uncompressed sizes as i32 values, so the current dictionary would be out of the format's limits anyway.
It is another question whether all the encoders used for writing data pages of binary values are prepared for similar situations. I would leave that question to @sunchao, since this seems to be a rare situation (I am not aware of similar Jiras, even though parquet-mr has worked like this since the beginning).
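To make that i32 ceiling concrete, a small back-of-the-envelope check (the example sizes are made up for illustration):

public class PageSizeLimit {
  public static void main(String[] args) {
    // The page header stores compressed_page_size and uncompressed_page_size as i32,
    // so Integer.MAX_VALUE bytes (~2 GiB) is a hard ceiling for any single page.
    long oneMiB = 1024L * 1024L;
    System.out.println(Integer.MAX_VALUE / oneMiB + " MiB"); // ~2047 MiB

    // For example, nine 256 MiB binary values already exceed what an i32 can hold.
    long dictionaryBytes = 9L * 256L * oneMiB;
    System.out.println(dictionaryBytes > Integer.MAX_VALUE);  // true
  }
}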

@sunchao (Member, Author) commented May 25, 2021

Thanks @gszadovszky and @shangxinli for taking a look.

It is another question whether all the encoders used for writing data pages of binary values are prepared for similar situations.

After the fallback from dictionary encoding, I think it will fail later while writing the data page. I think this is still better than generating a corrupted page.

@gszadovszky (Contributor) commented

I agree, @sunchao, that failing with a proper exception is always better than committing a corrupt file. Any issues that might happen after the fallback should be handled in a separate Jira.

@gszadovszky merged commit 819443b into apache:master on May 26, 2021
elikkatz added a commit to TheWeatherCompany/parquet-mr that referenced this pull request Jun 2, 2021
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Sep 9, 2021
shangxinli pushed a commit that referenced this pull request Sep 9, 2021