PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding #910
Conversation
@gszadovszky @shangxinli @ggershinsky could you take a look? Thanks! The CI failed because the variable type changed - I'm not sure why this should be disallowed.
I have only one note about the downcast. Please check.

About the failures: this is indeed an incompatible change since it is done on a protected field in a public class. I can see two options to work around this.
- Keep the field as an int and check for overflow afterwards at every place where dictionaryByteSize is used (a sketch of this option follows below). I don't really like this idea myself; it might be risky and definitely not clean, but it would not trigger the compatibility checker.
- Exclude the related field from the compatibility checker. I don't have too much experience with this, but there are examples in the doc of the plugin. If you choose this solution, proper comments for the exclusion (in the pom.xml) would be required. It would also be nice to mention that after a minor release we shall remove that exclusion.
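For illustration, a minimal sketch of the first option (the class, field, and method names here are simplified assumptions, not the actual DictionaryValuesWriter code):

```java
// Sketch of option 1: keep the counter as an int and guard every accumulation site.
// Class, field and method names are illustrative only.
class DictionarySizeTracker {
  // stays an int, so the protected field keeps its binary-compatible type
  protected int dictionaryByteSize;

  void addEntry(int entryLength) {
    // Math.addExact throws ArithmeticException instead of silently wrapping
    // past Integer.MAX_VALUE, so the overflow cannot go unnoticed
    dictionaryByteSize = Math.addExact(dictionaryByteSize, entryLength);
  }
}
```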
```diff
@@ -173,7 +173,7 @@ public BytesInput getBytes() {
     BytesInput bytes = concat(BytesInput.from(bytesHeader), rleEncodedBytes);
     // remember size of dictionary when we last wrote a page
     lastUsedDictionarySize = getDictionarySize();
-    lastUsedDictionaryByteSize = dictionaryByteSize;
+    lastUsedDictionaryByteSize = (int) dictionaryByteSize;
```
This method should not be called in case shouldFallBack() returns true, but we should be on the safe side and check for the potential overflow.
Okay. Changed to Math.toIntExact so it will throw an exception on overflow.
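Math.toIntExact(long) throws an ArithmeticException when the value does not fit into an int, unlike a plain (int) cast which silently wraps. A standalone illustration (not the PR's code):

```java
public class ToIntExactDemo {
  public static void main(String[] args) {
    long fits = 2_000_000_000L;
    long tooBig = 3_000_000_000L; // larger than Integer.MAX_VALUE (2,147,483,647)

    System.out.println(Math.toIntExact(fits)); // 2000000000
    System.out.println((int) tooBig);          // -1294967296: a plain cast wraps silently

    try {
      Math.toIntExact(tooBig);
    } catch (ArithmeticException e) {
      System.out.println("overflow detected: " + e.getMessage()); // "integer overflow"
    }
  }
}
```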
If the user continues writing such large strings, will this throw an exception here? And that would block the writer, right?
I don't think so. With the fix, it should have already fallen back to PLAIN encoding during writeBytes before getting here.
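To make that concrete, a rough sketch of the fallback check (simplified names and threshold handling; the real logic lives in DictionaryValuesWriter and is only assumed here):

```java
// Illustrative only: with dictionaryByteSize widened to long, the size
// comparison itself can no longer overflow, so the writer falls back to
// PLAIN encoding well before the dictionary approaches the int range.
class FallbackSketch {
  protected long dictionaryByteSize;       // widened to long by this PR
  private final int maxDictionaryByteSize; // configured dictionary page size limit

  FallbackSketch(int maxDictionaryByteSize) {
    this.maxDictionaryByteSize = maxDictionaryByteSize;
  }

  boolean shouldFallBack() {
    return dictionaryByteSize > maxDictionaryByteSize;
  }
}
```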
@gszadovszky I took your second recommendation and disabled the compatibility check for the field. Also added a TODO there to remove it once the next minor release is done. Could you take another look? Thanks.
@sunchao Hi, did you encounter this in production?
@advancedxy yes, one of our users wrote very large strings and ended up with corrupted Parquet files because of this issue.
@gszadovszky Do you think we should fall back to no dictionary encoding in case of large byte size?
Thanks, @sunchao for your efforts!

@shangxinli, this case of large byte size is an overflow, meaning the dictionary size would be larger than can be represented as an int. Meanwhile, in the format the page header represents both compressed and uncompressed sizes as i32 values, so the current dictionary would be out of its limits anyway.

It is another question whether all the encoders used for writing data pages for binary values are prepared for similar situations. I would leave that question to @sunchao, since this issue seems to be a rare situation (I am not aware of similar jiras even though parquet-mr has worked like this since the beginning).
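To put a number on the i32 limit: roughly 2.2 million distinct 1 KB binary values already push a dictionary past what the page-header size fields can describe. A tiny back-of-the-envelope check with hypothetical figures:

```java
public class DictionarySizeLimit {
  public static void main(String[] args) {
    long entries = 2_200_000L;    // hypothetical number of distinct values
    long bytesPerEntry = 1_024L;  // hypothetical average value size
    long dictionaryBytes = entries * bytesPerEntry; // 2,252,800,000 bytes

    // Integer.MAX_VALUE is 2,147,483,647, so a dictionary this large cannot
    // be described by the i32 page-header size fields regardless of this fix.
    System.out.println(dictionaryBytes > Integer.MAX_VALUE); // true
  }
}
```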
Thanks @gszadovszky and @shangxinli for taking a look.

After falling back from dictionary encoding, I think it will fail later while writing the data page. I think this is still better than generating a corrupted page.
I agree, @sunchao, that failing with a proper exception is always better than committing a corrupt file. Any issues that might happen after the fallback should be handled in a separate jira.
* 'master' of https://github.com/apache/parquet-mr: (222 commits)
  PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding (apache#910)
  PARQUET-2041: Add zstd to `parquet.compression` description of ParquetOutputFormat Javadoc (apache#899)
  PARQUET-2050: Expose repetition & definition level from ColumnIO (apache#908)
  PARQUET-1761: Lower Logging Level in ParquetOutputFormat (apache#745)
  PARQUET-2046: Upgrade Apache POM to 23 (apache#904)
  PARQUET-2048: Deprecate BaseRecordReader (apache#906)
  PARQUET-1922: Deprecate IOExceptionUtils (apache#825)
  PARQUET-2037: Write INT96 with parquet-avro (apache#901)
  PARQUET-2044: Enable ZSTD buffer pool by default (apache#903)
  PARQUET-2038: Upgrade Jackson version used in parquet encryption. (apache#898)
  Revert "[WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)"
  PARQUET-2027: Fix calculating directory offset for merge (apache#896)
  [WIP] Refactor GroupReadSupport to unuse deprecated api (apache#894)
  PARQUET-2030: Expose page size row check configurations to ParquetWriter.Builder (apache#895)
  PARQUET-2031: Upgrade to parquet-format 2.9.0 (apache#897)
  PARQUET-1448: Review of ParquetFileReader (apache#892)
  PARQUET-2020: Remove deprecated modules (apache#888)
  PARQUET-2025: Update Snappy version to 1.1.8.3 (apache#893)
  PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream` (apache#889)
  PARQUET-1982: Random access to row groups in ParquetFileReader (apache#871)
  ...

# Conflicts:
#   parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java
#   parquet-hadoop/pom.xml
#   parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
#   parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java