-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parquet complex type handling #9187
Conversation
Hi @jaystarshot! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
✅ Deploy Preview for meta-velox canceled.
|
df0987a
to
2f420cc
Compare
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
int16_t repeatedAncestor = maxDefine_; | ||
auto node = this; | ||
do { | ||
if (node->isOptional_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roughly based on https://github.com/apache/arrow/blob/main/cpp/src/parquet/level_conversion.h#L125.
the do() is to make sure that for nested types we are going inside and checking optional since the child layers optional (which we ignored) is needed by arrow
@@ -309,6 +318,9 @@ std::shared_ptr<const ParquetTypeWithId> ReaderBase::getParquetColumnInfo( | |||
VELOX_CHECK_EQ(children.size(), 1); | |||
const auto& child = children[0]; | |||
auto grandChildren = child->getChildren(); | |||
// This level will not be repeated since parquet will have a child layer with the repeated info which we are ignoring here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you elaborate more? it's a bit confusing that quote "This level will not be repeated" and next line isRepeated = true
3dc0bf9
to
0d6b1b8
Compare
@@ -342,7 +341,7 @@ TEST_F(ParquetTableScanTest, DISABLED_optArrayReqEle) { | |||
|
|||
// Required array with required elements. | |||
// Core dump is fixed, but the result is incorrect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment about result being incorrect now that it is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ensure all related DISABLED_ testes are reenabled. Also, in case we want a UT in ParquetReaderTest.cpp as well is a unit test here https://github.com/facebookincubator/velox/pull/8425/files
90a4d38
to
7d4acb3
Compare
@jaystarshot Try debug build, it seems the OSS one is only running parquet tests with only release build |
Couldn't repro locally with the debug build as well |
Maybe add a int64_t ReaderBase::rowGroupUncompressedSize(
int32_t rowGroupIndex,
const dwio::common::TypeWithId& type) const {
if (type.column() != ParquetTypeWithId::kNonLeaf) {
VELOX_CHECK_LT(type.column(), fileMetaData_->row_groups[rowGroupIndex].columns.size());
return fileMetaData_->row_groups[rowGroupIndex]
.columns[type.column()]
.meta_data.total_uncompressed_size;
}
int64_t sum = 0;
for (auto child : type.getChildren()) {
sum += rowGroupUncompressedSize(rowGroupIndex, *child);
}
return sum;
} Apparently the type has more columns than the metadata has |
Was the array test the only failure? If so it could be that there could be some issue in the test file and I can remove this test. |
@jaystarshot It's both |
I am wondering if this could be related to the crash #9223 is fixing. |
We can either keep those tests disabled in this PR or cherry pick that commit in this same PR and try to merge both |
eeecedf
to
9e2670f
Compare
@Yuhta Can you please try with this? |
@jaystarshot I was about to add unit test for #9223 . Let me know whether I should proceed. |
Lets check if this fixes the crashes observed in meta's environment, if they do then we may not need tests |
Can you rebase to latest to fix the build? |
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Co-authored-by: hitarth <[email protected]>
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: Fixes facebookincubator#7776 Parquet has notion of optional and repeated layers which is needed in arrow calls like [DefLevelsToBitmap](https://github.com/facebookincubator/velox/blob/7fc09667d5e22c684fdeff81da529b79cc974fee/velox/dwio/parquet/reader/PageReader.cpp#L573). This info is passed using arrow:LevelInfo. We were incorrectly computing **repeatedAncestor** by ignoring optional fields which is fixed in this PR. Parquet has 3 level structure for nested types https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists ``` // List<String> (list non-null, elements nullable) 1. required group my_list (LIST) { 2. repeated group list { 3. optional binary element (UTF8); } } ``` However when we read this and convert to **ParquetTypeWithId** in current velox parquet reader, we ignore the intermediated layer 2. **repeated group list** (grandfather logic) in https://github.com/facebookincubator/velox/pull/9187/files#diff-64787e76c1b0ad12b5764770a94acd62054896a762ccead8f083a71a060f2f44R325. Pull Request resolved: facebookincubator#9187 Reviewed By: mbasmanova Differential Revision: D55975472 Pulled By: Yuhta fbshipit-source-id: d0972b3134cc710645a9f50cd74a23efac830751
Fixes #7776
Parquet has notion of optional and repeated layers which is needed in arrow calls like DefLevelsToBitmap.
This info is passed using arrow:LevelInfo. We were incorrectly computing repeatedAncestor by ignoring optional fields which is fixed in this PR.
Parquet has 3 level structure for nested types
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
However when we read this and convert to ParquetTypeWithId in current velox parquet reader, we ignore the intermediated layer 2. repeated group list (grandfather logic) in https://github.com/facebookincubator/velox/pull/9187/files#diff-64787e76c1b0ad12b5764770a94acd62054896a762ccead8f083a71a060f2f44R325.