-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Parquet] Fix zero bit wide bit field #2879
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
6df6c87
to
68d7401
Compare
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oerling Orri, thank you for the fix. The fix itself looks good, but the test seems to be copy-pasting a lot of existing code. It would be nice to refactor to avoid copy-paste.
CC: @majetideepak
@@ -2570,6 +2570,11 @@ void IntDecoder<isSigned>::decodeBitsLE( | |||
const char* bufferEnd, | |||
T* FOLLY_NONNULL result) { | |||
uint64_t mask = bits::lowMask(bitWidth); | |||
if (bitWidth == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is shared between Parquet and DWRF, right? If so, is this condition valid for DWRF? If not, should we throw if it happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably check bitWidth
before calling this and throw if it is not valid in dwrf. It is not easy to differentiate them inside the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this method is called from a single place in a Parquet reader: velox/dwio/parquet/reader/RleDecoder.h. If that's the case, it is a bit confusing to have it in the "common" IntDecoder.
super::decodeBitsLE(
reinterpret_cast<const uint64_t*>(super::bufferStart),
bitOffset_,
folly::Range<const int32_t*>(rows + rowIndex, numRows),
currentRow,
bitWidth_,
super::bufferEnd,
reinterpret_cast<TIndex*>(values) + numValues);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is alpha will use that so it's put here for future proofing.
"velox/dwio/parquet/tests/reader", "../examples/" + fileName); | ||
} | ||
|
||
std::shared_ptr<Task> assertQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a lot of code in this test is a copy-paste from velox/dwio/parquet/tests/ParquetTpchTestBase.h
It would be nice to refactor to avoid copy-paste.
CC: @majetideepak
"create table store as select * from read_parquet('{}')", | ||
getExampleFilePath("store.snappy.parquet"))); | ||
|
||
auto result = duckDb_->execute("select s_store_sk, s_state from store"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code seems redundant.
@jinchengchenghh Would you verify that this fix addresses the issue you are seeing? |
The parquet query test test is entirely optional. The code path is tested by just including a 0 wide case in the decode bits test.
ORC V2 does not seem to have 0 bit wide fields. Can’t say about Alpha. On balance, a zero represented by 0 bits is sort of consistent.
We’d generally prefer not to check in files, so can delete the whole parquet query test. LMK what you prefer.
Orri
|
DWRF does not have bit fields. ORC V2 does but nbnot zero-wide. There is a switch in ORC V2 for a width enum which produces a range check. Alpha is uncertain.
|
@oerling If you think the test does not add value, then let's remove it. |
Verified local, it works well. Thank you very much! |
68d7401
to
495da39
Compare
Parquet has been seen to encode a constant column as a dictionary with a zero bit index field, i.e. all indices are 0. Recognizes this in bit unpacking. Fix E2EFilterTestBase to make constant strings if cardinality is 1. Also, the addOneOffs parameter of makeStringDistribution did not have any effect but now adds outliers as intended. Fix calling value hook on row group dictionary in DWRF. Exposed by the fix in makeStringDistribution.
495da39
to
f515d47
Compare
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Fix zero bit wide bit field
Parquet has been seen to encode a constant column as a dictionary with
a zero bit index field, i.e. all indices are 0. Recognizes this in bit
unpacking.
Fix E2EFilterTestBase to make constant strings if cardinality is
have any effect but now adds outliers as intended.
Fix calling value hook on row group dictionary in DWRF. Exposed by the
fix in makeStringDistribution.