Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet] Fix zero bit wide bit field #2879

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions velox/dwio/common/ColumnVisitors.h
Original file line number Diff line number Diff line change
Expand Up @@ -1347,8 +1347,8 @@ class ExtractStringDictionaryToGenericHook {
hook_->addValue(rowIndex, &view);
} else {
VELOX_DCHECK(state_.inDictionary);
auto view = folly::StringPiece(
reinterpret_cast<const StringView*>(state_.dictionary.values)[value]);
auto view = folly::StringPiece(reinterpret_cast<const StringView*>(
state_.dictionary2.values)[value - dictionarySize()]);
hook_->addValue(rowIndex, &view);
}
}
Expand Down
5 changes: 5 additions & 0 deletions velox/dwio/common/IntDecoder.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2570,6 +2570,11 @@ void IntDecoder<isSigned>::decodeBitsLE(
const char* bufferEnd,
T* FOLLY_NONNULL result) {
uint64_t mask = bits::lowMask(bitWidth);
if (bitWidth == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is shared between Parquet and DWRF, right? If so, is this condition valid for DWRF? If not, should we throw if it happens?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably check bitWidth before calling this and throw if it is not valid in dwrf. It is not easy to differentiate them inside the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this method is called from a single place in a Parquet reader: velox/dwio/parquet/reader/RleDecoder.h. If that's the case, it is a bit confusing to have it in the "common" IntDecoder.

    super::decodeBitsLE(
        reinterpret_cast<const uint64_t*>(super::bufferStart),
        bitOffset_,
        folly::Range<const int32_t*>(rows + rowIndex, numRows),
        currentRow,
        bitWidth_,
        super::bufferEnd,
        reinterpret_cast<TIndex*>(values) + numValues);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is alpha will use that so it's put here for future proofing.

// A column of dictionary indices can be 0 bits wide if all indices are 0.
memset(result, 0, rows.size() * sizeof(T));
return;
}
// We subtract rowBias * bitWidth bits from the starting position.
bitOffset -= rowBias * bitWidth;
if (bitOffset < 0) {
Expand Down
2 changes: 1 addition & 1 deletion velox/dwio/common/tests/DecodeBitsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -126,7 +126,7 @@ class DecodeBitsTest : public testing::Test {
};

TEST_F(DecodeBitsTest, allWidths) {
for (auto width = 1; width < bitPackedData_.size(); ++width) {
for (auto width = 0; width < bitPackedData_.size(); ++width) {
testDecodeRows<int32_t>(width, allRows_);
testDecodeRows<int64_t>(width, allRows_);
testDecodeRows<int32_t>(width, oddRows_);
Expand Down
9 changes: 6 additions & 3 deletions velox/dwio/common/tests/E2EFilterTestBase.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -110,17 +110,20 @@ void E2EFilterTestBase::makeStringDistribution(
continue;
}
std::string value;
if (counter % 100 < cardinality) {
if (counter % 2251 < 100 || cardinality == 1) {
// Run of 100 ascending values every 2251 rows. If cardinality is 1, the
// value is repeated here.
value = fmt::format("s{}", counter % cardinality);
strings->set(row, StringView(value));
} else if (counter % 100 > 90 && row > 0) {
strings->copy(strings, row - 1, row, 1);
// Sequence of 10 identical values every 100 rows.
strings->copy(strings, row, row - 1, 1);
} else if (addOneOffs && counter % 234 == 0) {
value = fmt::format(
"s{}",
folly::Random::rand32(filterGenerator->rng()) %
(111 * cardinality));

strings->set(row, StringView(value));
} else {
value = fmt::format(
"s{}", folly::Random::rand32(filterGenerator->rng()) % cardinality);
Expand Down
4 changes: 3 additions & 1 deletion velox/dwio/parquet/tests/reader/E2EFilterTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -345,10 +345,12 @@ TEST_F(E2EFilterTest, stringDirect) {
TEST_F(E2EFilterTest, stringDictionary) {
testWithTypes(
"string_val:string,"
"string_val_2:string",
"string_val_2:string,"
"string_const: string",
[&]() {
makeStringDistribution(Subfield("string_val"), 100, true, false);
makeStringDistribution(Subfield("string_val_2"), 170, false, true);
makeStringDistribution(Subfield("string_const"), 1, true, false);
},
false,
{"string_val", "string_val_2"},
Expand Down