Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet] Fix zero bit wide bit field #2879

Closed
wants to merge 1 commit into from

Conversation

oerling
Copy link
Contributor

@oerling oerling commented Oct 19, 2022

Fix zero bit wide bit field

Parquet has been seen to encode a constant column as a dictionary with
a zero bit index field, i.e. all indices are 0. Recognizes this in bit
unpacking.

Fix E2EFilterTestBase to make constant strings if cardinality is

  1. Also, the addOneOffs parameter of makeStringDistribution did not
    have any effect but now adds outliers as intended.

Fix calling value hook on row group dictionary in DWRF. Exposed by the
fix in makeStringDistribution.

@netlify
Copy link

netlify bot commented Oct 19, 2022

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit f515d47
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6357067b3e3aa8000b389659

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 19, 2022
@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova mbasmanova changed the title Fix zero bit wide bit field [Parquet] Fix zero bit wide bit field Oct 19, 2022
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oerling Orri, thank you for the fix. The fix itself looks good, but the test seems to be copy-pasting a lot of existing code. It would be nice to refactor to avoid copy-paste.

CC: @majetideepak

@@ -2570,6 +2570,11 @@ void IntDecoder<isSigned>::decodeBitsLE(
const char* bufferEnd,
T* FOLLY_NONNULL result) {
uint64_t mask = bits::lowMask(bitWidth);
if (bitWidth == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is shared between Parquet and DWRF, right? If so, is this condition valid for DWRF? If not, should we throw if it happens?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably check bitWidth before calling this and throw if it is not valid in dwrf. It is not easy to differentiate them inside the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this method is called from a single place in a Parquet reader: velox/dwio/parquet/reader/RleDecoder.h. If that's the case, it is a bit confusing to have it in the "common" IntDecoder.

    super::decodeBitsLE(
        reinterpret_cast<const uint64_t*>(super::bufferStart),
        bitOffset_,
        folly::Range<const int32_t*>(rows + rowIndex, numRows),
        currentRow,
        bitWidth_,
        super::bufferEnd,
        reinterpret_cast<TIndex*>(values) + numValues);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is alpha will use that so it's put here for future proofing.

"velox/dwio/parquet/tests/reader", "../examples/" + fileName);
}

std::shared_ptr<Task> assertQuery(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a lot of code in this test is a copy-paste from velox/dwio/parquet/tests/ParquetTpchTestBase.h

It would be nice to refactor to avoid copy-paste.

CC: @majetideepak

"create table store as select * from read_parquet('{}')",
getExampleFilePath("store.snappy.parquet")));

auto result = duckDb_->execute("select s_store_sk, s_state from store");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems redundant.

@mbasmanova
Copy link
Contributor

@jinchengchenghh Would you verify that this fix addresses the issue you are seeing?

@oerling
Copy link
Contributor Author

oerling commented Oct 19, 2022 via email

@oerling
Copy link
Contributor Author

oerling commented Oct 19, 2022 via email

@mbasmanova
Copy link
Contributor

@oerling If you think the test does not add value, then let's remove it.

@jinchengchenghh
Copy link
Contributor

Verified local, it works well. Thank you very much!

lviiii added a commit to lviiii/velox that referenced this pull request Oct 24, 2022
Parquet has been seen to encode a constant column as a dictionary with
a zero bit index field, i.e. all indices are 0. Recognizes this in bit
unpacking.

Fix E2EFilterTestBase to make constant strings if cardinality is
1. Also, the addOneOffs parameter of makeStringDistribution did not
have any effect but now adds outliers as intended.

Fix calling value hook on row group dictionary in DWRF. Exposed by the
fix in makeStringDistribution.
@oerling oerling requested a review from Yuhta October 24, 2022 21:42
@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants