fix infinite loop in not fully packed bit-packed runs #1555

Merged: 3 commits, Apr 15, 2022

@tustvold tustvold commented Apr 13, 2022

Which issue does this PR close?

Closes #1458

Cherry-picked from #1460. I will add a test and then mark this ready for review.

Rationale for this change

See ticket

What changes are included in this PR?

See ticket

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label Apr 13, 2022
@tustvold tustvold marked this pull request as ready for review April 15, 2022 12:59
@alamb

alamb commented Apr 15, 2022

fyi @anliakho2

@alamb alamb left a comment

While I am not super familiar with the code, I did review the test and verified it fails without this code change.

Also the code change makes sense to me.

I also looked for other places where get_batch is called. I found one in the DeltaBitPackedEncoder that maybe has a similar problem:

https://github.com/tustvold/arrow-rs/blob/fix-rle-infinite-loop/parquet/src/encodings/decoding.rs#L653

It also appears to be used a few other times.

/Users/alamb/Software/arrow-rs/parquet/src/column/reader/decoder.rs
260:                 Ok(reader.get_batch::<i16>(&mut out[range], *bit_width as usize))
/Users/alamb/Software/arrow-rs/parquet/src/data_type.rs
696:             let values_read = bit_reader.get_batch(&mut buffer[..num_values], 1);
/Users/alamb/Software/arrow-rs/parquet/src/encodings/decoding.rs
655:                 .get_batch(&mut buffer[read..read + batch_to_read], bit_width);
/Users/alamb/Software/arrow-rs/parquet/src/encodings/levels.rs
271:                     decoder.get_batch::<i16>(&mut buffer[..len], bit_width as usize);
/Users/alamb/Software/arrow-rs/parquet/src/encodings/rle.rs
418:                 num_values = bit_reader.get_batch::<T>(
471:                     num_values = bit_reader.get_batch::<i32>(
/Users/alamb/Software/arrow-rs/parquet/src/util/bit_util.rs
1062:         let values_read = reader.get_batch::<T>(&mut batch, num_bits);

@@ -743,6 +753,42 @@ mod tests {
}
}

#[test]
fn test_truncated_rle() {

FWIW, this test times out without the code in this PR 👍, which is a good sign to me that it covers the issue:

test encodings::rle::tests::test_truncated_rle has been running for over 60 seconds

(I also tried removing each of the two cases, dict and non-dict, and the test hung in both.)
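
For readers unfamiliar with the failure mode, here is a minimal sketch of how a truncated bit-packed run can hang a decode loop and how a zero-progress guard avoids it. It assumes the hang comes from a reader that returns zero values while the run header still claims values remain. `ToyBitReader` and `decode_bit_packed_run` are hypothetical names invented for this illustration; they are not the parquet crate's `BitReader` or `RleDecoder`, and this is not the exact code of the PR.

```rust
/// Hypothetical stand-in for a bit reader: yields `bit_width`-bit values, LSB first.
struct ToyBitReader<'a> {
    data: &'a [u8],
    bit_offset: usize,
}

impl<'a> ToyBitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, bit_offset: 0 }
    }

    /// Fills `out` with as many whole values as remain in `data`.
    /// Returns the number of values written, which may be 0.
    fn get_batch(&mut self, out: &mut [u32], bit_width: usize) -> usize {
        assert!(bit_width <= 32);
        let total_bits = self.data.len() * 8;
        let mut n = 0;
        while n < out.len() && self.bit_offset + bit_width <= total_bits {
            let mut value = 0u32;
            for i in 0..bit_width {
                let bit = self.bit_offset + i;
                let b = (self.data[bit / 8] >> (bit % 8)) & 1;
                value |= (b as u32) << i;
            }
            out[n] = value;
            self.bit_offset += bit_width;
            n += 1;
        }
        n
    }
}

/// Decodes one bit-packed run whose header claims `declared` values,
/// even if `data` holds fewer. Returns the number of values decoded.
fn decode_bit_packed_run(
    data: &[u8],
    bit_width: usize,
    declared: usize,
    out: &mut [u32],
) -> usize {
    let mut reader = ToyBitReader::new(data);
    let mut bit_packed_left = declared;
    let mut values_read = 0;
    while values_read < out.len() && bit_packed_left > 0 {
        let want = (out.len() - values_read).min(bit_packed_left);
        let got = reader.get_batch(&mut out[values_read..values_read + want], bit_width);
        if got == 0 {
            // The run was truncated by the writer. Without this guard,
            // neither counter changes and the loop spins forever.
            bit_packed_left = 0;
            continue;
        }
        bit_packed_left -= got;
        values_read += got;
    }
    values_read
}
```

With two bytes of 2-bit values and a header claiming 16 values, the sketch decodes the 8 values that are present and then stops instead of looping.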

@tustvold

> I found one in the DeltaBitPackedEncoder that maybe has a similar problem:

In this case the specification is very clear that the miniblock can't be truncated:

> If there are not enough values to fill the last miniblock, we pad the miniblock so that its length is always the number of values in a full miniblock multiplied by the bit width. The values of the padding bits should be zero, but readers must accept paddings consisting of arbitrary bits as well.

And we return an error a line below if it is, so I think we should be ok.

Good shout to check though 👍

The other cases don't appear to have loops, and GenericColumnReader, which drives them, will bail out if a page is truncated.
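
The miniblock padding rule quoted from the specification can be sketched as a length check. This is an illustration only: `miniblock_byte_len` and `check_miniblock` are hypothetical helpers, not functions from the parquet crate, and the sketch checks byte lengths where the real decoder may compare value counts instead.

```rust
/// Size in bytes of one padded miniblock: the number of values in a full
/// miniblock multiplied by the bit width, rounded up to whole bytes.
fn miniblock_byte_len(values_per_miniblock: usize, bit_width: usize) -> usize {
    (values_per_miniblock * bit_width + 7) / 8
}

/// Returns the number of bytes the miniblock occupies, or an error if the
/// remaining input is too short to hold a full, padded miniblock.
fn check_miniblock(
    remaining: &[u8],
    values_per_miniblock: usize,
    bit_width: usize,
) -> Result<usize, String> {
    let expected = miniblock_byte_len(values_per_miniblock, bit_width);
    if remaining.len() < expected {
        // A truncated miniblock is reported as an error rather than retried,
        // so there is no loop that could fail to make progress.
        return Err(format!(
            "expected {} bytes in miniblock, got {}",
            expected,
            remaining.len()
        ));
    }
    Ok(expected)
}
```

Because the padded length depends only on the miniblock size and bit width, a short read can be detected up front and turned into an error.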

@alamb

alamb commented Apr 15, 2022

I plan to merge this after all (non miri) tests pass

@alamb alamb merged commit 68be0f1 into apache:master Apr 15, 2022

Successfully merging this pull request may close these issues.

Hang due to infinite loop when reading some parquet files with RLE encoding and bit packing