[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error #14941
cc @davies
LGTM, pending Jenkins.
@sameeragarwal: Do you expect any performance impact from this commit? It's an additional
@heroldus decodeDictionaryIds() is only used when a batch spans pages with different encodings (dictionary or plain), so it's not on the hot path. I think the performance impact should be fine.
@davies Fine, thx.
Test build #64870 has finished for PR 14941 at commit
Merging this into master and the 2.0 branch, thanks!
… row groups shouldn't throw an error

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <[email protected]>

Closes #14941 from sameeragarwal/parquet-exception-2.

(cherry picked from commit a2c9acb)
Signed-off-by: Davies Liu <[email protected]>
…consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

Backports #14941 in 2.0.

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameeragcs.berkeley.edu>
Closes #14941 from sameeragarwal/parquet-exception-2.

Author: Sameer Agarwal <[email protected]>
Closes #14944 from sameeragarwal/branch-2.0.
What changes were proposed in this pull request?
This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.
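To make this failure mode concrete, here is a minimal, self-contained sketch of how a reused dictionary-ID buffer can go wrong across row groups. Everything in it is an assumption for illustration: `ReusableColumn`, `decode`, and `demo` are hypothetical names, not Spark's classes or methods, and the null-slot guard is only one plausible way to avoid reading stale IDs, not necessarily the change made in this patch.

```java
import java.util.Arrays;

// Illustrative only: hypothetical stand-ins, not Spark's vectorized reader classes.
class StaleDictionaryIdsSketch {

    /** A column buffer that is reused from one row group to the next. */
    static final class ReusableColumn {
        final int[] dictionaryIds;
        final boolean[] isNull;
        final int[] values;

        ReusableColumn(int capacity) {
            dictionaryIds = new int[capacity];
            isNull = new boolean[capacity];
            values = new int[capacity];
        }

        /** Starts a new batch: null flags are cleared, but old dictionary IDs are left in place. */
        void reset() {
            Arrays.fill(isNull, false);
        }
    }

    /** Eagerly replaces dictionary IDs with decoded values for rows [start, start + num). */
    static void decode(ReusableColumn col, int[] dictionary, int start, int num, boolean skipNulls) {
        for (int i = start; i < start + num; i++) {
            if (skipNulls && col.isNull[i]) {
                continue; // a null slot may still hold an ID from an earlier row group
            }
            col.values[i] = dictionary[col.dictionaryIds[i]];
        }
    }

    static String demo() {
        ReusableColumn col = new ReusableColumn(3);

        // Row group 1: a four-entry dictionary, every slot populated.
        int[] dict1 = {10, 20, 30, 40};
        col.dictionaryIds[0] = 3;
        col.dictionaryIds[1] = 2;
        col.dictionaryIds[2] = 3;
        decode(col, dict1, 0, 3, false);

        // Row group 2: a smaller dictionary. Slot 1 is null, so its ID is never overwritten.
        col.reset();
        int[] dict2 = {7, 8};
        col.dictionaryIds[0] = 1;
        col.isNull[1] = true; // stale ID 2 from row group 1 remains in the buffer
        col.dictionaryIds[2] = 0;

        String unguarded;
        try {
            decode(col, dict2, 0, 3, false);
            unguarded = "ok";
        } catch (ArrayIndexOutOfBoundsException e) {
            unguarded = "out of bounds";
        }

        // With the guard, the stale ID in the null slot is never looked up.
        decode(col, dict2, 0, 3, true);
        return unguarded + " / " + col.values[0] + "," + col.values[2];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the unguarded pass fails only because the second dictionary is smaller than the first and a null slot retains an old ID, which loosely mirrors the "certain distribution" of data mentioned above.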
How was this patch tested?
Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!