Expose Dictionary to reader #1270
Comments
I'd be very interested in any details you can share about your particular use case, in particular whether there is any way we might be able to combine efforts in this space. The proposal in #1191 is just that, a proposal, and any input you'd be willing to provide would be most appreciated 👍 If you're using arrow, I'd also draw your attention to #1180, which will preserve the dictionary encoding present in the parquet file for dictionary arrays and is slated for inclusion as the default behaviour in arrow 9.
@alamb @tustvold My use case is a proprietary analytical DB engine. It has its own proprietary storage format but also allows running queries against external formats like Parquet. As it already has a highly optimized scan capability for dictionary-encoded data, all I want is for it to have access to the raw Parquet dictionary. I don't want to take a dependency on Arrow arrays for that, as I'm not using Arrow at all: I don't deserialize Parquet into Arrow, since the engine I'm working on has its own in-memory representation (I don't even build Arrow with Parquet).
Thank you for taking the time to respond. I figured that might be the case, but thought it couldn't hurt to check. I'm sure you're aware, but just as a heads up: if you're reading the data directly, the RLE encoding is not length preserving (see #1111 (comment)), and a column chunk may not be consistently dictionary encoded (e.g. if the dictionary gets too large). FWIW, some generics were added in #1041, and have evolved since, to aid decoding columns to custom in-memory representations. They're currently crate-local, but that could be changed if you wished to use them. Just let me know 😀
@tustvold My engine assumes a column is either fully dictionary encoded or not, so for my use case I first have to scan the headers of all pages in a column chunk to assert that they're all dictionary encoded. If any of them are not (other than the dictionary page itself, of course), I treat the column as not dictionary encoded, meaning I'll read it with a ColumnReader instead of a PageReader and let the library handle the variously encoded pages.
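The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the parquet crate's API: `Encoding`, `PageKind`, `PageHeader` and `is_fully_dictionary_encoded` are hypothetical stand-ins for whatever page metadata the reader actually exposes. It assumes, per the Parquet format, that dictionary-encoded data pages are marked `PLAIN_DICTIONARY` (deprecated) or `RLE_DICTIONARY`.

```rust
// Hypothetical, simplified stand-ins for page metadata; not the parquet crate's types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Plain,
    PlainDictionary,
    RleDictionary,
    DeltaBinaryPacked,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageKind {
    Dictionary,
    Data,
}

#[derive(Debug, Clone, Copy)]
struct PageHeader {
    kind: PageKind,
    encoding: Encoding,
}

/// Returns true only if the chunk has a dictionary page and every data page
/// uses a dictionary encoding. The dictionary page's own encoding is ignored.
fn is_fully_dictionary_encoded(headers: &[PageHeader]) -> bool {
    let mut saw_dictionary = false;
    for header in headers {
        match header.kind {
            PageKind::Dictionary => saw_dictionary = true,
            PageKind::Data => {
                let dict_encoded = matches!(
                    header.encoding,
                    Encoding::PlainDictionary | Encoding::RleDictionary
                );
                if !dict_encoded {
                    // One fallback page is enough to treat the whole column
                    // chunk as not dictionary encoded.
                    return false;
                }
            }
        }
    }
    saw_dictionary
}
```

If this returns false, the column would be read through the regular ColumnReader path; if true, the raw dictionary and index pages can be handed to the engine directly.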
Many Parquet query engines have optimizations that rely on dictionary-encoded columns, e.g. for selections with filters.
The Rust implementation of the Parquet reader makes it difficult to read dictionary-encoded values because it doesn't expose the RLE decoder to reader code. A reader that wishes to work with dictionary values therefore has to re-implement an RLE decoder to read values from dictionary-encoded data pages.
This can be easily addressed by making the RLE code public outside the crate.
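To illustrate what a reader currently has to re-implement, here is a rough sketch of decoding dictionary indices from the values section of a dictionary-encoded data page. It follows the Parquet format's RLE/bit-packing hybrid encoding as I understand it: one leading byte gives the bit width, then a sequence of runs, each starting with a ULEB128 header whose low bit selects bit-packed (groups of 8 values) or RLE. The function names are hypothetical and are not part of the parquet crate, and the sketch assumes the input slice starts at the bit-width byte (i.e. any repetition/definition levels have already been skipped).

```rust
/// Reads a ULEB128 varint from `buf` at `*pos`, advancing `*pos`.
fn read_uleb128(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut result = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed varint
        }
    }
}

/// Decodes `num_values` dictionary indices. Returns None on truncated or
/// malformed input. Hypothetical helper, not the crate's decoder.
fn decode_dictionary_indices(page: &[u8], num_values: usize) -> Option<Vec<u32>> {
    let (&bit_width, data) = page.split_first()?;
    let bit_width = bit_width as usize;
    if bit_width > 32 {
        return None;
    }
    let mask: u64 = (1u64 << bit_width) - 1;
    let mut out = Vec::with_capacity(num_values);
    let mut pos = 0usize;
    while out.len() < num_values {
        let header = read_uleb128(data, &mut pos)?;
        let count = (header >> 1) as usize;
        let remaining = num_values - out.len();
        if header & 1 == 1 {
            // Bit-packed run: `count` groups of 8 values, packed LSB first.
            let end = pos.checked_add(count.checked_mul(bit_width)?)?;
            let mut bytes = data.get(pos..end)?.iter();
            pos = end;
            let (mut acc, mut bits) = (0u64, 0usize);
            // The last group may be padded, so stop at `remaining` values.
            for _ in 0..count.saturating_mul(8).min(remaining) {
                while bits < bit_width {
                    acc |= u64::from(*bytes.next()?) << bits;
                    bits += 8;
                }
                out.push((acc & mask) as u32);
                acc >>= bit_width;
                bits -= bit_width;
            }
        } else {
            // RLE run: one little-endian value of ceil(bit_width / 8) bytes,
            // repeated `count` times.
            let end = pos.checked_add((bit_width + 7) / 8)?;
            let bytes = data.get(pos..end)?;
            pos = end;
            let mut value = 0u32;
            for (i, b) in bytes.iter().enumerate() {
                value |= u32::from(*b) << (8 * i);
            }
            out.extend(std::iter::repeat(value).take(count.min(remaining)));
        }
    }
    Some(out)
}
```

For example, with a bit width of 3, the bytes `[3, 0x03, 0x88, 0xC6, 0xFA]` hold a single bit-packed group containing the indices 0 through 7. Exposing the crate's existing RLE decoder would make a hand-rolled version like this unnecessary.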