-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent naming for Parquet page index structures #6097
Comments
cc @alamb |
Thank you for writing this ticket @etseidl. I think the naming of the various metadata structures in the parquet-rs crate to be quite confusing I believe the core of my confusion stems from the two sets of structures:
The current naming of the structures does not make the distinction clear and this I think the boundary gets muddled I think the best thing we can do is to pick a naming convention that makes this distinction clear |
What do you think about creating a Rust struct This would be more in line with how the existing RowGroupMetaDataand ColumnChunkMetaData represent the contents of thrift structures. So that might look like
Maybe we could do something similar to wrap the 🤔 cc @sunchao @adriangb @tustvold @XiangpengHao @Ted-Jiang @liukun4515 @viirya in case you have thoughts about the metadata structures |
I'm in favor. I like how the current |
+1. The current state of |
I don't have a strong opinion, I haven't spent as much time as the rest of you with these structs, but I do find them confusing and am in favor of changes to make it easier to understand 😃 Some top-level documentation of how these things interact could also be helpful. |
Perhaps we can steal the excellent description here. 😉 |
I could also commission some more monodraw diagrams 😆 |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The Parquet Page Index consists of two thrift structures: a
ColumnIndex
containing per-page statistics useful for filtering, and anOffsetIndex
used to locate individual pages in the Parquet file. Size statistics were recently added to these structures (PARQUET-2261) and these are currently being added to arrow-rs (#5022). When #5022 is completed, the Parquet page index will be represented in theparquet
crate by theIndex
enum for the column index, and a newOffsetIndexMetaData
struct for the offset index (replacing the old offset index which simply returned thepage_locations
member of the thriftOffsetIndex
). Somewhat confusingly, theIndex
encapsulates a vector ofPageIndex
structs, which contain per-page column index information.Further,
ParquetMetaData
packages the page indexes for each row group into type aliased 2-D arrays:ParquetColumnIndex = Vec<Vec<Index>>
andParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>
.Describe the solution you'd like
It would be desirable to make the naming for all of these structures a bit more consistent, so their intended use is a bit more obvious. One possible mapping could be:
Index
toParquetColumnIndex
(and perhapsPageIndex
->ColumnIndexForPage
).OffsetIndexMetaData
toParquetOffsetIndex
.ParquetColumnIndex
->ParquetColumnIndexes
andParquetOffsetIndex
->ParquetOffsetIndexes
.Describe alternatives you've considered
Maintain the status quo, or some other better names.
Additional context
See the comments here and here for more context.
The text was updated successfully, but these errors were encountered: