-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: introduce Iceberg table type using metadata file #758
Conversation
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fun to see. Guess I should add a LanceTable
soon 😆
Ping @westonpace @vbarua @EpsilonPrime for approval. Would love to get this merged and pulled through to duckdb. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EpsilonPrime and I discussed this a little offline. One thing that he noted was that Gluten had done similar work on adapting Substrait for Iceberg. However, it appears their work was on adding relations and details needed to create "post-Iceberg, planned" queries (I doubt this is the correct phrasing). Primarily this seems to be adding the concept of deletion files to the read relation (which I'm not sure is the way I would prefer those to be handled).
In other words, this relation (the new read_type
) is very abstract and would represent the input to some kind of Iceberg-aware query engine or planner. It would then get transformed into a query that any old query engine could evaluate by some kind of Iceberg planner (or just directly evaluated).
I'm not sure any action needs to be taken, I don't think anything in here is necessarily misleading or vague, I'm just summarizing the offline conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the differences between this implementation and Gluten's. As @westonpace mentioned above, the Gluten format is more physical, having the capability of specifying files and portions of files to ignore. Ignoring content could be evolved into a generic feature at some point. This change doesn't prevent something like Gluten's iceberg solution to be included so this logical iceberg table type works for me.
One thing I didn't like about the Gluten implementation was the iceberg file format referencing other file formats. When the physical Iceberg implementation is added I'd like to see it done as part of IcebergTable.
|
||
// snapshot options. if none set, uses the current snapshot listed in the metadata file | ||
oneof snapshot { | ||
// the snapshot id to read. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Other sections capitalize the first letter of a comment.
Agree with @westonpace and @EpsilonPrime comments here. Gluten focuses on physical iceberg reading (post scan planning) so a very different thing than what is done here. Agree they are compatible and we should try to keep that in mind as we progress. |
Adds Iceberg table type and first sub-variety, reading manifest files directly.