Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce Iceberg table type using metadata file #758

Merged
merged 5 commits into from
Jan 8, 2025

Conversation

jacques-n
Copy link
Contributor

Adds Iceberg table type and first sub-variety, reading manifest files directly.

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@jacques-n jacques-n changed the title feat: Introduce Iceberg table type using metadata file feat: introduce Iceberg table type using metadata file Dec 20, 2024
@jacques-n jacques-n requested a review from rymurr December 20, 2024 02:18
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Show resolved Hide resolved
westonpace
westonpace previously approved these changes Dec 20, 2024
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fun to see. Guess I should add a LanceTable soon 😆

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved
@jacques-n
Copy link
Contributor Author

Ping @westonpace @vbarua @EpsilonPrime for approval. Would love to get this merged and pulled through to duckdb.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EpsilonPrime and I discussed this a little offline. One thing that he noted was that Gluten had done similar work on adapting Substrait for Iceberg. However, it appears their work was on adding relations and details needed to create "post-Iceberg, planned" queries (I doubt this is the correct phrasing). Primarily this seems to be adding the concept of deletion files to the read relation (which I'm not sure is the way I would prefer those to be handled).

In other words, this relation (the new read_type) is very abstract and would represent the input to some kind of Iceberg-aware query engine or planner. It would then get transformed into a query that any old query engine could evaluate by some kind of Iceberg planner (or just directly evaluated).

I'm not sure any action needs to be taken, I don't think anything in here is necessarily misleading or vague, I'm just summarizing the offline conversation.

Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the differences between this implementation and Gluten's. As @westonpace mentioned above, the Gluten format is more physical, having the capability of specifying files and portions of files to ignore. Ignoring content could be evolved into a generic feature at some point. This change doesn't prevent something like Gluten's iceberg solution to be included so this logical iceberg table type works for me.

One thing I didn't like about the Gluten implementation was the iceberg file format referencing other file formats. When the physical Iceberg implementation is added I'd like to see it done as part of IcebergTable.


// snapshot options. if none set, uses the current snapshot listed in the metadata file
oneof snapshot {
// the snapshot id to read.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Other sections capitalize the first letter of a comment.

@jacques-n
Copy link
Contributor Author

Agree with @westonpace and @EpsilonPrime comments here. Gluten focuses on physical iceberg reading (post scan planning) so a very different thing than what is done here. Agree they are compatible and we should try to keep that in mind as we progress.

@jacques-n jacques-n merged commit 7434e2f into substrait-io:main Jan 8, 2025
13 checks passed
@jacques-n jacques-n deleted the iceberg_table branch January 8, 2025 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants