feat: introduce Iceberg table type using metadata file #758

jacques-n · 2024-12-19T20:00:37Z

Adds Iceberg table type and first sub-variety, reading manifest files directly.

github-actions · 2024-12-19T20:00:58Z

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

westonpace

This is fun to see. Guess I should add a LanceTable soon 😆

jacques-n · 2025-01-06T18:43:34Z

Ping @westonpace @vbarua @EpsilonPrime for approval. Would love to get this merged and pulled through to duckdb.

westonpace

@EpsilonPrime and I discussed this a little offline. One thing he noted was that Gluten had done similar work on adapting Substrait for Iceberg. However, it appears their work was on adding the relations and details needed to create "post-Iceberg, planned" queries (I doubt this is the correct phrasing). Primarily this seems to add the concept of deletion files to the read relation (which I'm not sure is how I would prefer those to be handled).

In other words, this relation (the new read_type) is very abstract and would represent the input to some kind of Iceberg-aware query engine or planner. An Iceberg planner would then transform it into a query that any old query engine could evaluate (or it could just be evaluated directly).

I'm not sure any action needs to be taken; I don't think anything in here is necessarily misleading or vague. I'm just summarizing the offline conversation.

EpsilonPrime

I've reviewed the differences between this implementation and Gluten's. As @westonpace mentioned above, the Gluten format is more physical, having the capability of specifying files and portions of files to ignore. Ignoring content could be evolved into a generic feature at some point. This change doesn't prevent something like Gluten's Iceberg solution from being included, so this logical Iceberg table type works for me.

One thing I didn't like about the Gluten implementation was the Iceberg file format referencing other file formats. When the physical Iceberg implementation is added, I'd like to see it done as part of IcebergTable.

EpsilonPrime · 2025-01-03T03:22:25Z

proto/substrait/algebra.proto

+
+      // snapshot options. if none set, uses the current snapshot listed in the metadata file
+      oneof snapshot {
+        // the snapshot id to read.


Nit: Other sections capitalize the first letter of a comment.
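
For readers unfamiliar with the snapshot options quoted above, here is a minimal Python sketch of how a consumer might resolve which snapshot to read. It is not taken from this PR. The function name `resolve_snapshot_id`, its parameters, and the sample data are hypothetical. The JSON keys (`current-snapshot-id`, `snapshots`, `snapshot-log`, `snapshot-id`, `timestamp-ms`) follow the Iceberg table metadata layout as I understand it, and the "latest entry at or before the timestamp" rule is an assumption rather than something the proto comment states.

```python
from typing import Optional

# Hypothetical, already-parsed Iceberg metadata file (values are made up).
SAMPLE_METADATA = {
    "current-snapshot-id": 3,
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 100},
        {"snapshot-id": 2, "timestamp-ms": 200},
        {"snapshot-id": 3, "timestamp-ms": 300},
    ],
    "snapshot-log": [
        {"snapshot-id": 1, "timestamp-ms": 100},
        {"snapshot-id": 2, "timestamp-ms": 200},
        {"snapshot-id": 3, "timestamp-ms": 300},
    ],
}


def resolve_snapshot_id(
    metadata: dict,
    snapshot_id: Optional[int] = None,
    as_of_timestamp_ms: Optional[int] = None,
) -> Optional[int]:
    """Pick the snapshot to read, mirroring the `oneof snapshot` options."""
    # A oneof allows at most one option to be set.
    if snapshot_id is not None and as_of_timestamp_ms is not None:
        raise ValueError("set at most one snapshot option")

    if snapshot_id is not None:
        known = {s["snapshot-id"] for s in metadata.get("snapshots", [])}
        if snapshot_id not in known:
            raise KeyError(f"unknown snapshot id: {snapshot_id}")
        return snapshot_id

    if as_of_timestamp_ms is not None:
        # Assumption: use the snapshot that was current at the given time,
        # i.e. the latest log entry at or before the timestamp.
        candidates = [
            entry
            for entry in metadata.get("snapshot-log", [])
            if entry["timestamp-ms"] <= as_of_timestamp_ms
        ]
        if not candidates:
            raise LookupError("no snapshot at or before the given timestamp")
        return max(candidates, key=lambda e: e["timestamp-ms"])["snapshot-id"]

    # Neither option set: fall back to the current snapshot in the metadata.
    return metadata.get("current-snapshot-id")
```

Calling `resolve_snapshot_id(SAMPLE_METADATA)` with no options falls through to the metadata file's current snapshot, which is the default behavior the proto comment describes.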

jacques-n · 2025-01-08T00:03:52Z

Agree with the comments from @westonpace and @EpsilonPrime here. Gluten focuses on physical Iceberg reading (post scan planning), which is very different from what is done here. Agree they are compatible, and we should try to keep that in mind as we progress.

feat: Introduce Iceberg table type using metadata files

90f46cc

jacques-n requested review from cpcloud, westonpace, EpsilonPrime and vbarua as code owners December 19, 2024 20:00

jacques-n requested a review from rdblue December 19, 2024 20:00

Fix protobuf formatting

739179d

jacques-n changed the title ~~feat: Introduce Iceberg table type using metadata file~~ feat: introduce Iceberg table type using metadata file Dec 20, 2024

jacques-n requested a review from rymurr December 20, 2024 02:18

rymurr reviewed Dec 20, 2024

westonpace previously approved these changes Dec 20, 2024

Address review comments.

1381a5e

jacques-n dismissed westonpace’s stale review via 1381a5e January 2, 2025 21:56

jacques-n added 2 commits January 2, 2025 11:58

apply buf format.

e511e86

update timestamp format

10682c6

westonpace approved these changes Jan 6, 2025

EpsilonPrime approved these changes Jan 6, 2025

jacques-n merged commit 7434e2f into substrait-io:main Jan 8, 2025
13 checks passed

jacques-n deleted the iceberg_table branch January 8, 2025 00:04
