-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: Document Snapshot Summary Optional Fields for Standardization #11660
Spec: Document Snapshot Summary Optional Fields for Standardization #11660
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer Thanks for reviewing! I moved the whole section to Appendix G and update field descriptions accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding this new appendix section
@HonahX I think we are good on this one, I don't think we had any negative feedback from the dev list. So I think we are good to merge if you are ready. |
format/spec.md
Outdated
| **`deleted-duplicate-files`** | Number of duplicate files deleted (duplicates are files recorded more than once in the manifest) | | ||
| **`changed-partition-count`** | Number of partitions with files added or removed in the snapshot | | ||
|
||
### Other Fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other fields also include engine-name
, engine-version
, app-id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was on the edge of including these since they are not constants in SnapshotSummary.java
. engine-name
and engine-version
seem fine as they’re unambiguous and included in the view's summary spec: https://iceberg.apache.org/view-spec/#summary. However, app-id
feels too generic and poorly defined, so I’ve left it out.
I've added rows for engine-name
and engine-version
. Please let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark adds the spark.app.id
to the snapshot summary as seen here
https://iceberg.apache.org/docs/1.6.0/spark-queries/#snapshots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both spark.app.id
and app-id
can exist in snapshot summary when using spark:
{spark.app.id=local-1737052579245,
...
engine-version=3.5.2,
app-id=local-1737052579245,
engine-name=spark,
...
I personally prefer spark.app.id
since it clearly indicates that this field relates to Spark. I think it’s best to let each engine decide what engine-specific information to include in the summary and how to name it, rather than setting a general recommendation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, We should skip anything that's engine specific since the goal is to provide definitions for things that are universal and may be used by different engines. So we don't need spark.app.id here because no other engine needs a consistent definition of that.
add engine-name and engine-version
# Conflicts: # format/spec.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks @kevinjqliu @RussellSpitzer @sfc-gh-aixu @danielcweeks @rdblue for reviewing! The vote has passed: https://lists.apache.org/thread/mz01jwt69osqxhx9d3dd9xzncv9yncd0 I will go ahead and merge it. |
This PR introduces a new section, "Optional Snapshot Summary Fields", in the table spec under Appendix F to document optional fields in the snapshot summary, including metrics, and other fields such as Write-Audit-Publish (WAP)-related fields. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.
Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing