Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Document Snapshot Summary Optional Fields for Standardization #11660

Merged
merged 11 commits into from
Jan 24, 2025

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Nov 26, 2024

This PR introduces a new section, "Optional Snapshot Summary Fields", in the table spec under Appendix F to document optional fields in the snapshot summary, including metrics, and other fields such as Write-Audit-Publish (WAP)-related fields. The goal is to establish a clear standard for these fields, ensuring consistent naming and usage across implementations while reducing ambiguity and improving compatibility.

Proposal Here: https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Nov 26, 2024
@HonahX HonahX linked an issue Nov 26, 2024 that may be closed by this pull request
6 tasks
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer Thanks for reviewing! I moved the whole section to Appendix G and update field descriptions accordingly

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
@HonahX HonahX marked this pull request as ready for review January 3, 2025 01:06
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding this new appendix section

@RussellSpitzer
Copy link
Member

@HonahX I think we are good on this one, I don't think we had any negative feedback from the dev list. So I think we are good to merge if you are ready.

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated
| **`deleted-duplicate-files`** | Number of duplicate files deleted (duplicates are files recorded more than once in the manifest) |
| **`changed-partition-count`** | Number of partitions with files added or removed in the snapshot |

### Other Fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other fields also include engine-name, engine-version, app-id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the edge of including these since they are not constants in SnapshotSummary.java. engine-name and engine-version seem fine as they’re unambiguous and included in the view's summary spec: https://iceberg.apache.org/view-spec/#summary. However, app-id feels too generic and poorly defined, so I’ve left it out.

I've added rows for engine-name and engine-version. Please let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark adds the spark.app.id to the snapshot summary as seen here
https://iceberg.apache.org/docs/1.6.0/spark-queries/#snapshots

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both spark.app.id and app-id can exist in snapshot summary when using spark:

{spark.app.id=local-1737052579245, 
...
engine-version=3.5.2, 
app-id=local-1737052579245, 
engine-name=spark, 
...

I personally prefer spark.app.id since it clearly indicates that this field relates to Spark. I think it’s best to let each engine decide what engine-specific information to include in the summary and how to name it, rather than setting a general recommendation here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, We should skip anything that's engine specific since the goal is to provide definitions for things that are universal and may be used by different engines. So we don't need spark.app.id here because no other engine needs a consistent definition of that.

add engine-name and engine-version
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
Copy link
Contributor

@sfc-gh-aixu sfc-gh-aixu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@HonahX
Copy link
Contributor Author

HonahX commented Jan 24, 2025

Thanks @kevinjqliu @RussellSpitzer @sfc-gh-aixu @danielcweeks @rdblue for reviewing!

The vote has passed: https://lists.apache.org/thread/mz01jwt69osqxhx9d3dd9xzncv9yncd0

I will go ahead and merge it.

@HonahX HonahX merged commit c0c1b15 into apache:main Jan 24, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document Snapshot Summary Optional Fields for Standardization
7 participants