[PROTOCOL RFC] Column Mapping Usage Tracking #2683

Merged: 6 commits, Mar 29, 2024
1 change: 1 addition & 0 deletions protocol_rfcs/README.md
@@ -19,6 +19,7 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
| Date proposed | RFC file | Github issue | RFC title |
|:-|:-|:-|:-|
| 2023-02-02 | [in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
| 2023-02-26 | [column-mapping-usage-tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |

### Accepted RFCs

43 changes: 43 additions & 0 deletions protocol_rfcs/column-mapping-usage-tracking.md
@@ -0,0 +1,43 @@
# Column Mapping Usage Tracking
**Associated GitHub issue for discussions: https://github.com/delta-io/delta/issues/2682**

This RFC proposes an extension to Column Mapping that tracks whether any column has been dropped or renamed during the history of a table.
As long as no column has been dropped or renamed, the (logical) name of a column can be used as its physical name while still ensuring that all physical names are unique.
This helps with the disablement of Column Mapping proposed in [#2481](https://github.com/delta-io/delta/issues/2481): when the physical names match the logical names, the table no longer needs to be rewritten, and it suffices to change the mode to `none`.

--------

> New subsection at the end of the `Column Mapping` section

## Usage Tracking

Column Mapping Usage Tracking is an extension of the column mapping feature that allows Delta to track whether a column has been dropped or renamed.
This is tracked by the table property `delta.columnMapping.hasDroppedOrRenamed`. This table property is set to `false` when the table is created, and flipped to `true` when the first column is either dropped or renamed.
The writer table feature `columnMappingUsageTracking` is added to the `writerFeatures` in the `protocol` to ensure that all writers correctly track when columns are dropped or renamed.
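
As a rough illustration of how a client might consume this property, the sketch below checks whether a table's physical names can be assumed to match its logical names. It assumes that `writerFeatures` is available as a list of strings and the table properties as a string-to-string map; the function name `physical_names_match_logical` is hypothetical and not part of the RFC.

```python
USAGE_TRACKING_FEATURE = "columnMappingUsageTracking"
HAS_DROPPED_OR_RENAMED = "delta.columnMapping.hasDroppedOrRenamed"


def physical_names_match_logical(writer_features, configuration):
    """Return True only when usage tracking is supported and the flag is still false.

    Hypothetical helper: a missing feature or a missing property is treated
    conservatively, i.e. as "a drop or rename may have happened".
    """
    if USAGE_TRACKING_FEATURE not in writer_features:
        return False
    # Table properties are assumed to be stored as strings.
    return configuration.get(HAS_DROPPED_OR_RENAMED) == "false"
```

Treating an absent property as "unknown" rather than `false` is a design choice of this sketch, made so that tables written without usage tracking are never mistaken for tables without drops or renames.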

--------

> Modification to the `Writer Requirements for Column Mapping` subsection

- Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.

**is replaced by**

- Assign a unique physical name to each column.
  - When enabling column mapping on an existing table, the physical name of each column must be set to the (logical) name of the column.
  - If the feature `columnMappingUsageTracking` is supported, then when adding a new column to a table whose `delta.columnMapping.hasDroppedOrRenamed` table property is `false`, the (logical) name of the column should be used as the physical name.
  - Otherwise the physical name must contain a universally unique identifier (UUID) to guarantee uniqueness.
- Assign a column id to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
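
The naming rule for a newly added column can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `assign_new_column`, its parameters, and the `col-` prefix on the UUID-based name are assumptions made for the example; the RFC only requires that the physical name contain a UUID in that case.

```python
import uuid


def assign_new_column(logical_name, max_column_id,
                      usage_tracking_supported, has_dropped_or_renamed):
    """Pick a physical name and column id for a column being added.

    Returns (physical_name, column_id). The caller is expected to store
    column_id as the new `delta.columnMapping.maxColumnId`.
    """
    if usage_tracking_supported and not has_dropped_or_renamed:
        # No drop or rename has happened, so the logical name cannot
        # collide with the physical name of an earlier column.
        physical_name = logical_name
    else:
        # "col-" is an illustrative prefix; only the UUID is required.
        physical_name = f"col-{uuid.uuid4()}"
    # Column ids increase monotonically from the current maximum.
    column_id = max_column_id + 1
    return physical_name, column_id
```

For example, `assign_new_column("price", 4, True, False)` reuses `price` as the physical name and assigns id 5, whereas the same call with `has_dropped_or_renamed=True` yields a UUID-based name.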

--------

> New subsection at the end of the `Writer Requirements for Column Mapping` subsection

### Writer Requirements for Usage Tracking

To support column mapping usage tracking, writers must:
- Write `protocol` and `metaData` actions when Column Mapping Usage Tracking is turned on for the first time:
  - Write a `protocol` action with writer version 7 and the feature `columnMappingUsageTracking` in the `writerFeatures`.
  - Write a `metaData` action with the table property `delta.columnMapping.hasDroppedOrRenamed` set to `false` when creating a new table, or set to `true` when enabling usage tracking on an existing table.
- When dropping or renaming a column, `delta.columnMapping.hasDroppedOrRenamed` must be set to `true`.
- After `delta.columnMapping.hasDroppedOrRenamed` is set to `true`, it must never be set back to `false`.
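
These requirements can be sketched with plain dictionaries standing in for the contents of the `protocol` action and the `configuration` map of the `metaData` action. The helper names below are hypothetical, and the sketch assumes that table property values are stored as strings; it is meant to illustrate the state transitions, not to prescribe an implementation.

```python
FEATURE = "columnMappingUsageTracking"
FLAG = "delta.columnMapping.hasDroppedOrRenamed"


def enable_usage_tracking(protocol, configuration, is_new_table):
    """Return updated (protocol, configuration) for turning tracking on."""
    features = list(protocol.get("writerFeatures", []))
    if FEATURE not in features:
        features.append(FEATURE)
    new_protocol = {
        **protocol,
        # Writer features require writer version 7.
        "minWriterVersion": max(7, protocol.get("minWriterVersion", 0)),
        "writerFeatures": features,
    }
    # An existing table may already contain dropped or renamed columns,
    # so the flag starts at "true" in that case.
    new_configuration = {**configuration, FLAG: "false" if is_new_table else "true"}
    return new_protocol, new_configuration


def after_drop_or_rename(configuration):
    """Return the configuration to commit with a column drop or rename."""
    return {**configuration, FLAG: "true"}


def check_flag_transition(old_configuration, new_configuration):
    """Reject a commit that would reset the flag from "true"."""
    if old_configuration.get(FLAG) == "true" and new_configuration.get(FLAG) != "true":
        raise ValueError(f"{FLAG} must never be set back to false")
```

A writer could call `check_flag_transition` just before committing a `metaData` action, so that an accidental reset of the flag fails the commit rather than silently breaking the uniqueness guarantee.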