From f0b0878c88294f79562c8d86efbeb8cc94c79270 Mon Sep 17 00:00:00 2001
From: Tom van Bussel
Date: Fri, 29 Mar 2024 02:09:07 +0100
Subject: [PATCH] [PROTOCOL RFC] Column Mapping Usage Tracking (#2683)

#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [x] Other (fill in here)

## Description

Adds the proposal for spec change Column Mapping Usage Tracking (see https://github.com/delta-io/delta/issues/2682) to the RFC folder.

## How was this patch tested?

N/A

## Does this PR introduce _any_ user-facing changes?

N/A
---
 protocol_rfcs/README.md | 13 +++---
 .../column-mapping-usage-tracking.md | 43 +++++++++++++++++++
 2 files changed, 50 insertions(+), 6 deletions(-)
 create mode 100644 protocol_rfcs/column-mapping-usage-tracking.md

diff --git a/protocol_rfcs/README.md b/protocol_rfcs/README.md
index fcf40e00063..1e2e7d4fed3 100644
--- a/protocol_rfcs/README.md
+++ b/protocol_rfcs/README.md
@@ -16,12 +16,13 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
 
 ### Proposed RFCs
 
-| Date proposed | RFC file | Github issue | RFC title |
-|:--------------|:-----------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:------------------------------|
-| 2023-02-02 | [in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
-| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
-| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
-| 2023-02-28 | [vacuum-protocol-check.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/vacuum-protocol-check.md) | https://github.com/delta-io/delta/issues/2630 | Enforce Vacuum Protocol Check |
+| Date proposed | RFC file | Github issue | RFC title |
+|:--------------|:----------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:------------------------------|
+| 2023-02-02 | [in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
+| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
+| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
+| 2023-02-26 | [column-mapping-usage-tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
+| 2023-02-28 | [vacuum-protocol-check.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/vacuum-protocol-check.md) | https://github.com/delta-io/delta/issues/2630 | Enforce Vacuum Protocol Check |
 
 ### Accepted RFCs
 
diff --git a/protocol_rfcs/column-mapping-usage-tracking.md b/protocol_rfcs/column-mapping-usage-tracking.md
new file mode 100644
index 00000000000..692af3e300c
--- /dev/null
+++ b/protocol_rfcs/column-mapping-usage-tracking.md
@@ -0,0 +1,43 @@
+# Column Mapping Usage Tracking
+**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2682**
+
+This RFC proposes an extension for Column Mapping to track whether columns have been dropped or renamed during the history of a table.
+This allows using the (logical) name of a column as the physical name of a column, while still ensuring that all physical names are unique.
+This helps with the disablement of Column Mapping proposed in [#2481](https://github.com/delta-io/delta/issues/2481): in this case the table no longer needs to be rewritten, and it suffices to change the mode to `none`.
+
+--------
+
+> New subsection at the end of the `Column Mapping` section
+
+## Usage Tracking
+
+Column Mapping Usage Tracking is an extension of the column mapping feature that allows Delta to track whether a column has been dropped or renamed.
+This is tracked by the table property `delta.columnMapping.hasDroppedOrRenamed`. This table property is set to `false` when the table is created, and flipped to `true` when the first column is either dropped or renamed.
+The writer table feature `columnMappingUsageTracking` is added to the `writerFeatures` in the `protocol` to ensure that all writers correctly track when columns are dropped or renamed.
+
+--------
+
+> Modification to the `Writer Requirements for Column Mapping` subsection
+
+- Assign a globally unique identifier as the physical name for each new column that is added to the schema. This is especially important for supporting cheap column deletions in `name` mode. In addition, column identifiers need to be assigned to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+
+**is replaced by**
+
+- Assign a unique physical name to each column.
+  - When enabling column mapping on an existing table, the physical name of the column must be set to the (logical) name of the column.
+  - If the feature `columnMappingUsageTracking` is supported and the table property `delta.columnMapping.hasDroppedOrRenamed` is `false`, then when adding a new column to a table the (logical) name of the column should be used as the physical name.
+  - Otherwise the physical column name must contain a universally unique identifier (UUID) to guarantee uniqueness.
+- Assign a column id to each column. The maximum id that is assigned to a column is tracked as the table property `delta.columnMapping.maxColumnId`. This is an internal table property that cannot be configured by users. This value must increase monotonically as new columns are introduced and committed to the table alongside the introduction of the new columns to the schema.
+
+--------
+
+> New subsection at the end of the `Writer Requirements for Column Mapping` subsection
+
+### Writer Requirements for Usage Tracking
+
+In order to support column mapping usage tracking, writers must:
+- Write `protocol` and `metaData` actions when Column Mapping Usage Tracking is turned on for the first time:
+  - Write a `protocol` action with writer version 7 and the feature `columnMappingUsageTracking` in the `writerFeatures`.
+  - Write a `metaData` action with the table property `delta.columnMapping.hasDroppedOrRenamed` set to `false` when creating a new table or enabling the feature on an existing table without column mapping enabled, and set to `true` when enabling usage tracking on an existing table with column mapping enabled.
+- When dropping or renaming a column, `delta.columnMapping.hasDroppedOrRenamed` must be set to `true`.
+- Once `delta.columnMapping.hasDroppedOrRenamed` is set to `true`, it must never be set back to `false`.
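
The writer requirements proposed in this RFC can be illustrated with a small in-memory sketch. It is not taken from any Delta implementation: the `Table` class, the helper functions, and the `col-<uuid>` naming pattern are hypothetical, and only the property and feature names come from the RFC text. Table properties are modelled as strings here, which is also an assumption.

```python
import uuid
from dataclasses import dataclass, field

# Names taken from the RFC text.
FEATURE = "columnMappingUsageTracking"
HAS_DROPPED_OR_RENAMED = "delta.columnMapping.hasDroppedOrRenamed"
MAX_COLUMN_ID = "delta.columnMapping.maxColumnId"


@dataclass
class Table:
    """Hypothetical in-memory stand-in for a table's protocol and metadata."""
    writer_features: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)
    columns: dict = field(default_factory=dict)  # logical name -> (id, physical name)


def add_column(table: Table, logical_name: str) -> str:
    """Add a column and return the physical name chosen for it."""
    tracking = FEATURE in table.writer_features
    untouched = table.properties.get(HAS_DROPPED_OR_RENAMED) == "false"
    if tracking and untouched:
        # No drop or rename has happened yet, so the logical name is still unique.
        physical = logical_name
    else:
        # Assumed naming pattern; the RFC only requires that the name contains a UUID.
        physical = f"col-{uuid.uuid4()}"
    # Column ids increase monotonically and the maximum is tracked as a property.
    next_id = int(table.properties.get(MAX_COLUMN_ID, "0")) + 1
    table.properties[MAX_COLUMN_ID] = str(next_id)
    table.columns[logical_name] = (next_id, physical)
    return physical


def _mark_dropped_or_renamed(table: Table) -> None:
    # The flag only ever moves from "false" to "true", never back.
    if FEATURE in table.writer_features:
        table.properties[HAS_DROPPED_OR_RENAMED] = "true"


def drop_column(table: Table, logical_name: str) -> None:
    del table.columns[logical_name]
    _mark_dropped_or_renamed(table)


def rename_column(table: Table, old_name: str, new_name: str) -> None:
    # The physical name and id are kept; only the logical name changes.
    table.columns[new_name] = table.columns.pop(old_name)
    _mark_dropped_or_renamed(table)
```

In this sketch, a column added before any drop or rename keeps its logical name as its physical name, while a column added afterwards receives a UUID-based physical name, so re-adding a previously dropped column name cannot collide with data already written under that name.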