Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip column mapping metadata when feature is disabled #3688

Merged
merged 1 commit into from
Sep 19, 2024

Conversation

cstavr
Copy link
Contributor

@cstavr cstavr commented Sep 19, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Transactions might try to create or update the schema of a Delta table with columns that contain column mapping metadata, even when column mapping is not enabled. For example, this can happen when transactions copy the schema from another table without stripping metadata.

To avoid such issues, we automatically strip column mapping metadata when column mapping is disabled. We are doing this only for new tables or for transactions that add column mapping metadata for the first time. If column metadata already exist, we cannot strip them because this would break the table. A usage log is emitted so we can understand the impact on existing tables.

Note that this change covers the cases where txn.updateMetadata is called (the "proper API") and not the cases where a Metadata action is directly committed to the table.

Finally, this commit changes drop column mapping command to check that all column mapping metadata do not exist, and not only physical column name and ID.

How was this patch tested?

Added new UT.

Does this PR introduce any user-facing changes?

No.

@vkorukanti vkorukanti merged commit d07a7bd into delta-io:master Sep 19, 2024
14 of 17 checks passed
scottsand-db pushed a commit that referenced this pull request Feb 19, 2025
…s already present (#4167)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

As effect of earlier bugs (e.g. fixed in
#3487) there can exists tables
where column mapping is disabled, but there is column mapping metadata
on the table. Enabling column mapping metadata on such a table could
lead to unexpected corruption. Simply stripping such metadata could also
lead to curruptions, as the invalid metadata can be already used in
other places (e.g. column statistics) via
DeltaColumnMapping.getPhysicalName, which returns the name from the
metadata even when column mapping is disabled.

After #3688 it should no longer be
possible to end up with tables having such invalid metadata, so the
issue only concerns existing tables created before that fix.

To avoid corruption, we want to disallow enabling column mapping on such
tables.

## How was this patch tested?

Added tests to DeltaColumnMappingSuite.

## Does this PR introduce _any_ user-facing changes?

No.
We are disallowing an operation on tables that would lead to Delta table
corruption on tables that are already in an invalid state entering which
is fixed already, so it can only concern old tables in the wild.

---------

Co-authored-by: Julek Sompolski <Juliusz Sompolski>
anoopj pushed a commit to anoopj/delta that referenced this pull request Feb 24, 2025
…s already present (delta-io#4167)

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

As effect of earlier bugs (e.g. fixed in
delta-io#3487) there can exists tables
where column mapping is disabled, but there is column mapping metadata
on the table. Enabling column mapping metadata on such a table could
lead to unexpected corruption. Simply stripping such metadata could also
lead to curruptions, as the invalid metadata can be already used in
other places (e.g. column statistics) via
DeltaColumnMapping.getPhysicalName, which returns the name from the
metadata even when column mapping is disabled.

After delta-io#3688 it should no longer be
possible to end up with tables having such invalid metadata, so the
issue only concerns existing tables created before that fix.

To avoid corruption, we want to disallow enabling column mapping on such
tables.

## How was this patch tested?

Added tests to DeltaColumnMappingSuite.

## Does this PR introduce _any_ user-facing changes?

No.
We are disallowing an operation on tables that would lead to Delta table
corruption on tables that are already in an invalid state entering which
is fixed already, so it can only concern old tables in the wild.

---------

Co-authored-by: Julek Sompolski <Juliusz Sompolski>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants