Support OPTIMIZE on Delta tables with DVs #1578

Closed
wants to merge 1 commit into from


vkorukanti
Collaborator

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

This PR adds support for running OPTIMIZE (file compaction or Z-Order By) on Delta tables with deletion vectors. It changes the following:

  • Selection criteria
    • File compaction: previously, only files smaller than optimize.minFileSize were selected for compaction. Now the ratio of deleted rows in a file is also considered: a file whose deleted-rows ratio is above optimize.maxDeletedRowsRatio (default 0.05) is selected as well, and compacting it removes its DV.
    • Z-Order: unchanged. All files in the selected partitions are always selected, so if a file has a DV, the DV is removed as part of Z-Order By.
  • Reading selected files with DVs for OPTIMIZE: the files go through the same read path as a regular Delta table read, which removes the deleted rows (according to the DV) from the scan output.
  • Metrics for deleted DVs
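
To make the file-compaction criteria and the DV-aware read concrete, here is a minimal Python sketch. It is not the code in this PR: `DataFile`, `select_for_compaction`, and `live_rows` are hypothetical names, the 1 GiB threshold is an arbitrary illustrative value, and a DV is modeled as a plain collection of deleted row indexes. Only the two conditions (size below the minimum, deleted-rows ratio above the maximum, default 0.05) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry in the table."""
    path: str
    size_bytes: int
    num_records: int
    num_deleted_records: int = 0  # rows marked deleted by the file's DV

    @property
    def deleted_rows_ratio(self) -> float:
        # An empty file has no deleted rows to account for.
        if self.num_records == 0:
            return 0.0
        return self.num_deleted_records / self.num_records


def select_for_compaction(files, min_file_size, max_deleted_rows_ratio=0.05):
    """Pick files that are too small OR have too many deleted rows."""
    return [
        f for f in files
        if f.size_bytes < min_file_size
        or f.deleted_rows_ratio > max_deleted_rows_ratio
    ]


def live_rows(rows, deleted_row_indexes):
    """Drop the rows a DV marks as deleted, as a scan of the file would."""
    deleted = set(deleted_row_indexes)
    return [row for i, row in enumerate(rows) if i not in deleted]


GIB = 1024 ** 3
files = [
    DataFile("a.parquet", 10 * 1024 ** 2, 1000),   # small file
    DataFile("b.parquet", 2 * GIB, 1000, 100),     # 10% of rows deleted
    DataFile("c.parquet", 2 * GIB, 1000, 10),      # 1% of rows deleted
]
# min_file_size here is an assumed value chosen only for the example.
selected = select_for_compaction(files, min_file_size=GIB)
```

In this sketch `a.parquet` qualifies by size and `b.parquet` by deleted-rows ratio, while `c.parquet` is left alone. Rewriting the selected files from the output of `live_rows` is what drops their DVs.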

Added tests.

GitOrigin-RevId: b64d8beec8278e6665813642753ef0a19af5c985

Contributor

@larsk-db larsk-db left a comment

LGTM

@vkorukanti vkorukanti deleted the dv-optimize branch October 2, 2023 05:20