Rethink exposing partition_filters as part of the public facing API #1894

Open
MrPowers opened this issue Nov 21, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@MrPowers
Contributor

Description

We're currently exposing partition_filters as part of the public-facing API for some methods.

For example, compact() has an optional partition_filters argument.

Let's compare this with the PySpark API:

deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

I think the PySpark API is a lot better from a usability perspective because the user doesn't need to know about the underlying partitioning of the data.

I think the user should be able to specify what data they would like to be compacted. Delta Lake should be smart enough to determine if that means compacting the files in a given partition or running a filtering query and determining the files that need compaction.
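To make the proposed behaviour concrete, here is a minimal sketch of the decision described above. It is not the deltalake implementation: `plan_compaction`, the `Condition` tuple format, and the returned strategy names are all hypothetical, chosen only to illustrate that the library, rather than the user, could work out whether a predicate maps onto partitions.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: a conjunction of (column, operator, value) conditions.
Condition = Tuple[str, str, str]


def plan_compaction(conditions: Iterable[Condition], partition_columns: Set[str]) -> str:
    """Return how the files to compact could be selected for this predicate.

    Illustrative only: the strategy names are assumptions, not library constants.
    """
    # Collect every column the predicate refers to.
    referenced = {column for column, _op, _value in conditions}
    if not referenced:
        # No predicate at all: every file is a candidate.
        return "all_files"
    if referenced <= partition_columns:
        # Only partition columns are involved, so whole partitions can be selected.
        return "partition_pruning"
    # At least one non-partition column: the data (or file statistics) must be filtered.
    return "file_scan"
```

With a table partitioned by `date`, `plan_compaction([("date", "=", "2021-11-18")], {"date"})` selects whole partitions, while the same predicate against a table partitioned by some other column falls back to filtering. The caller writes the same predicate in both cases.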

@MrPowers MrPowers added the enhancement New feature or request label Nov 21, 2023
@ion-elgreco
Collaborator

We should indeed not make users think about the partitioning structure. I think the partition filter for the pyarrow writer was mainly there because pyarrow was used. With MERGE we use DataFusion, and there we properly pass predicates.

Also, I think it's more Pythonic to have an optional parameter called predicate instead of another method. We also do that in TableMerger. In the new Rust engine binding I am also exposing a predicate parameter, but only as string input.
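As a rough illustration of the optional-parameter style suggested here, the sketch below uses a toy class. `ToyTable` and its return strings are hypothetical and are not part of the deltalake API; the point is only the shape of the signature, with `predicate` as an optional string.

```python
from typing import Optional


class ToyTable:
    """Hypothetical stand-in for a table object; not the deltalake API."""

    def compact(self, predicate: Optional[str] = None) -> str:
        # With no predicate, the whole table is a candidate for compaction.
        if predicate is None:
            return "compact all files"
        # Otherwise the predicate string is passed through as-is.
        return f"compact files matching: {predicate}"
```

Compared with a builder chain such as `optimize().where(...).executeCompaction()`, this keeps the call to a single method: `ToyTable().compact(predicate="date = '2021-11-18'")`.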

I do wonder, @MrPowers does the optimize operation work if you pass a predicate that is not based on the partitioning structure?

@ion-elgreco ion-elgreco added this to the python v0.20 milestone Nov 26, 2023
roeap added a commit that referenced this issue Dec 11, 2023
~~based on #1807~~

# Description

In the effort to advance protocol support and move our internal APIs
closer to the kernel library, it is advantageous to leverage the
expression handling logic from kernel specifically for filtering actions
etc.

This PR just adds the expression definitions and evaluation logic.
Integrating it with our current codebase and basing the existing
partition handling logic on this is left for follow-up PRs to keep things
review-able.

related: #1894, #1776
@rtyler rtyler removed this from the python v0.20 milestone Oct 8, 2024

3 participants