Rethink exposing partition_filters as part of the public facing API #1894

MrPowers · 2023-11-21T13:00:56Z

Description

We're currently exposing partition_filters as part of the public-facing API for some methods.

For example, compact() has an optional partition_filters argument.

Let's compare this with the PySpark API:

deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

I think the PySpark API is a lot better from a usability perspective because the user doesn't need to know about the underlying partitioning of the data.

I think the user should be able to specify what data they would like to be compacted. Delta Lake should be smart enough to determine if that means compacting the files in a given partition or running a filtering query and determining the files that need compaction.
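To make the idea concrete, here is a minimal sketch of the decision described above, for a single equality predicate. It is a toy model, not how delta-rs or Delta Lake actually plans compaction: `FileEntry`, `files_to_compact`, and the shape of the `stats` dict are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Hypothetical stand-in for a data file tracked by the table log."""
    path: str
    partition_values: dict                      # e.g. {"date": "2021-11-18"}
    stats: dict = field(default_factory=dict)   # column -> (min, max), may be empty


def files_to_compact(files, column, value, partition_columns):
    """Pick candidate files for the equality predicate `column = value`."""
    if column in partition_columns:
        # Predicate on a partition column: the partition values alone decide.
        return [f.path for f in files if f.partition_values.get(column) == value]

    # Predicate on a regular column: fall back to min/max statistics.
    # A file without statistics cannot be ruled out, so it stays a candidate.
    selected = []
    for f in files:
        bounds = f.stats.get(column)
        if bounds is None or bounds[0] <= value <= bounds[1]:
            selected.append(f.path)
    return selected


files = [
    FileEntry("a.parquet", {"date": "2021-11-18"}, {"id": (1, 10)}),
    FileEntry("b.parquet", {"date": "2021-11-19"}, {"id": (11, 20)}),
    FileEntry("c.parquet", {"date": "2021-11-19"}),
]

by_partition = files_to_compact(files, "date", "2021-11-18", ["date"])
by_stats = files_to_compact(files, "id", 15, ["date"])
```

In both cases the caller only states which data they care about. Whether that resolves to a partition lookup or a statistics-based file selection is decided internally.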


ion-elgreco · 2023-11-25T14:58:29Z

We should indeed not make users think about the partitioning structure. I think the partition filter for the pyarrow writer was mainly there because pyarrow was used. With MERGE we use DataFusion, and there we properly pass predicates.

Also, I think it's more Pythonic to have an optional parameter called predicate instead of another method. We also do that in TableMerger. In the new Rust engine binding I am also exposing a predicate parameter, but only as a string input.
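As a rough illustration of how a string predicate could subsume the existing argument, the sketch below converts a list of partition filter tuples into a SQL-like predicate string. It assumes the filters are `(column, op, value)` tuples that are ANDed together and that all values are strings. The helper name `partition_filters_to_predicate` is hypothetical and not part of any library.

```python
from typing import List, Tuple, Union

# Assumed filter shape: (column, operator, value or list of values).
PartitionFilter = Tuple[str, str, Union[str, List[str]]]


def _quote(value: str) -> str:
    # Quote a string literal, doubling embedded single quotes.
    return "'" + value.replace("'", "''") + "'"


def partition_filters_to_predicate(filters: List[PartitionFilter]) -> str:
    """Join the filter tuples into one AND-combined predicate string."""
    clauses = []
    for column, op, value in filters:
        if op in ("in", "not in"):
            # Set membership: render the values as a parenthesised list.
            items = ", ".join(_quote(v) for v in value)
            clauses.append(f"{column} {op} ({items})")
        else:
            clauses.append(f"{column} {op} {_quote(value)}")
    return " AND ".join(clauses)


single = partition_filters_to_predicate([("date", "=", "2021-11-18")])
combined = partition_filters_to_predicate(
    [("year", "in", ["2022", "2023"]), ("country", "!=", "US")]
)
```

With a helper along these lines, an existing partition_filters call could be mapped onto an optional predicate parameter, so users would only ever need to write the predicate string.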

@MrPowers, I do wonder: does the optimize operation work if you pass a predicate that is not based on the partitioning structure?

~~based on #1807~~

Description

In the effort to advance protocol support and move our internal APIs closer to the kernel library, it is advantageous to leverage the expression handling logic from kernel, specifically for filtering actions etc. This PR just adds the expression definitions and evaluation logic. Integrating it with our current codebase and basing the existing partition handling logic on it is left for follow-up PRs to keep things reviewable.

related: #1894, #1776

MrPowers added the enhancement New feature or request label Nov 21, 2023

ion-elgreco added this to the python v0.20 milestone Nov 26, 2023

roeap mentioned this issue Dec 3, 2023

feat: add kernel ExpressionEvaluator #1829

Merged

rtyler removed this from the python v0.20 milestone Oct 8, 2024
