We should indeed not have users think about the partitioning structure. I think the partition filter for the pyarrow writer was mainly there because pyarrow was being used; with MERGE we use DataFusion, where we properly pass predicates.
Also, I think it's more Pythonic to have an optional parameter called `predicate` instead of another method; we do the same in `TableMerger`. In the new Rust engine binding I am also exposing a `predicate` parameter, but only as string input.
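To make the suggestion concrete, here is a minimal, self-contained sketch of that API shape: a single `compact()` entry point with an optional `predicate` string, mirroring how `TableMerger` takes predicates. All names (`TableOptimizer`, the file-dict layout, the naive predicate parsing) are illustrative assumptions, not the library's actual implementation.

```python
from typing import List, Optional

class TableOptimizer:
    """Hypothetical sketch: compact() with an optional predicate string."""

    def __init__(self, files: List[dict]):
        # Each entry: {"path": ..., "partition_values": {...}}
        self._files = files

    def compact(self, predicate: Optional[str] = None) -> List[str]:
        """Return the file paths selected for compaction.

        No predicate: every file is a candidate. With a predicate, only
        files whose partition values satisfy it are selected (a real
        engine would push the predicate down to the query planner
        rather than string-match like this toy parser does).
        """
        if predicate is None:
            return [f["path"] for f in self._files]
        column, _, value = predicate.partition("=")
        column, value = column.strip(), value.strip().strip("'")
        return [
            f["path"]
            for f in self._files
            if f["partition_values"].get(column) == value
        ]

files = [
    {"path": "date=2021-11-18/a.parquet", "partition_values": {"date": "2021-11-18"}},
    {"path": "date=2021-11-19/b.parquet", "partition_values": {"date": "2021-11-19"}},
]
opt = TableOptimizer(files)
print(opt.compact())                       # all files are candidates
print(opt.compact("date = '2021-11-18'"))  # only the matching partition
```

The point of the sketch is the signature: one method, one optional string, no second method and no tuple-based partition filter in the user's face.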
I do wonder, @MrPowers: does the optimize operation work if you pass a predicate that is not based on the partitioning structure?
~~based on #1807~~
# Description
In the effort to advance protocol support and move our internal APIs
closer to the kernel library, it is advantageous to leverage the
expression handling logic from kernel, specifically for filtering
actions etc.
This PR just adds the expression definitions and evaluation logic.
Integrating it with our current codebase and basing the existing
partition handling logic on it is left for follow-up PRs to keep things
reviewable.
related: #1894, #1776
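For readers unfamiliar with the shape of such an API, here is a minimal sketch of what "expression definitions plus evaluation logic" can look like: a tiny expression tree evaluated against an action's partition values. This is an illustrative Python analogue under assumed names, not the kernel library's actual (Rust) types.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union

@dataclass
class Column:
    name: str

@dataclass
class Literal:
    value: Any

@dataclass
class BinaryOp:
    op: str  # "=", "and", "or"
    left: "Expr"
    right: "Expr"

Expr = Union[Column, Literal, BinaryOp]

def evaluate(expr: Expr, row: Dict[str, Any]) -> Any:
    """Recursively evaluate an expression tree against one row/action."""
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    left, right = evaluate(expr.left, row), evaluate(expr.right, row)
    if expr.op == "=":
        return left == right
    if expr.op == "and":
        return left and right
    if expr.op == "or":
        return left or right
    raise ValueError(f"unsupported op: {expr.op}")

# Filtering "add file" actions by partition value, as a log replay might:
pred = BinaryOp("=", Column("date"), Literal("2021-11-18"))
actions = [
    {"path": "a.parquet", "date": "2021-11-18"},
    {"path": "b.parquet", "date": "2021-11-19"},
]
kept = [a["path"] for a in actions if evaluate(pred, a)]
print(kept)  # ['a.parquet']
```

Once partition handling is built on top of such expressions, the same evaluation path serves partition pruning, `MERGE` predicates, and optimize filters alike.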
# Description
We're currently exposing `partition_filters` as part of the public-facing API for some methods. For example, `compact()` has an optional `partition_filters` argument. Let's compare this with the PySpark API:
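The original comparison snippet did not survive the page; the sketch below reconstructs the two call styles as I understand them. Neither function is invoked here, both require a live table (and, for PySpark, a SparkSession with Delta Lake configured), and the exact signatures reflect my reading of the libraries at the time of this discussion, so treat them as assumptions.

```python
def compact_with_delta_rs(table_uri: str) -> None:
    # delta-rs style: the caller must already know the table's
    # partition columns to scope the compaction.
    from deltalake import DeltaTable
    dt = DeltaTable(table_uri)
    dt.optimize.compact(partition_filters=[("date", "=", "2021-11-18")])

def compact_with_pyspark(spark, table_path: str) -> None:
    # PySpark style: an arbitrary predicate string; the engine works
    # out which files it covers.
    from delta.tables import DeltaTable
    delta_table = DeltaTable.forPath(spark, table_path)
    delta_table.optimize().where("date = '2021-11-18'").executeCompaction()
```

The difference is who carries the partitioning knowledge: the user (delta-rs) or the engine (PySpark).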
I think the PySpark API is a lot better from a usability perspective because the user doesn't need to know about the underlying partitioning of the data.
I think the user should be able to specify what data they would like to be compacted. Delta Lake should be smart enough to determine if that means compacting the files in a given partition or running a filtering query and determining the files that need compaction.
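One way to sketch that "be smart about it" routing: inspect which columns the predicate references; if they are all partition columns, compaction can prune whole partitions, otherwise it must consult file-level statistics to find the files the predicate touches. Everything below (the function names, the regex "parser") is a hypothetical illustration, not how Delta Lake actually plans this.

```python
import re
from typing import Set

def referenced_columns(predicate: str, known_columns: Set[str]) -> Set[str]:
    # Naive tokenizer: treat every identifier-shaped token that matches
    # a known column name as a column reference. Illustrative only.
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", predicate))
    return tokens & known_columns

def compaction_strategy(predicate: str, partition_columns: Set[str],
                        all_columns: Set[str]) -> str:
    cols = referenced_columns(predicate, all_columns)
    if cols <= partition_columns:
        # Predicate touches only partition columns: prune partitions.
        return "prune-partitions"
    # Otherwise, fall back to filtering files via their statistics.
    return "filter-files-by-stats"

partition_cols = {"date"}
all_cols = {"date", "user_id", "amount"}
print(compaction_strategy("date = '2021-11-18'", partition_cols, all_cols))
# prune-partitions
print(compaction_strategy("user_id = 42", partition_cols, all_cols))
# filter-files-by-stats
```

With routing like this, the user supplies one predicate and never needs to know whether it happens to line up with the partitioning.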