No efficient way to load a subset of files from partitioned table #8906
Comments
If your query contains a predicate, it will be used to prune partitions from the scan. So if you're partitioning on a column and specify an equality predicate on that column, only the files in that partition will be read. This should also hold for more complex predicates.
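To make the pruning idea concrete, here is a minimal std-only sketch. It is not DataFusion's implementation: it assumes hive-style `key=value` directory segments, and the helper names `partition_values` and `prune_by_equality` are hypothetical, introduced only for illustration.

```rust
use std::collections::HashMap;

/// Parse hive-style `key=value` directory segments from a file path.
/// The file name itself is ignored, so an `=` in the file name is not
/// mistaken for a partition value.
fn partition_values(path: &str) -> HashMap<&str, &str> {
    let dirs = path.rsplit_once('/').map_or("", |(dirs, _file)| dirs);
    dirs.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Keep only the files whose partition column `column` equals `value`.
/// Files without that partition column are dropped.
fn prune_by_equality<'a>(files: &[&'a str], column: &str, value: &str) -> Vec<&'a str> {
    files
        .iter()
        .copied()
        .filter(|&path| partition_values(path).get(column).copied() == Some(value))
        .collect()
}
```

With an equality predicate on the partition column, only the paths under the matching directory survive, so the scan never opens the other files.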
My use case is incremental processing of an append-only delta table:
I suspect this is glossing over a lot of details of what delta-rs is doing, e.g. schema coercion, but if you have already identified the parquet files in question, it should just be a case of constructing a ParquetExec (https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html) with the relevant FileScanConfig (https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html), including any partitioning. As you are effectively side-stepping the catalog implementation, in this case delta-rs, you will likely need to either get the functionality added there or reimplement it yourself somehow.
Another important thing is I don't actually need to use delta-rs. If there were some way to use a generic |
I've been looking into this, and running into two issues: 1: Unsure of how to actually use a |
Yeah, the current design of ListingTable isn't really set up at the moment to be modular enough to swap out the way it determines files, etc. I think making your own version is probably the way to go. Perhaps this can offer some inspiration: https://github.com/apache/arrow-datafusion/blob/fc752557204f4b52ab4cb38b5caff99b1b73b902/datafusion/core/tests/parquet/schema_coercion.rs#L62-L75
Glad to hear you got it to work. I think splitting ListingTable up a bit more, so that it can be customized more easily, would help towards this goal.
Another approach, if we support metadata columns here, would be to expose the filename as a metadata column from a listing table and then implement filter pushdown on it:
SELECT ...
FROM my_listing_table
WHERE filename IN ('foo.parquet', 'bar.parquet')
Or something like that 🤔
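As a rough illustration of what a pushed-down `filename IN (...)` filter (or the file "whitelist" proposed in the issue) could reduce to at the file-listing stage, here is a std-only sketch. Everything in it is an assumption for illustration: `SelectedFile`, `partitions_of`, and `select_files` are hypothetical names, not DataFusion or delta-rs APIs, and hive-style `key=value` directories are assumed.

```rust
use std::collections::HashSet;

/// A file chosen for scanning, with the partition values recovered from
/// its path (they are not stored inside the parquet file itself).
#[derive(Debug, PartialEq)]
struct SelectedFile {
    path: String,
    partitions: Vec<(String, String)>,
}

/// Recover `key=value` partition pairs from the directory part of a path.
fn partitions_of(path: &str) -> Vec<(String, String)> {
    let dirs = path.rsplit_once('/').map_or("", |(dirs, _file)| dirs);
    dirs.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// Keep only the listed files whose file name is in `allowed`, and attach
/// the partition values so a single scan can still populate those columns.
fn select_files(listed: &[&str], allowed: &[&str]) -> Vec<SelectedFile> {
    let allowed: HashSet<&str> = allowed.iter().copied().collect();
    listed
        .iter()
        .copied()
        .filter(|path| {
            // The file name is the last `/`-separated component.
            path.rsplit('/')
                .next()
                .map_or(false, |name| allowed.contains(name))
        })
        .map(|path| SelectedFile {
            path: path.to_string(),
            partitions: partitions_of(path),
        })
        .collect()
}
```

The point of the design is that the selected files, together with their partition values, could be handed to one scan instead of running a separate read per file and unioning the results.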
Is your feature request related to a problem or challenge?
As far as I can tell, there is no good way to load a subset of files from a partitioned table. Using ListingTable or another TableProvider like DeltaTableProvider from deltalake, I'm able to read_table, but this loads the entire table. I can also load a list of parquet files with read_parquet, but this doesn't work with partitioned tables if the partitions are not "materialized" columns in the raw parquet. The only way I've found to load partitioned files is by iterating over a list of file paths, doing the entire TableProvider/read_table process on each one individually, and union-ing the results together.
Describe the solution you'd like
It seems like it would be nice to be able to create a TableProvider with a table path, then pass some sort of file "whitelist" in. Maybe a read_table_files(TableProvider, impl IntoIterator<Item = String>).
Describe alternatives you've considered
As stated above, I've tried reading the files one-by-one and union-ing results, but it's shockingly inefficient compared to reading all files at once.
Additional context
No response