
No efficient way to load a subset of files from partitioned table #8906

Open
rspears74 opened this issue Jan 18, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@rspears74

Is your feature request related to a problem or challenge?

As far as I can tell, there is no good way to load a subset of files from a partitioned table. Using ListingTable or another TableProvider such as DeltaTableProvider from deltalake, I'm able to read_table, but this loads the entire table. I can also load a list of parquet files with read_parquet, but that doesn't work with partitioned tables when the partition values are not "materialized" columns in the raw parquet. The only way I've found to load a subset of partitioned files is to iterate over a list of file paths, run the entire TableProvider/read_table process on each one individually, and union the results together.

Describe the solution you'd like

It seems like it would be nice to be able to create a TableProvider with a table path, then pass some sort of file "whitelist" in. Maybe a read_table_files(TableProvider, impl IntoIterator<Item = String>).

Describe alternatives you've considered

As stated above, I've tried reading the files one-by-one and unioning results, but it's shockingly inefficient compared to reading all files at once.

Additional context

No response

@rspears74 added the enhancement label Jan 18, 2024
@tustvold
Contributor

tustvold commented Jan 19, 2024

If your query contains a predicate, it will be used to prune partitions from the scan. So if you're partitioning on a column and specify an equality predicate on that column, only files in that partition will be read. This should also hold for more complex predicates.

@rspears74
Author

My use case is incremental processing of an append-only delta table:

  1. My source table is updated.
  2. I inspect the delta log to see which files were added to the table.
  3. Process those files.

@tustvold
Contributor

I suspect this is glossing over a lot of details of what delta-rs is doing, e.g. schema coercion, etc... but if you have already identified the parquet files in question, it should just be a case of constructing a ParquetExec (https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html) with the relevant FileScanConfig (https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html), including any partitioning.

As you are effectively side-stepping the catalog implementation, in this case delta-rs, you likely will need to either get the functionality added in there, or reimplement it yourself somehow.

@rspears74
Author

Another important thing is I don't actually need to use delta-rs. If there were some way to use a generic ListingTable, that would work as well. read_parquet works great, but it doesn't do partition discovery and resolution the way read_table/ListingTable does. I suppose by constructing the two structs you mentioned, I could implement something like that?

@rspears74
Author

rspears74 commented Jan 22, 2024

I've been looking into this and running into two issues:

  1. I'm unsure how to actually use a ParquetExec after I've defined it.
  2. As an alternative, I am trying to essentially copy the implementation of ListingTable and swap out the FileScanConfig used in its scan method, but I'm hitting private mods and methods and having to duplicate a bunch of code, and the duplication is getting out of hand.

If I instead knew how to use a ParquetExec, I wouldn't have to deal with issue 2.

@alamb
Contributor

alamb commented Jan 26, 2024

And 2: Instead, I am trying to essentially copy the implementation of ListingTable and swap out the FileScanConfig used in it's scan method, but I'

Yeah, the current design of ListingTable isn't really set up to be modular enough to swap out the way it determines files, etc. I think making your own version is probably the way to go.

Perhaps this can offer some inspiration: https://github.com/apache/arrow-datafusion/blob/fc752557204f4b52ab4cb38b5caff99b1b73b902/datafusion/core/tests/parquet/schema_coercion.rs#L62-L75

@rspears74
Author

I've figured out that I don't actually need this right now (I can do everything without touching the partition columns in my table), but it still stands as a feature request I'd love to see at some point! Thanks @alamb and @tustvold

@alamb
Contributor

alamb commented Jan 27, 2024

I've figured out that I don't actually need this right now (I can do everything without touching the partition columns in my table), but it still stands as a feature request I'd love to see at some point! Thanks @alamb and @tustvold

Glad to hear you got it to work.

I think splitting ListingTable apart a bit more, so it can be customized, would help toward this goal.

@alamb
Contributor

alamb commented Jan 12, 2025

Another approach, if we supported metadata columns, would be to expose the filename as a metadata column from a listing table and then implement filter pushdown on it:

SELECT ...
FROM my_listing_table
WHERE filename IN ('foo.parquet', 'bar.parquet')

Or something like that 🤔
