
No efficient way to load a subset of files from partitioned table #8906

Open
rspears74 opened this issue Jan 18, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@rspears74

Is your feature request related to a problem or challenge?

As far as I can tell, there is no good way to load a subset of files from a partitioned table. Using ListingTable or another TableProvider such as DeltaTableProvider from deltalake, I'm able to read_table, but this loads the entire table. I can also load a list of parquet files with read_parquet, but that doesn't work with partitioned tables when the partition values are not "materialized" columns in the raw parquet. The only way I've found to load a subset of partitioned files is to iterate over a list of file paths, run the entire TableProvider/read_table process on each one individually, and union the results together.

Describe the solution you'd like

It seems like it would be nice to be able to create a TableProvider with a table path, then pass some sort of file "whitelist" in. Maybe a read_table_files(TableProvider, impl IntoIterator<Item = String>).

Describe alternatives you've considered

As stated above, I've tried reading the files one-by-one and unioning results, but it's shockingly inefficient compared to reading all files at once.

Additional context

No response

@rspears74 added the enhancement label Jan 18, 2024
@tustvold
Contributor

tustvold commented Jan 19, 2024

If your query contains a predicate, it will be used to prune partitions from the scan. So if you're partitioning on a column and specify an equality predicate on that column, only files in that partition will be read. This should also hold for more complex predicates.

@rspears74
Author

My use case is incremental processing of an append-only delta table:

  1. My source table is updated.
  2. I inspect the delta log to see which files were added to the table.
  3. Process those files.

@tustvold
Contributor

I suspect this is glossing over a lot of details of what delta-rs is doing, e.g. schema coercion, etc... but if you have already identified the parquet files in question, it should just be a case of constructing a ParquetExec (https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html) with the relevant FileScanConfig (https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.FileScanConfig.html), including any partitioning.

As you are effectively side-stepping the catalog implementation, in this case delta-rs, you likely will need to either get the functionality added in there, or reimplement it yourself somehow.

@rspears74
Author

Another important thing is I don't actually need to use delta-rs. If there were some way to use a generic ListingTable, that would work as well. read_parquet works great, but it doesn't do partition discovery and resolution the way read_table/ListingTable does. I suppose by constructing the two structs you mentioned, I could implement something like that?

@rspears74
Author

rspears74 commented Jan 22, 2024

I've been looking into this and running into two issues:

  1. I'm unsure how to actually use a ParquetExec after I've defined it.
  2. As an alternative, I am trying to essentially copy the implementation of ListingTable and swap out the FileScanConfig used in its scan method, but I'm hitting private mods and methods and having to duplicate a bunch of code, and the duplication is getting out of hand.

If I instead knew how to use a ParquetExec, I wouldn't have to deal with issue 2.

@alamb
Contributor

alamb commented Jan 26, 2024

And 2: Instead, I am trying to essentially copy the implementation of ListingTable and swap out the FileScanConfig used in it's scan method, but I'

Yeah, the current design of ListingTable isn't really set up to be modular enough to swap out the way it determines files, etc. I think making your own version is probably the way to go.

Perhaps this can offer some inspiration: https://github.com/apache/arrow-datafusion/blob/fc752557204f4b52ab4cb38b5caff99b1b73b902/datafusion/core/tests/parquet/schema_coercion.rs#L62-L75

@rspears74
Author

I've figured out that I don't actually need this right now (I can do everything without touching the partition columns in my table), but it still stands as a feature request I'd love to see at some point! Thanks @alamb and @tustvold

@alamb
Contributor

alamb commented Jan 27, 2024

I've figured out that I don't actually need this right now (I can do everything without touching the partition columns in my table), but it still stands as a feature request I'd love to see at some point! Thanks @alamb and @tustvold

Glad to hear you got it to work.

I think splitting ListingTable apart a bit more, so it can be customized, would help toward this goal.

@alamb
Contributor

alamb commented Jan 12, 2025

Another approach, if we supported metadata columns, would be to expose the filename as a metadata column from a listing table and then implement filter pushdown on it:

SELECT ...
FROM my_listing_table
WHERE filename IN ('foo.parquet', 'bar.parquet')

Or something like that 🤔
