-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(python): Use polars parquet reader for delta scan #19103
refactor(python): Use polars parquet reader for delta scan #19103
Conversation
Nice, I didn't know we could bring our own readers. |
Absolutely we can, this is actually encouraged. Polars parquet reader is also lots faster than pyarrow. The only thing is, this intermediate stage will be bound to same protocol support as the pyarrow scanner. At some point we need to finish a full native reader polars and delta-kernel-rs, some preliminary work I did here: https://github.com/ion-elgreco/polars-deltalake/tree/feat/delta_io_plugin Just hard to find time nowadays. A dev from the core team who is working on parquet could probably do this easier since that dev is deep into the polars rust code ^^ |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19103 +/- ##
==========================================
- Coverage 79.91% 79.50% -0.42%
==========================================
Files 1536 1547 +11
Lines 211659 213602 +1943
Branches 2445 2451 +6
==========================================
+ Hits 169156 169830 +674
- Misses 41948 43218 +1270
+ Partials 555 554 -1 ☔ View full report in Codecov by Sentry. |
@ritchie46 one test fails on windows: https://github.com/pola-rs/polars/actions/runs/11192235502/job/31116136984?pr=19103#step:11:202 when it encounters a hive path, I am not seeing this on linux though |
@nameexhaustion can you take a look at this one later? |
The Windows test should be fixed after a rebase on main |
Great, I'm currently out so will take a look at it somewhere mid November |
Am I correct that this doesn't use the delta log for partition pruning? It seems we ought to go through the io plugin framework and then use I haven't worked with that io plugin framework to figure out how to avoid the following being a two step process. For instance suppose what I want is df = scan_delta(dt).filter(pl.col('node_id')==something).collect() I can two step that like this files = pl.from_arrow(dt.get_add_actions()).filter((pl.col('min').struct.field('node_id')<=something) & (pl.col('max').struct.field('node_id')>=something))['path']
df=pl.scan_parquet([dt.table_uri+'/'+x for x in files]).filter(pl.col('node_id')==something).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I've left a few comments, but we can get this in very shortly.
This seems to be a breaking change/broke support for S3 (actually in my case i also need AWS_ENDPOINT)
I was doing this (as seen in polars docs https://docs.pola.rs/api/python/stable/reference/api/polars.scan_delta.html)
I tried moving to the lower-case version I was using for parquets
but it seems to be stuck (?) |
Delta-rs and Pola-rs use both object store but perhaps delta-rs is more lenient in whether it's uppercase or not. Can you try passing lowercase keys? |
@ion-elgreco yeah lowercase works, Actually I had this in my wrappers (lowercasing for non-delta). docs need to be updated in that case. However something broke in my end, I have a small table with lots and lots of files and fragmentation (its a unit-test-level table so it's acumulating metadata, delta_log has 2106 jsons) Im doing
this works and finishes in a few seconds while
is sloooooooow (aka I don't know if it ends), maybe this change broke something? |
You should create a separate issue for showcasing that uppercase storage keys are not used. Regarding your slowness with streaming that an issue in Pola-rs directly, not related to this change since this change only dispatches the reading to Pola-rs instead of pyarrow |
@ion-elgreco kk thanks I had another issue because I was using the same parameters to writer and reader, and "aws_s3_allow_unsafe_rename" is not supported in reader aswell (I'm not using plain s3, alternative impl). With that taken care and lower-case everything works (removing the streaming=True, it was there because of a mistake back in the day) |
aws_s3_allow_unsafe_rename is a custom storage option, specific to delta-rs. But Pola-rs should ideally filter those ones out or return some logs with a warning |
Yeah I realized its because of that, thanks for the fast answer I created the ticket for docs: However given that 1.14.0 is not announcing it as a breaking change (even if polars is still early days) i'm not sure if a "deprecation-key-translator" should be added until 2.0.0 is there. Not my call though |
Strange, looks like an issue in the parquet reader I guess, Can you share you like first couple ADD actions paths? |
there's no data/parquet files in the table, maybe thats what is unexpected? this is the __delta_log initial json (the only thing in the path as of now) I can't share the full schema though, I manually removed over 50 columns (from my company) :/, but its like this
|
That's likely the issue yeah, that there is no data. Polars should always return that column in the frame regardless whether there is data or not. You should create a separate issue for that |
Done #19854 thanks! Also attached a manually crafted delta table which shows the issue |
Hey @ion-elgreco, I was previously using
Both of them are across 17 million records Have you done any benchmarking on this change in the new fix? Thank you! 😊 |
@Ajay95 usually it was 4-5x faster. I would create a reproducible example and make an issue so that Pola-rs devs can look into it. The delta reader heavy lifting is now mostly done by the parquet scanner |
This an intermediate stage until we have something working with delta-kernel-rs.
Couple odd things: