Fix poor performance on (local) Parquet files with many rowgroups #2257
Comments
@jaychia this file appears not to be public

Just copied it to …
I haven't yet been able to identify a single bottleneck, but it seems like there are at least a few culprits.
I've shared a few notes in the Daft Slack channel: https://dist-data.slack.com/archives/C052CA6Q9N1/p1716496836116429
Related to #2257.

Notes for reviewer: this kind of feels like a "brute force" approach, as I just added a `parquet_metadata` field to `AnonymousDataFile`, but I couldn't find a better way of doing it without significant refactoring. If there are better alternatives, I'm all ears 😄 I also didn't really like that I had to pass it down to all of the different `*_read_parquet_*` functions, but once again, I wanted to avoid unnecessary refactoring.

Some local tests using the file mentioned in the issue show more than a 2x speedup:
- main: 25sec 782ms
- feature branch: 11sec 641ms
Describe the bug
Daft's local Parquet reader is slow when reading Parquet files with many small rowgroups. The Polars Parquet writer currently produces files like this (a sample file is attached for reference), and this appears to be a corner case that Daft handles poorly.
Here is a sample file that will reproduce the issue:
s3://daft-public-datasets/testing_data/lineitem.parquet