Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix poor performance on (local) Parquet files with many rowgroups #2257

Open
jaychia opened this issue May 8, 2024 · 3 comments
Open

Fix poor performance on (local) Parquet files with many rowgroups #2257

jaychia opened this issue May 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jaychia
Copy link
Contributor

jaychia commented May 8, 2024

Describe the bug

Daft's local Parquet reader is slow when reading Parquet files with many small rowgroups. The Polars Parquet writer currently writes files like that (attached a sample file for reference) and this appears to be a corner-case that Daft does not perform well for.

Here is a sample file that will reproduce the issue: s3://daft-public-datasets/testing_data/lineitem.parquet

@jaychia jaychia added the bug Something isn't working label May 8, 2024
@jaychia jaychia added this to Daft-OSS May 8, 2024
@github-project-automation github-project-automation bot moved this to On Deck in Daft-OSS May 8, 2024
@universalmind303
Copy link
Contributor

@jaychia this file appears to not be public

 > aws s3 cp s3://daft-public-datasets/testing_data/lineitem.parquet ./lineitem.parquet --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

@jaychia
Copy link
Contributor Author

jaychia commented May 18, 2024

Just copied it to s3://daft-public-data/testing_data/bad-polars-lineitem.parquet which is our fully public bucket. Let me know if it's accessible!

@universalmind303
Copy link
Contributor

I haven't yet been able to identify a single bottleneck, but it seems like there are at least a few culprits.

  • copying/moving data during concat (I think this is the biggest one)
  • accessing the file metadata multiple times

I've shared a few notes in the daft slack channel https://dist-data.slack.com/archives/C052CA6Q9N1/p1716496836116429.

samster25 pushed a commit that referenced this issue Jul 16, 2024
related to #2257


Notes for reviewer:

this kind of feels like a "brute force" approach as I just added
`parquet_metadata` field to `AnonymousDataFile`, but I couldn't find a
better way of doing it without sufficient refactoring. If there are
better alternatives, I'm all ears 😄

I also didn't really like that I had to pass it down to all of the
different `*_read_parquet_*` functions, but once again, I wanted to
avoid unnecessary refactoring.



Some local tests using the file mentioned in the issue show over 100%
increase.

- main: 25sec 782ms
- feature branch: 11sec 641ms
@jaychia jaychia moved this from On Deck to In progress in Daft-OSS Jul 17, 2024
@jaychia jaychia moved this from In progress to Done in Daft-OSS Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Archived in project
Development

No branches or pull requests

2 participants