Fix poor performance on (local) Parquet files with many rowgroups #2257

jaychia · 2024-05-08T19:56:46Z

Describe the bug

Daft's local Parquet reader is slow on Parquet files with many small rowgroups. The Polars Parquet writer currently produces files like this (a sample file is linked below for reference), and it appears to be a corner case that Daft does not handle well.

Here is a sample file that will reproduce the issue: s3://daft-public-datasets/testing_data/lineitem.parquet

universalmind303 · 2024-05-18T01:49:40Z

@jaychia this file appears to not be public

 > aws s3 cp s3://daft-public-datasets/testing_data/lineitem.parquet ./lineitem.parquet --no-sign-request
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

jaychia · 2024-05-18T02:09:09Z

Just copied it to s3://daft-public-data/testing_data/bad-polars-lineitem.parquet which is our fully public bucket. Let me know if it's accessible!

universalmind303 · 2024-05-29T15:06:08Z

I haven't yet been able to identify a single bottleneck, but it seems like there are at least a few culprits.

- copying/moving data during concat (I think this is the biggest one)
- accessing the file metadata multiple times

I've shared a few notes in the daft slack channel https://dist-data.slack.com/archives/C052CA6Q9N1/p1716496836116429.
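To make the concat concern concrete, here is a minimal sketch in plain Python. It is not Daft's reader code; `concat_incremental` and `concat_once` are hypothetical names, and byte strings stand in for decoded rowgroups. It only assumes that appending one rowgroup at a time re-copies everything accumulated so far, which may or may not match what the reader actually does.

```python
def concat_incremental(chunks):
    """Append one chunk at a time, counting the bytes each step has to copy."""
    out = b""
    bytes_copied = 0
    for chunk in chunks:
        out = out + chunk  # builds a new buffer holding everything so far
        bytes_copied += len(out)
    return out, bytes_copied


def concat_once(chunks):
    """Gather all chunks first, then copy each byte a single time."""
    out = b"".join(chunks)
    return out, len(out)


# 1,000 tiny "rowgroups" of 10 bytes each (made-up sizes, for illustration only)
chunks = [b"x" * 10 for _ in range(1000)]
_, incremental_cost = concat_incremental(chunks)
_, single_cost = concat_once(chunks)
print(incremental_cost, single_cost)
```

Under that assumption, the copied volume grows quadratically with the number of rowgroups when appending step by step, and linearly when concatenating once at the end, which is why files with many small rowgroups would be hit hardest.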

Related to #2257.

Notes for reviewer: this kind of feels like a "brute force" approach, as I just added a `parquet_metadata` field to `AnonymousDataFile`, but I couldn't find a better way of doing it without significant refactoring. If there are better alternatives, I'm all ears 😄 I also didn't really like that I had to pass it down to all of the different `*_read_parquet_*` functions, but once again, I wanted to avoid unnecessary refactoring.

Local tests using the file mentioned in the issue show the read running more than twice as fast (about 2.2x):

- main: 25sec 782ms
- feature branch: 11sec 641ms
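The idea of reading the metadata once and passing it down can be sketched as follows. This is an illustration only, not the code from the PR: `CountingFile`, `read_footer`, `read_row_group`, `make_file` and `read_all` are hypothetical names, and the footer payload is fake bytes, not real Thrift-encoded metadata. The only Parquet detail relied on is the file layout, where the file ends with the footer, a 4-byte little-endian footer length, and the `PAR1` magic.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet files begin and end with this 4-byte magic


class CountingFile(io.BytesIO):
    """In-memory file that records how many times the footer is fetched."""
    footer_reads = 0


def read_footer(f):
    """Return the raw footer (metadata) bytes of a Parquet-style file."""
    f.seek(-8, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", f.read(4))
    if f.read(4) != MAGIC:
        raise ValueError("not a Parquet file")
    f.seek(-(8 + footer_len), io.SEEK_END)
    f.footer_reads += 1
    return f.read(footer_len)


def read_row_group(f, index, footer=None):
    """Hypothetical reader: re-fetches the footer unless the caller supplies it."""
    if footer is None:
        footer = read_footer(f)
    # A real reader would decode `footer` to locate row group `index`;
    # the sketch just returns the pair to stay small.
    return index, len(footer)


def make_file(footer=b"fake-metadata"):
    """Build a tiny file with Parquet's outer layout and a fake footer."""
    body = b"column-data"
    return CountingFile(MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC)


def read_all(f, num_row_groups, reuse_footer):
    """Read every row group, optionally fetching the footer only once."""
    footer = read_footer(f) if reuse_footer else None
    return [read_row_group(f, i, footer) for i in range(num_row_groups)]
```

With 500 row groups, `read_all(f, 500, reuse_footer=False)` fetches the footer 500 times, while `reuse_footer=True` fetches it once. The trade-off is the one described above: every read function along the way has to accept the extra argument.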

jaychia added the bug Something isn't working label May 8, 2024

jaychia added this to Daft-OSS May 8, 2024

github-project-automation bot moved this to On Deck in Daft-OSS May 8, 2024

universalmind303 mentioned this issue May 28, 2024

error writing parquet: metadata listed 100000 rows but only read: 100185 #2311

Closed

universalmind303 mentioned this issue Jun 11, 2024

[PERF]: dont read parquet metadata multiple times #2358

Merged

jaychia moved this from On Deck to In progress in Daft-OSS Jul 17, 2024

jaychia moved this from In progress to Done in Daft-OSS Jul 17, 2024
