[PERF]: don't read parquet metadata multiple times #2358
Conversation
```diff
@@ -124,6 +125,7 @@ pub enum DataFileSource {
     metadata: Option<TableMetadata>,
     partition_spec: Option<PartitionSpec>,
     statistics: Option<TableStatistics>,
+    parquet_metadata: Option<Arc<FileMetaData>>,
```
I believe that `CatalogDataFile` would also benefit from `parquet_metadata` caching.
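To illustrate, a minimal sketch of the cache-on-first-read idea, assuming `CatalogDataFile` carries a path and can lazily fetch its footer (in the real codebase it is an enum variant; every name here other than `CatalogDataFile` and `parquet_metadata` is invented):

```rust
use std::sync::Arc;

// Stand-in for the real Parquet footer type (e.g. FileMetaData).
struct FileMetaData {
    num_row_groups: usize,
}

// Hypothetical simplification of CatalogDataFile, just to show the
// caching pattern the comment above is suggesting.
struct CatalogDataFile {
    path: String,
    parquet_metadata: Option<Arc<FileMetaData>>,
}

impl CatalogDataFile {
    // Return the cached footer, fetching and caching it on first use.
    fn footer(&mut self) -> Arc<FileMetaData> {
        if let Some(md) = &self.parquet_metadata {
            return Arc::clone(md);
        }
        let md = Arc::new(fetch_footer(&self.path));
        self.parquet_metadata = Some(Arc::clone(&md));
        md
    }
}

// Placeholder for the actual (network) footer read.
fn fetch_footer(_path: &str) -> FileMetaData {
    FileMetaData { num_row_groups: 0 }
}
```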
src/daft-scan/src/glob.rs (Outdated)
```rust
Ok(Self {
    glob_paths: glob_paths.iter().map(|s| s.to_string()).collect(),
    file_format_config,
    schema,
    storage_config,
    parquet_metadata,
```
I believe that this would pass the same `parquet_metadata` to different files, which would be incorrect! When we glob, we typically read only one file's Parquet metadata to infer the schema; the rest of the glob results have their Parquet metadata fetched later, where we split row groups. Note: if we have more than a certain number of files, we skip row-group splitting entirely!
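To make the failure mode concrete, here is a hedged sketch (all names invented for illustration) contrasting sharing one footer across every glob match with fetching one per file:

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct FileMetaData; // stand-in for the real footer type

// Incorrect: every matched path gets the first file's footer, which is
// fine for schema inference but wrong for per-file row-group splits.
fn attach_shared(
    paths: &[String],
    first: Arc<FileMetaData>,
) -> HashMap<String, Arc<FileMetaData>> {
    paths
        .iter()
        .map(|p| (p.clone(), Arc::clone(&first)))
        .collect()
}

// Correct: fetch the footer that belongs to each individual file.
fn attach_per_file(paths: &[String]) -> HashMap<String, Arc<FileMetaData>> {
    paths
        .iter()
        .map(|p| (p.clone(), Arc::new(fetch_footer(p))))
        .collect()
}

// Placeholder for the actual footer read.
fn fetch_footer(_path: &str) -> FileMetaData {
    FileMetaData
}
```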
This should be fixed now; the metadata is properly fetched per file.
I'll have to take a closer look at the row-group split function to see when it's invoked.
Looks like we're also timing out on the Ray tests. I think we might be slowing down the Ray Parquet reads by passing around the Parquet metadata.
Bump on this PR -- is this in a mergeable state?
@jaychia afaict, it should be good to go.
When reading files from AWS, the logical-to-physical plan translator previously fetched the metadata for the files sequentially, which could be very slow when there are many files. (This regression was introduced in #2358.) This PR makes it so that we only cache the metadata upon splitting, and when we do so we cache only the row groups that are actually relevant to each scan task. This avoids serializing the entire metadata for each Ray runner, which should improve performance.

Benchmark results:

![Benchmark results](https://github.com/user-attachments/assets/ba013482-89be-413f-89da-8f0e8fcf4cd7)
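A rough sketch of the row-group trimming idea described above (type and field names are assumptions, not the actual Daft API):

```rust
use std::sync::Arc;

// Stand-ins for the real footer types.
#[derive(Clone)]
struct RowGroupMetaData {
    num_rows: usize,
}

struct FileMetaData {
    row_groups: Vec<RowGroupMetaData>,
}

struct ScanTask {
    path: String,
    // Only the footer entries this task actually reads are kept, so a
    // Ray runner never serializes the full metadata of a large file.
    cached_metadata: Arc<FileMetaData>,
}

// Build a scan task that caches just the relevant row groups.
fn trim_for_task(path: &str, full: &FileMetaData, indices: &[usize]) -> ScanTask {
    let row_groups = indices
        .iter()
        .map(|&i| full.row_groups[i].clone())
        .collect();
    ScanTask {
        path: path.to_string(),
        cached_metadata: Arc::new(FileMetaData { row_groups }),
    }
}
```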
Related to #2257.
Notes for reviewer:
This kind of feels like a "brute force" approach, as I just added a `parquet_metadata` field to `AnonymousDataFile`, but I couldn't find a better way of doing it without significant refactoring. If there are better alternatives, I'm all ears 😄

I also didn't really like that I had to pass it down to all of the different `*_read_parquet_*` functions (sketched below), but once again, I wanted to avoid unnecessary refactoring.

Some local tests using the file mentioned in the issue show over a 100% performance improvement.
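For context, a hedged sketch of that threading pattern (the function names are placeholders modeled on the `*_read_parquet_*` naming, not the real signatures):

```rust
use std::sync::Arc;

struct FileMetaData; // stand-in footer type
struct Table;        // stand-in result type

// The outer entry point forwards an optional pre-fetched footer...
fn stream_read_parquet(path: &str, cached: Option<Arc<FileMetaData>>) -> Table {
    inner_read_parquet(path, cached)
}

// ...so the innermost reader can skip an extra footer round trip.
fn inner_read_parquet(path: &str, cached: Option<Arc<FileMetaData>>) -> Table {
    let _footer = cached.unwrap_or_else(|| Arc::new(fetch_footer(path)));
    Table
}

// Placeholder for the actual footer read.
fn fetch_footer(_path: &str) -> FileMetaData {
    FileMetaData
}
```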