
[PERF]: dont read parquet metadata multiple times #2358

Merged (4 commits, Jul 16, 2024)

Conversation

universalmind303 (Contributor):

Related to #2257.

Notes for reviewer:

This feels somewhat like a brute-force approach: I just added a parquet_metadata field to AnonymousDataFile, but I couldn't find a better way of doing it without significant refactoring. If there are better alternatives, I'm all ears 😄

I also didn't love having to pass it down to all of the different *_read_parquet_* functions, but once again I wanted to avoid unnecessary refactoring.

Local tests using the file mentioned in the issue show a better-than-2x speedup (25.782s down to 11.641s):

  • main: 25sec 782ms
  • feature branch: 11sec 641ms

@universalmind303 universalmind303 changed the title [FIX]: dont read parquet metadata multiple times [PERF]: dont read parquet metadata multiple times Jun 11, 2024
@github-actions github-actions bot added the performance (legacy) Please use the "perf" label instead label Jun 11, 2024
@@ -124,6 +125,7 @@ pub enum DataFileSource {
metadata: Option<TableMetadata>,
partition_spec: Option<PartitionSpec>,
statistics: Option<TableStatistics>,
parquet_metadata: Option<Arc<FileMetaData>>,
Member:
I believe that CatalogDataFile would also benefit from parquet_metadata caching
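The cached-footer idea in the hunk above can be sketched in isolation: store the footer metadata in an `Option<Arc<...>>` on the data file, fetch it at most once, and hand out cheap `Arc` clones afterwards. The types below (`FileMetaData`, `DataFile`, `metadata_or_fetch`) are simplified stand-ins for illustration, not Daft's actual API:

```rust
use std::sync::Arc;

// Hypothetical stand-in for the parquet footer metadata; the real PR
// stores `Option<Arc<FileMetaData>>` on the scan-task data file.
struct FileMetaData {
    num_rows: usize,
}

struct DataFile {
    path: String,
    // Populated at most once; later reads reuse it instead of
    // re-fetching the parquet footer from storage.
    parquet_metadata: Option<Arc<FileMetaData>>,
}

impl DataFile {
    // Returns the cached metadata if present, otherwise loads it via
    // `fetch` (stubbed here; in Daft this would be a footer read).
    fn metadata_or_fetch(
        &mut self,
        mut fetch: impl FnMut(&str) -> FileMetaData,
    ) -> Arc<FileMetaData> {
        if let Some(md) = &self.parquet_metadata {
            return Arc::clone(md);
        }
        let md = Arc::new(fetch(&self.path));
        self.parquet_metadata = Some(Arc::clone(&md));
        md
    }
}
```

Any later consumer (schema inference, row-group splitting, the actual read) then clones the `Arc` instead of re-reading the footer, which is where the measured speedup comes from.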

Ok(Self {
glob_paths: glob_paths.iter().map(|s| s.to_string()).collect(),
file_format_config,
schema,
storage_config,
parquet_metadata,
Member:

I believe this would pass the same parquet_metadata to different files, which would be incorrect! When we glob, we typically read only one file's parquet metadata to infer the schema. The rest of the globs and parquet metadata fetches actually occur later, where we split row groups. Note: if we have more than a certain number of files, we skip row-group splitting entirely!
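The per-file behavior the reviewer is asking for can be sketched roughly as follows: one footer read per globbed path, keyed by path, rather than reusing the first file's footer for every match. `collect_metadata` and the stand-in `FileMetaData` are hypothetical names for illustration, not Daft code:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for a file's parquet footer metadata.
struct FileMetaData {
    num_row_groups: usize,
}

// One footer read per globbed path; reusing the first file's metadata
// for every match would misreport row groups for the remaining files.
fn collect_metadata(
    paths: &[&str],
    mut read_footer: impl FnMut(&str) -> FileMetaData,
) -> HashMap<String, Arc<FileMetaData>> {
    paths
        .iter()
        .map(|&p| (p.to_string(), Arc::new(read_footer(p))))
        .collect()
}
```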

universalmind303 (Contributor, Author):

This should be fixed now to properly get the metadata per file.

I'll have to take a closer look at the split rowgroup function to see when that's being invoked.

samster25 (Member):

Looks like we're also timing out for the Ray tests. I think we might be slowing down the Ray Parquet reads by passing around the Parquet metadata.

jaychia (Contributor) commented Jul 10, 2024:

Bump on this PR -- is this in a mergeable state?

universalmind303 (Contributor, Author):

> Bump on this PR -- is this in a mergeable state?

@jaychia afaict, it should be good to go.

@samster25 samster25 merged commit c5f2d4a into Eventual-Inc:main Jul 16, 2024
40 checks passed
Vince7778 added a commit that referenced this pull request Aug 22, 2024
When reading files from AWS, the logical to physical plan translator
previously fetched the metadata for the files sequentially, which could
be very slow if there are many files. (This bug was introduced in
#2358.)

This PR makes it so that we only cache the metadata upon splitting, and
when we do so we only cache the row groups that are actually relevant to
each scan task. This avoids serializing the entire metadata for each Ray
runner, which should improve performance.
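The row-group pruning described above might look roughly like this: from the full footer, keep only the row groups a given scan task will actually read, so the metadata shipped to each worker stays small. `RowGroup`, `FileMetaData`, and `metadata_for_scan_task` are simplified stand-ins, not the actual Daft types:

```rust
// Simplified stand-ins for parquet row-group and footer metadata.
#[derive(Clone)]
struct RowGroup {
    num_rows: usize,
}

struct FileMetaData {
    row_groups: Vec<RowGroup>,
}

// Keep only the row groups a given scan task will read; out-of-range
// indices are silently dropped rather than panicking.
fn metadata_for_scan_task(full: &FileMetaData, row_group_ids: &[usize]) -> FileMetaData {
    FileMetaData {
        row_groups: row_group_ids
            .iter()
            .filter_map(|&i| full.row_groups.get(i).cloned())
            .collect(),
    }
}
```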

Benchmark results: screenshot at https://github.com/user-attachments/assets/ba013482-89be-413f-89da-8f0e8fcf4cd7
@universalmind303 universalmind303 deleted the make-parquet-faster branch January 23, 2025 06:34