[PERF]: don't read parquet metadata multiple times #2358
Conversation
```diff
@@ -124,6 +125,7 @@ pub enum DataFileSource {
     metadata: Option<TableMetadata>,
     partition_spec: Option<PartitionSpec>,
     statistics: Option<TableStatistics>,
+    parquet_metadata: Option<Arc<FileMetaData>>,
```
I believe that `CatalogDataFile` would also benefit from `parquet_metadata` caching.
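To illustrate, a minimal sketch of the cache-on-first-read idea, assuming `CatalogDataFile` carries a path and can lazily fetch its footer (in the real codebase it is an enum variant; every name here other than `CatalogDataFile` and `parquet_metadata` is invented):

```rust
use std::sync::Arc;

// Stand-in for the real Parquet footer type (e.g. FileMetaData).
struct FileMetaData {
    num_row_groups: usize,
}

// Hypothetical simplification of CatalogDataFile, just to show the
// caching pattern the comment above is suggesting.
struct CatalogDataFile {
    path: String,
    parquet_metadata: Option<Arc<FileMetaData>>,
}

impl CatalogDataFile {
    // Return the cached footer, fetching and caching it on first use.
    fn footer(&mut self) -> Arc<FileMetaData> {
        if let Some(md) = &self.parquet_metadata {
            return Arc::clone(md);
        }
        let md = Arc::new(fetch_footer(&self.path));
        self.parquet_metadata = Some(Arc::clone(&md));
        md
    }
}

// Placeholder for the actual (network) footer read.
fn fetch_footer(_path: &str) -> FileMetaData {
    FileMetaData { num_row_groups: 0 }
}
```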
src/daft-scan/src/glob.rs (Outdated)
```rust
Ok(Self {
    glob_paths: glob_paths.iter().map(|s| s.to_string()).collect(),
    file_format_config,
    schema,
    storage_config,
    parquet_metadata,
```
I believe that this would pass the same `parquet_metadata` to different files, which would be incorrect! When we glob, we typically read only one file's Parquet metadata to infer the schema; the rest of the glob results have their Parquet metadata fetched later, where we split row groups. Note: if we have more than a certain number of files, we skip row-group splitting entirely!
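To make the failure mode concrete, here is a hedged sketch (all names invented for illustration) contrasting sharing one footer across every glob match with fetching one per file:

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct FileMetaData; // stand-in for the real footer type

// Incorrect: every matched path gets the first file's footer, which is
// fine for schema inference but wrong for per-file row-group splits.
fn attach_shared(
    paths: &[String],
    first: Arc<FileMetaData>,
) -> HashMap<String, Arc<FileMetaData>> {
    paths
        .iter()
        .map(|p| (p.clone(), Arc::clone(&first)))
        .collect()
}

// Correct: fetch the footer that belongs to each individual file.
fn attach_per_file(paths: &[String]) -> HashMap<String, Arc<FileMetaData>> {
    paths
        .iter()
        .map(|p| (p.clone(), Arc::new(fetch_footer(p))))
        .collect()
}

// Placeholder for the actual footer read.
fn fetch_footer(_path: &str) -> FileMetaData {
    FileMetaData
}
```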
This should be fixed now; the metadata is properly fetched per file.
I'll have to take a closer look at the row-group split function to see when it's invoked.
Looks like we're also timing out on the Ray tests. I think we might be slowing down the Ray Parquet reads by passing around the Parquet metadata.
Bump on this PR -- is this in a mergeable state?
@jaychia afaict, it should be good to go.
When reading files from AWS, the logical-to-physical plan translator previously fetched the metadata for the files sequentially, which could be very slow when there are many files. (This regression was introduced in #2358.) This PR makes it so that we only cache the metadata upon splitting, and when we do so we cache only the row groups that are actually relevant to each scan task. This avoids serializing the entire metadata for each Ray runner, which should improve performance.

Benchmark results:

![Benchmark results](https://github.com/user-attachments/assets/ba013482-89be-413f-89da-8f0e8fcf4cd7)
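A rough sketch of the row-group trimming idea described above (type and field names are assumptions, not the actual Daft API):

```rust
use std::sync::Arc;

// Stand-ins for the real footer types.
#[derive(Clone)]
struct RowGroupMetaData {
    num_rows: usize,
}

struct FileMetaData {
    row_groups: Vec<RowGroupMetaData>,
}

struct ScanTask {
    path: String,
    // Only the footer entries this task actually reads are kept, so a
    // Ray runner never serializes the full metadata of a large file.
    cached_metadata: Arc<FileMetaData>,
}

// Build a scan task that caches just the relevant row groups.
fn trim_for_task(path: &str, full: &FileMetaData, indices: &[usize]) -> ScanTask {
    let row_groups = indices
        .iter()
        .map(|&i| full.row_groups[i].clone())
        .collect();
    ScanTask {
        path: path.to_string(),
        cached_metadata: Arc::new(FileMetaData { row_groups }),
    }
}
```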
Related to #2257.
Notes for reviewer:
This kind of feels like a "brute force" approach, as I just added a `parquet_metadata` field to `AnonymousDataFile`, but I couldn't find a better way of doing it without significant refactoring. If there are better alternatives, I'm all ears 😄

I also didn't really like that I had to pass it down to all of the different `*_read_parquet_*` functions (sketched below), but once again, I wanted to avoid unnecessary refactoring.

Some local tests using the file mentioned in the issue show over a 100% performance improvement.
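For context, a hedged sketch of that threading pattern (the function names are placeholders modeled on the `*_read_parquet_*` naming, not the real signatures):

```rust
use std::sync::Arc;

struct FileMetaData; // stand-in footer type
struct Table;        // stand-in result type

// The outer entry point forwards an optional pre-fetched footer...
fn stream_read_parquet(path: &str, cached: Option<Arc<FileMetaData>>) -> Table {
    inner_read_parquet(path, cached)
}

// ...so the innermost reader can skip an extra footer round trip.
fn inner_read_parquet(path: &str, cached: Option<Arc<FileMetaData>>) -> Table {
    let _footer = cached.unwrap_or_else(|| Arc::new(fetch_footer(path)));
    Table
}

// Placeholder for the actual footer read.
fn fetch_footer(_path: &str) -> FileMetaData {
    FileMetaData
}
```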