feat: update writers to include compression method in file name #1431
Thanks for fixing this. 😄
I'd like to know more about how they use this. Each column is allowed to have a different compression in Parquet, so the Java readers need to be ready to use any compression decoder anyways. If there's not much benefit, it seems like it would be better to just write out a plain
@houqp since you provided the original insight in the other PR, can you help direct us to the source of that claim?
// TODO: what does c000 mean?
let file_name = format!(
    "part-{}-{}-c000{}.parquet",
    part,
    writer_id,
    compression_to_str(&compression)
);
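To make the snippet above easier to follow in isolation, here is a minimal sketch of how the codec suffix could be derived. The `Codec` enum, the `codec_suffix` helper, and the suffix strings are assumptions for illustration only; the PR's own `compression_to_str` operates on the `parquet` crate's compression type, and its exact mapping is not shown in this conversation.

```rust
/// Hypothetical stand-in for the writer's compression setting.
/// The real writers use the `parquet` crate's compression type instead.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Uncompressed,
    Snappy,
    Gzip,
    Zstd,
}

/// Assumed mapping from codec to file-name suffix. The leading dot is part
/// of the suffix, so uncompressed files get no extra segment at all.
pub fn codec_suffix(codec: Codec) -> &'static str {
    match codec {
        Codec::Uncompressed => "",
        Codec::Snappy => ".snappy",
        Codec::Gzip => ".gz",
        Codec::Zstd => ".zstd",
    }
}

/// Builds a Spark-style part file name with the same shape as the
/// `format!` call in the diff above.
pub fn part_file_name(part: &str, writer_id: &str, codec: Codec) -> String {
    format!(
        "part-{}-{}-c000{}.parquet",
        part,
        writer_id,
        codec_suffix(codec)
    )
}
```

With this sketch, calling `part_file_name("00000", "abc", Codec::Snappy)` yields a name ending in `-c000.snappy.parquet`, while `Codec::Uncompressed` yields a plain `-c000.parquet` ending.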
We seem to be trying to copy Spark. This information is mostly added to make debugging easier. For example, they have a writer id because they might want to see all the files written by a particular node in a Spark cluster. But we don't have nodes or retries like Spark does. So I think we can have our own convention.
At a minimum, I think all we need is {uuid}.parquet. This is just to make sure we don't have collisions between files. Adding the compression codec can be nice for debugging. If we find more things that are useful to delta-rs, we can consider adding them too.
Thanks for locating the Spark source. I tried searching myself but only found the mapping for orc files.
I left some of the comments / TODOs that were scattered in the implementation in case we want to completely copy the Spark approach. Since all writers now use this function, it should be easier in the future to make our own convention as you mentioned.
Co-authored-by: Will Jones <[email protected]>
Description
The compression name in the file name was hard-coded to snappy. However, users can now specify their own compression method, which causes a mismatch between the file name and the method actually used.
The naming convention is used as a hint by Spark and Hive to create the correct reader without having to read the header of the file.
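As a rough illustration of what "using the name as a hint" could mean, the sketch below pulls a codec segment out of a file name. The `codec_hint` function is hypothetical and is not how Spark or Hive are known to implement this; it only shows that a reader could recover the suffix without opening the file.

```rust
/// Returns the codec hint embedded in a name such as
/// `part-00000-<id>-c000.snappy.parquet`, or `None` when the name has no
/// extra segment before `.parquet`. Hypothetical helper for illustration.
pub fn codec_hint(file_name: &str) -> Option<&str> {
    // Only consider names that end in `.parquet`.
    let stem = file_name.strip_suffix(".parquet")?;
    // The hint, if any, is the last dot-separated segment of the stem.
    let (_, hint) = stem.rsplit_once('.')?;
    // Reject empty or non-alphanumeric segments to avoid obvious false hits.
    if !hint.is_empty() && hint.chars().all(|c| c.is_ascii_alphanumeric()) {
        Some(hint)
    } else {
        None
    }
}
```

A hint like this can only ever be advisory: as noted earlier in the thread, each Parquet column may use a different compression, so the file's own metadata remains the source of truth.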