-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added toggle for Parquet V1 and V2 formats #9497
Conversation
This sounds like important piece to solve #6377
It seems like we didn't make a very deliberate decision to use file format version V2 and we later realized this is causing problems (#7953) We should switch to v1 for now, unless we can name actual benefits of having a toggle. |
per @anjalinorwood #7953 (comment) changing the |
+1000
|
Yes, I agree that we should switch to V1 since that's what's supported more broadly. I think we should keep V2 as an experimental flag. It's useful to be able to write data in the new format for others to be able to test compatibility as they develop that support in their engines. |
@anjalinorwood Based on your comments in #7953, did you have an explicit test case where something else was needed to support the V1 spec? Writing
|
it would be awesome if Trino could become the reference implementation for Parquet v2 format, but i would assume parquet-mr is going to be that no matter what we do. Toggles are not something to add haphazardly. There is a cost to them in terms of code and cognitive complexity that affects future maintainability and ability to safely evolve the system.
Good question. While we look for the answer to that, i think we should also take a look what it would take to produce files that are readable by Hive. |
If the other PR gets merged first we'll have a small merge conflict with https://github.com/trinodb/trino/pull/9569/files/fd0c06389c6977de7ecc96a70ec6959086f90446#r725263578 But it should just be another case of switching a "v2" to a "v1" |
Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611 co-authored by Josh Howard <[email protected]>
Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611 co-authored by Josh Howard <[email protected]>
Cherry-pick of trinodb/trino#9497 and trinodb/trino#9611 co-authored by Josh Howard <[email protected]>
This PR creates a toggle to write data in the Parquet V1 format using the native writer. The default version remains the same to ensure that there are no compatibility issues here.
With this change:
org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.HiveVarcharWritable
for all tested Hive versions (there are mixed errors for Parquet V2 files)Why does this seem to work? (TBH I wouldn't have expected it to)
ValuesWriter
is injected intoPrimitiveColumnWriter
here.valuesWriterFactory
creates a newValuesWriter
here.DefaultValuesWriterFactory
hereDefaultValuesWriterFactory
respects the Parquet V1 or V2 flag for constructingValuesWriter
s here.This PR is still a WIP, but I wanted to post this for discussion.