-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow adding user defined metadata to ParquetSink
#10224
Allow adding user defined metadata to ParquetSink
#10224
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- this looks great to me. I had some small documentation and test tweak suggestions, but I also think this PR could be merged as is and we could do them as a follow on PR
Just let me know what you prefer
cc @devinjdangelo and @metesynnada who I think have worked on this code recently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld the code looks good to me. The expected syntax of the metadata value isn't 100% clear to me, but this could just be down to the fact I've not worked with parquet kv metadata before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking close
TO 'test_files/scratch/copy/table_with_metadata/' | ||
STORED AS PARQUET | ||
OPTIONS ( | ||
'format.metadata' 'key1:value1 key2:value2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite and I missed it the first time around -- nice eyes @devinjdangelo . @wiedld could you please add documentation to the TableParquetOptions
that documents this behavior?
Specifically, I would be interested to know "what if you want to store metadata values that have spaces in them" (key1:my value with spaces
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW it would be fine if the answer is "you will get an error / is not supported yet" -- it might just be good to document that behavior
I could see wanting to support things like key1:"my awesome value" key2:"my other awesome value"
(again not in this PR, but we should at least document how it works I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could leverage a syntax like:
(
'format.metadata.key1' 'val1',
'format.metadata.key2' 'val2 with space',
...
)
to support values with spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆 You beat me to it.
I started with the internal double quotes approach, also considered escaped spaces; then I realized that these are introducing lexical rules which did not feel SQL appropriate. Landed on the approach suggested by @devinjdangelo , commit will be up shortly.
6d6e5ba
to
b7808f1
Compare
… key parsing be a part of the format.metadata::key
b7808f1
to
5b06bf3
Compare
# accepts multiple entries with the same key (will overwrite) | ||
statement ok | ||
COPY source_table | ||
TO 'test_files/scratch/copy/table_with_metadata/' | ||
STORED AS PARQUET | ||
OPTIONS ( | ||
'format.metadata::key1' 'value', | ||
'format.metadata::key1' 'value' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overwriting is a common feature to all of the config OPTIONS(...)
. If we want to change, then I recommend a followup ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree there is no need to change it in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice -- thank you for changes @wiedld and thanks for the comments and suggestions @devinjdangelo . I think this PR looks very nice
/// Multiple entries are permitted | ||
/// ```sql | ||
/// OPTIONS ( | ||
/// 'format.metadata::key1' '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
.set("format.metadata::key3", "value with spaces ") | ||
.unwrap(); | ||
table_config | ||
.set("format.metadata::key4", "value with special chars :: :") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
# accepts multiple entries with the same key (will overwrite) | ||
statement ok | ||
COPY source_table | ||
TO 'test_files/scratch/copy/table_with_metadata/' | ||
STORED AS PARQUET | ||
OPTIONS ( | ||
'format.metadata::key1' 'value', | ||
'format.metadata::key1' 'value' | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree there is no need to change it in this PR
ParquetSink
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
Which issue does this PR close?
Closes #10223 .
Related to #9493
Rationale for this change
Restore the ability to set user-inserted metadata into the writer properties of ParquetSink, and to enable this as a SQL level API configurable.
What changes are included in this PR?
Broken up per commit:
New feature added UPDATED syntax:
Are these changes tested?
Yes. We have both integration tests for the TableParquetOptions API, as well as sqllogictests for the SQL level API.
Are there any user-facing changes?
Yes. The TableParquetOptions API is extended, as well as new SQL level API for these options.