-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43624: [Go] Add JSON/UUID extension types, extend arrow -> parquet logical type mapping #43679
Conversation
Thanks for the thorough review @pitrou. The UUID extension was moved from a previously |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my very basic knowledge of Go, this looks ok to me. I'd rather @zeroshade vet it though :-)
go/arrow/extensions/json.go
Outdated
|
||
// GetOneForMarshal implements arrow.Array. | ||
func (a *JSONArray) GetOneForMarshal(i int) interface{} { | ||
return a.Value(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this instead use ValueJSON
and return the json.RawMessage
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes more sense!
go/arrow/extensions/json_test.go
Outdated
expectedJSON := strings.ReplaceAll(` | ||
{"json":"foobar"} | ||
{"json":null} | ||
{"json":{"foo":"bar"}} | ||
{"json":42} | ||
{"json":true} | ||
{"json":[1,true,"3",null,{"five":5}]}`, | ||
"\t", "") + "\n" // strip indentation, add trailing newline | ||
require.Equal(t, expectedJSON, buf.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could probably use JSONEq
instead of needing to do the ReplaceAll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RecordToJSON
returns jsonlines rather than json itself, but it is still cleaner to do a line-by-line JSONEq
comparison. I just added that.
*array.ExtensionBuilder | ||
} | ||
|
||
func NewUUIDBuilder(mem memory.Allocator) *UUIDBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed that, just added one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple nits to address first, otherwise I'm good with this
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit d47b305. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…arquet logical type mapping (apache#43679) ### Rationale for this change - Missing `JSON` extension type implementation. - Current precedent in C++ (and thereby PyArrow) is that canonical extension types do not require manual registration. - Issues like apache#43640 and apache#43624 suggest that we need to expose ways of configuring parquet types written from arrow records, but casting the underlying data presents challenges for a generalized approach. ### What changes are included in this PR? - Move `UUIDType` from `internal` to `arrow/extensions` - Implement `JSON` canonical extension type - Automatically register all canonical extension types at initialization - remove register/unregister from various locations these extension types are used - Add new `CustomParquetType` interface so extension types can specify their target `LogicalType` in Parquet - Refactor parquet `fieldToNode` to split up `PrimitiveNode` type mapping for leaves from `GroupNode` composition - Simplify parquet `LogicalType` to use only value receivers ### Are these changes tested? Yes ### Are there any user-facing changes? - `UUID` and `JSON` extension types are available to end users. - Canonical extension types will automatically be recognized in IPC without registration. - Users with their own extension type implementations may use the `CustomParquetType` interface to control Parquet conversion without needing to fork or upstream the change. * GitHub Issue: apache#43624 Authored-by: Joel Lubinitsky <[email protected]> Signed-off-by: Joel Lubinitsky <[email protected]>
#### Summary Goes with cloudquery/plugin-pb-go#445 Interesting bit is https://arrow.apache.org/blog/2024/10/23/arrow-go-18.0.0-release/#canonical-extension-types that we might be able to drop our UUID and JSON own extension types, but probably better to do that separately Had to do 3ae4778 to get the tests working, followed apache/arrow#43679 as a reference ---
Rationale for this change
JSON
extension type implementation.What changes are included in this PR?
UUIDType
frominternal
toarrow/extensions
JSON
canonical extension typeCustomParquetType
interface so extension types can specify their targetLogicalType
in ParquetfieldToNode
to split upPrimitiveNode
type mapping for leaves fromGroupNode
compositionLogicalType
to use only value receiversAre these changes tested?
Yes
Are there any user-facing changes?
UUID
andJSON
extension types are available to end users.CustomParquetType
interface to control Parquet conversion without needing to fork or upstream the change.