Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43624: [Go] Add JSON/UUID extension types, extend arrow -> parquet logical type mapping #43679

Merged
merged 21 commits into from
Aug 22, 2024

Conversation

joellubi
Copy link
Member

@joellubi joellubi commented Aug 13, 2024

Rationale for this change

What changes are included in this PR?

  • Move UUIDType from internal to arrow/extensions
  • Implement JSON canonical extension type
  • Automatically register all canonical extension types at initialization
    • remove register/unregister from various locations these extension types are used
  • Add new CustomParquetType interface so extension types can specify their target LogicalType in Parquet
  • Refactor parquet fieldToNode to split up PrimitiveNode type mapping for leaves from GroupNode composition
  • Simplify parquet LogicalType to use only value receivers

Are these changes tested?

Yes

Are there any user-facing changes?

  • UUID and JSON extension types are available to end users.
  • Canonical extension types will automatically be recognized in IPC without registration.
  • Users with their own extension type implementations may use the CustomParquetType interface to control Parquet conversion without needing to fork or upstream the change.

@joellubi joellubi changed the title GH-43624: [Go] Add Support for Custom Mapping b/w Arrow ExtensionType and Parquet LogicalType GH-43624: [Go] Add JSON/UUID extension types, extend arrow -> parquet logical type mapping Aug 14, 2024
@joellubi joellubi marked this pull request as ready for review August 14, 2024 19:58
@joellubi joellubi requested a review from zeroshade as a code owner August 14, 2024 19:58
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 15, 2024
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Aug 15, 2024
@github-actions github-actions bot added the awaiting changes Awaiting changes label Aug 21, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 21, 2024
@joellubi
Copy link
Member Author

Thanks for the thorough review @pitrou. The UUID extension was moved from a previously internal part of the repo, but I clearly didn't do the diligence to confirm it did in fact meet the current specification. Let me know if you have any other thoughts.

@joellubi joellubi requested a review from pitrou August 22, 2024 12:21
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 22, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my very basic knowledge of Go, this looks ok to me. I'd rather @zeroshade vet it though :-)


// GetOneForMarshal implements arrow.Array.
func (a *JSONArray) GetOneForMarshal(i int) interface{} {
return a.Value(i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this instead use ValueJSON and return the json.RawMessage instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes more sense!

Comment on lines 248 to 256
expectedJSON := strings.ReplaceAll(`
{"json":"foobar"}
{"json":null}
{"json":{"foo":"bar"}}
{"json":42}
{"json":true}
{"json":[1,true,"3",null,{"five":5}]}`,
"\t", "") + "\n" // strip indentation, add trailing newline
require.Equal(t, expectedJSON, buf.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could probably use JSONEq instead of needing to do the ReplaceAll

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RecordToJSON returns jsonlines rather than json itself, but it is still cleaner to do a line-by-line JSONEq comparison. I just added that.

*array.ExtensionBuilder
}

func NewUUIDBuilder(mem memory.Allocator) *UUIDBuilder {
Copy link
Member

@zeroshade zeroshade Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a docstring?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must have missed that, just added one!

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple nits to address first, otherwise I'm good with this

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Aug 22, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Aug 22, 2024
@joellubi joellubi merged commit d47b305 into apache:main Aug 22, 2024
26 of 28 checks passed
@joellubi joellubi removed the awaiting changes Awaiting changes label Aug 22, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit d47b305.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

mapleFU pushed a commit to mapleFU/arrow that referenced this pull request Sep 3, 2024
…arquet logical type mapping (apache#43679)

### Rationale for this change

- Missing `JSON` extension type implementation.
- Current precedent in C++ (and thereby PyArrow) is that canonical extension types do not require manual registration.
- Issues like apache#43640 and apache#43624 suggest that we need to expose ways of configuring parquet types written from arrow records, but casting the underlying data presents challenges for a generalized approach.

### What changes are included in this PR?

- Move `UUIDType` from `internal` to `arrow/extensions`
- Implement `JSON` canonical extension type
- Automatically register all canonical extension types at initialization
  - remove register/unregister from various locations these extension types are used
- Add new `CustomParquetType` interface so extension types can specify their target `LogicalType` in Parquet
- Refactor parquet `fieldToNode` to split up `PrimitiveNode` type mapping for leaves from `GroupNode` composition
- Simplify parquet `LogicalType` to use only value receivers

### Are these changes tested?

Yes

### Are there any user-facing changes?

- `UUID` and `JSON` extension types are available to end users.
- Canonical extension types will automatically be recognized in IPC without registration.
- Users with their own extension type implementations may use the `CustomParquetType` interface to control Parquet conversion without needing to fork or upstream the change.

* GitHub Issue: apache#43624

Authored-by: Joel Lubinitsky <[email protected]>
Signed-off-by: Joel Lubinitsky <[email protected]>
kodiakhq bot pushed a commit to cloudquery/plugin-sdk that referenced this pull request Dec 13, 2024
#### Summary

Goes with cloudquery/plugin-pb-go#445

Interesting bit is https://arrow.apache.org/blog/2024/10/23/arrow-go-18.0.0-release/#canonical-extension-types that we might be able to drop our UUID and JSON own extension types, but probably better to do that separately 

Had to do 3ae4778 to get the tests working, followed apache/arrow#43679 as a reference

---
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants