Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: clarify language around user-defined types #604

Merged
merged 6 commits into from
Apr 17, 2024

Conversation

vbarua
Copy link
Member

@vbarua vbarua commented Feb 21, 2024

No description provided.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
```yaml
name: u32
structure: "FIXEDBINARY<4>"
```
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding, if the intent of the structure field is to provide a way for systems that do not support the user-define type to create and transfer values of it, this won't work very well because the systems in questions won't have enough information to create a binary version of that type.

That being said, implicitly if there is no structure and a system without support for a type is given a value of that type, the only thing that system can do is pass the value around as a binary blob.

@vbarua
Copy link
Member Author

vbarua commented Feb 21, 2024

Context: https://substrait.slack.com/archives/C02D7CTQXHD/p1707328717290129

Inlining for posterity

Victor Barua
I'm trying to understand some of the language around User-Defined Types here and how it related to literal value in the protobufs.
"The structure field of a type is only intended to inform systems that don’t have built-in support for the type how they can transfer the data type from one point to another without unnecessary serialization/deserialization and without loss of type safety."
and
"The structure field is optional. If not specified, the type class is considered to be fully opaque. This implies that a systems without built-in support for the type cannot manipulate values in any way, including moving and cloning."
Questions I have in no particular order:

  1. Does the struct definition define a particular encoding in a system, or does it just define the wire format?
  2. Why does having a structure field allow a system to transfer data without unnecessary serde? My assumption is that without structure field, a system can serde an UDT all and can only pass them around as opaque blobs.
  3. If the struct is not specified, why can't a system clone and move values? Relatedly, what does cloning and moving mean in a system?

David Sisson
We've never enforced the encoding in a system before so I'd believe it's just the wire format. The structure field should obviate the need for serde when being passed to another Substrait consumer but it's up to that consumer to serde it if it doesn't fit their internal format. Not sure why clone and move are off limits -- a Substrait gateway could just pass along the any protobuf along without touching it just fine.
An example of a UDT is the geometry extension -- instead of keeping track of a complicated type such as a shape with multiple lines it's just an opaque type with routines for querying more specific type information.

Victor Barua
"a Substrait gateway could just pass along the any protobuf along without touching it just fine."
That was what I was thinking, hence confusion around the clone and move stuff. Someone at work is working on serializing some of our user-defined internal types and they and I realised we had opposite interpretations of some of the wording here 😓

Weston Pace
The PR adding these statements simply refers to a slack discussion which is beyond the slack history 😢

@vbarua vbarua requested a review from jvanstraten February 21, 2024 00:18
@vbarua vbarua changed the title docs: clarify language around user-define types docs: clarify language around user-defined types Feb 21, 2024
EpsilonPrime
EpsilonPrime previously approved these changes Feb 22, 2024
site/docs/types/type_classes.md Show resolved Hide resolved
@vbarua vbarua force-pushed the vbarua/user-defined-type-thonks branch from bf21fd6 to bd5956d Compare March 28, 2024 21:36
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
@vbarua
Copy link
Member Author

vbarua commented Apr 16, 2024

buf is complaining about

proto/substrait/algebra.proto:895:9:Field "2" on message "UserDefined" moved from outside to inside a oneof.

I ran a couple of quick checks on this.

First, decoding a binary encoded version of the message against the old and new protobufs:

substrait git:(main) echo 8a023612340a2e747970652e676f6f676c65617069732e636f6d2f676f6f676c652e70726f746f6275662e496e74363456616c75651202082a | xxd -r -p | protoc --proto_path=proto --decode substrait.Expression.Literal proto/substrait/algebra.proto
user_defined {
  value {
    type_url: "type.googleapis.com/google.protobuf.Int64Value"
    value: "\010*"
  }
}

substrait git:(vbarua/user-defined-type-thonks) echo 8a023612340a2e747970652e676f6f676c65617069732e636f6d2f676f6f676c652e70726f746f6275662e496e74363456616c75651202082a | xxd -r -p | protoc --proto_path=proto --decode substrait.Expression.Literal proto/substrait/algebra.proto
user_defined {
  value {
    type_url: "type.googleapis.com/google.protobuf.Int64Value"
    value: "\010*"
  }
}

The decoded message is the same across both version. Checking the JSON, both the old and new protobufs generate the same JSON

{
  "userDefined": {
    "typeReference": 0,
    "value": {
      "@type": "type.googleapis.com/google.protobuf.Int64Value",
      "value": "42"
    },
    "typeParameters": []
  },
  "nullable": false,
  "typeVariationReference": 0
}

The rule it is triggering is FIELD_SAME_ONEOF.

This checks that no field moves into or out of a oneof or changes the oneof it's a part of. Doing so is almost always a generated source code breaking change. Technically there are exceptions with regard to wire compatibility, but the rules are complex enough that it's safer to never change a field's presence inside or outside a given oneof.

In this case, following the link shows that we have something similar to one of the exception cases:

Move fields into or out of a oneof: You may lose some of your information (some fields will be cleared) after the message is serialized and parsed. However, you can safely move a single field into a new oneof and may be able to move multiple fields if it is known that only one is ever set.

Effectively the issue that they are guarding against is moving a field into a oneof, and then if that field is set it will clear other messages in the oneof. In this case, as there is only 1 existing field (value) and we are adding a new field (struct), there should be no existing code that tries to set both.

IMO we can ignore this for this PR.

@vbarua vbarua requested a review from EpsilonPrime April 16, 2024 01:24
Copy link
Member

@EpsilonPrime EpsilonPrime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this should be safe (even the buf guidance says it can be but it is just doing the easy check instead).

@vbarua vbarua merged commit 39e3ebb into main Apr 17, 2024
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants