How does goavro deal with complex nested types? #102

Open
masonwheeler opened this issue Feb 8, 2018 · 13 comments

@masonwheeler

Have a look at this StackOverflow answer, which depicts a single Avro schema as an array in which multiple record types are declared, which can reference each other.

This is an ideal format for describing complex structs that contain other structs, but it looks like goavro doesn't support it. Goavro expects any array to be a union type, so trying to register a codec for this schema doesn't work as intended. The obvious workaround would be to register the records as distinct types, but it doesn't appear that there's any way to do that either. While the codec-building code uses a map internally to keep track of codecs for sub-types, that map is created anew for each new codec, and there doesn't appear to be a mechanism to say "create a new codec for this schema, and here's a list of existing codecs for the struct types it references."

So what's the correct way to deal with structs that contain other structs (or slices of other structs)?

@karrick
Contributor

karrick commented Feb 9, 2018

Could you post here a valid Avro schema that goavro does not parse? If goavro fails to parse a valid schema, that is a bug. Or just post any schema you're trying to work with, and I'll be happy to take a look at it.

The goavro test suite does include all the nested schemas from the main Avro repository's test suite. As far as I know, goavro has complete Avro binary and JSON translation capability, including handling all of those schemas.

@masonwheeler
Author

@karrick Here's the example from the linked SO post:

[
{
    "type": "record",
    "namespace": "com.company.model",
    "name": "AddressRecord",
    "fields": [
        {
            "name": "streetaddress",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "zip",
            "type": "string"
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer",
    "fields": [
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "phone",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer2",
    "fields": [
        {
            "name": "x",
            "type": "string"
        },
        {
            "name": "y",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
}
]

Goavro treats this as a union type, rather than a list of schemas. If I wanted to use this to work with Customer objects, how would I do that?

@karrick
Contributor

karrick commented Feb 9, 2018

Forgive me if I say something you already know; I'm just trying to make sure we are on the same page. I'm going to restate some of what you said so you can tell me where I'm misunderstanding you.

The above JSON blob is not a valid Avro schema. An Avro schema does not work in two steps: (1) define a bunch of data types, then (2) define data in terms of those types.

In Avro, one defines the data type of the top-level object they intend to serialize. For most applications that data type is a record with lots of fields, although it could be a primitive data type or any of the supported complex data types. An Avro record has fields, each of which has its own data type, which in turn may be complex or primitive.

Here's the cool part: when defining data types, Avro provides a way to give a type a name and then refer to that type by name later. For instance, if one field of an Employee record is defined as a new Address record type, a different field elsewhere can also be declared an Address merely by referencing the previously declared name.
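
A minimal sketch of that Employee/Address idea (assuming goavro v2; the schema and import path here are illustrative, not from this issue):

package main

import (
	"fmt"

	"github.com/linkedin/goavro" // import path varies by goavro version
)

func main() {
	// "Address" is declared inline for the "home" field, then reused by
	// name for the "work" field.
	codec, err := goavro.NewCodec(`{
	    "type": "record",
	    "name": "Employee",
	    "fields": [
	        {"name": "home", "type": {
	            "type": "record",
	            "name": "Address",
	            "fields": [{"name": "street", "type": "string"}]
	        }},
	        {"name": "work", "type": "Address"}
	    ]
	}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(codec.Schema())
}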

It sounds like what you would like to do is pre-declare a bunch of data types in your version of a schema, and then later reference those while defining the data type you intend to serialize. While I agree that would be handy, that is not how the Avro schema DSL works.

If you want to work with Avro data, using Avro schemas, this library can help you. I am quite partial to using goavro for working with Avro data; however, it sounds like you'd like something that works differently from Avro. Maybe take a look at protobuf, which does allow you to declare data types the way you describe.

@karrick karrick closed this as completed Feb 9, 2018
@karrick karrick reopened this Feb 9, 2018
@karrick karrick closed this as completed Feb 9, 2018
@masonwheeler
Author

An Avro schema does not involve step (1) define a bunch of data types, then (2) define data in terms of these types.

With all due respect, you're wrong about this. The schema I posted works in Avro; it does not work in goavro. Also, the explicit capability to declare types completely separately from one another and then import them for use in other data types was added to Apache's official Avro implementation more than five years ago. It also does not appear to be supported by goavro.

This is a very important part of Avro schemas, and the ability to make use of it is a business requirement in the scenario I'm working on. Protobuf would be nice, but that's not the format that's been imposed upon our team from above.

@karrick karrick reopened this Feb 9, 2018
@karrick
Contributor

karrick commented Feb 9, 2018

Really? If so, that is really cool and interesting. I could not find that in the current version of the Avro spec. Internally, this version of goavro implements data types using a symbol table, so supporting what you describe would not be much of a stretch. My intention has always been to support only what's in the Avro spec, rather than extras added by various patches.

@masonwheeler
Author

Not sure either way about the official spec--I'll freely admit to not having read it--but it's in Apache's official reference implementation, which means there are schemas in real-world use that rely on it and that goavro can't currently interoperate with.

@karrick
Contributor

karrick commented Feb 9, 2018

I really like this idea, and I think it's worth adding. Here is how I imagine implementing an equivalent feature: when creating a new goavro.Codec, one would be able to provide a symbol table mapping data type names to their own goavro.Codec instances. The developer would obtain the schema bytes in whatever way is convenient, whether as a pre-compiled constant or from a file they read, use those schemas to create goavro.Codec instances, and then pass that list to goavro.NewCodec in some form, which would simply pre-populate the symbol table with more than the primitive types it holds now.

@masonwheeler
Author

I like that idea, especially since our schemas are coming from a Schema Registry and not from .avsc files on disk as the linked issue refers to.

There would probably need to be a new function, goavro.NewCodecEx or similar, that takes a map[string]*Codec as a parameter, to avoid making breaking API changes. (But please, Golang team, tell me more about how function overloading is something we don't want to implement!) You could probably extract most of NewCodec into a private newCodecImpl func that takes a schema name and a map. NewCodec would call this and pass newSymbolTable() as the map; NewCodecEx would merge the passed-in map with newSymbolTable() and pass that to newCodecImpl, and then the rest would probably work as-is.
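
A rough sketch of that refactor, using the names proposed in this thread (NewCodecEx, newCodecImpl, and newSymbolTable are hypothetical, not part of the released goavro API, so this won't compile against today's library):

// NewCodec keeps its current signature and behavior.
func NewCodec(schemaSpecification string) (*Codec, error) {
	return newCodecImpl(schemaSpecification, newSymbolTable())
}

// NewCodecEx seeds the symbol table with caller-supplied codecs for named
// types before parsing, so the schema can reference them by name without
// redefining them.
func NewCodecEx(schemaSpecification string, types map[string]*Codec) (*Codec, error) {
	st := newSymbolTable()
	for name, c := range types {
		st[name] = c // assumes the symbol table maps full type names to codecs
	}
	return newCodecImpl(schemaSpecification, st)
}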

@karrick
Contributor

karrick commented Feb 9, 2018

Not sure either way about the official spec--I'll freely admit to not having read it--but it's in Apache's official reference implementation, which means there are schemas in real-world use that rely on it and that goavro can't currently interoperate with.

I think this is the key point. While I agree that the feature is cool, there is no way to implement everyone's special Avro patch without adversely impacting performance and adherence to the actual specification.

If the feature is important enough to add to the official specification, then its behavior will be defined and expected, and any deviation is a bug that will be fixed. When a feature is not described in the specification, its implementation is accidental and subject to change from release to release.

@masonwheeler
Author

Perhaps, but now we're getting into philosophy. I tend to take the opposite view: the reference implementation is the true spec, because it, not the paper spec, defines how the system that people are actually using works. When its behavior is in conflict with the paper spec, except in cases of obvious bugs, the paper spec is incorrect and needs to be updated. (Non-obvious bugs, on the other hand, have a tendency to become de facto features that people end up relying on, that eventually have to be made official in one way or another. Raymond Chen has plenty of great stories about this phenomenon on his blog.)

@karrick
Contributor

karrick commented Feb 9, 2018

As a pragmatic individual, I agree that the reference implementation is the de facto spec, but it's a poor-quality one. On a whim the de facto spec can change, without any discussion or coordination with the community, costing hundreds of engineer-hours to update other systems to work with the new de facto spec, even though the change was not considered important enough to update the actual spec for.

Accidents in reference implementations should not cause this. That is irresponsible.

@JasonSanchez

The use case @masonwheeler described seems to conform to the reference implementation.

Under Schema Declaration, one of the allowed forms is "A JSON array, representing a union of embedded types."

It is important to realize that what @masonwheeler is doing is not importing a type, referencing multiple types, or doing anything fancy.

Instead, the example schema creates a union of three embedded types. It is just saying that the data can be any of these three types. It also uses "the cool part" of Avro @karrick mentioned above to reuse types previously defined.

Basically, it is saying there are three possible types the data could be, where the third type references the other two. In reality, the data will never be the first two types; those types exist only to give a feeling of better modularization when defining the third type in the union.
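
A minimal sketch of encoding a Customer through that union (assuming goavro v2's native form for unions, a single-key map naming the chosen member type; the sample values are invented, and Customer2 is dropped from the schema for brevity):

package main

import (
	"fmt"

	"github.com/linkedin/goavro" // import path varies by goavro version
)

// The union schema from this thread, trimmed (Customer2 omitted).
const unionSchema = `[
    {"type": "record", "namespace": "com.company.model", "name": "AddressRecord",
     "fields": [
        {"name": "streetaddress", "type": "string"},
        {"name": "city", "type": "string"},
        {"name": "state", "type": "string"},
        {"name": "zip", "type": "string"}]},
    {"type": "record", "namespace": "com.company.model", "name": "Customer",
     "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "phone", "type": "string"},
        {"name": "address",
         "type": {"type": "array", "items": "com.company.model.AddressRecord"}}]}
]`

func main() {
	codec, err := goavro.NewCodec(unionSchema)
	if err != nil {
		panic(err)
	}

	// In goavro's native form, a union datum is a map with exactly one key,
	// naming the chosen member type (the full name, for named types).
	native := map[string]interface{}{
		"com.company.model.Customer": map[string]interface{}{
			"firstname": "Ada",
			"lastname":  "Lovelace",
			"email":     "ada@example.com",
			"phone":     "555-0100",
			"address": []interface{}{
				map[string]interface{}{
					"streetaddress": "12 Example Street",
					"city":          "London",
					"state":         "n/a",
					"zip":           "00000",
				},
			},
		},
	}

	binary, err := codec.BinaryFromNative(nil, native)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded %d bytes\n", len(binary))
}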

@kklipsch
Contributor

I'm reasonably certain that is a valid Avro schema. As @JasonSanchez mentions, it falls under the third point in https://avro.apache.org/docs/1.8.2/spec.html#schemas, and this goes back to the 1.0 specification.

Personally, I think it's an unfortunate way to define types, because it means that any member of the union is valid, which is likely not what you mean. But it seems to be the only way to have separately named types without embedding the definitions inside the schemas where they are used.

Most of the schema parser libraries out there allow adding names to the parser's namespace. I don't know if you want to support that in goavro or not; up to you. But it is incredibly useful to be able to break schemas out by file and not have embedded types.

A preprocessing step is another approach: one that parses the schemas first and embeds the named types second. That might get complicated with cyclical/recursive definitions, though.
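
A crude sketch of the per-file workaround described above (each named type kept in its own schema string, then wrapped into a union array for goavro.NewCodec; the mini-schemas here are illustrative):

package main

import (
	"fmt"
	"strings"

	"github.com/linkedin/goavro" // import path varies by goavro version
)

// combineSchemas wraps independently stored schema definitions in a JSON
// array so that later entries may reference names declared by earlier ones.
// The result parses as a union, per the discussion above.
func combineSchemas(schemas ...string) string {
	return "[" + strings.Join(schemas, ",") + "]"
}

func main() {
	addressSchema := `{"type": "record", "namespace": "com.company.model",
	    "name": "AddressRecord",
	    "fields": [{"name": "city", "type": "string"}]}`
	customerSchema := `{"type": "record", "namespace": "com.company.model",
	    "name": "Customer",
	    "fields": [{"name": "address",
	        "type": {"type": "array", "items": "com.company.model.AddressRecord"}}]}`

	codec, err := goavro.NewCodec(combineSchemas(addressSchema, customerSchema))
	if err != nil {
		panic(err)
	}
	fmt.Println(codec.Schema())
}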
