How to push avro schemas/protocols such that the registry uses references and doesn't embed everything into definition #5573

Open
lsegv opened this issue Nov 22, 2024 · 11 comments


@lsegv

lsegv commented Nov 22, 2024

I want to build a custom registry image based on Apicurio; this image would preload all my schemas enforcing them (since I will disable pushing of artifacts by services). There is one big problem though.

If I simply loop over all my AVSC files (which I generate from AVDL) and upload them one by one, the registry does no processing on top of them and does not recognise when 2 imported messages reuse the same type, because the content of each AVSC is literally inlined to be self-contained.
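To make the inlining problem concrete, here is a minimal sketch that scans parsed AVSC documents and reports every named type that is defined in more than one file. The file names, the `com.example` namespace and the helper names are hypothetical, and the sketch only follows the basic Avro naming rules (full names and inherited namespaces); it is not something the registry does itself.

```python
import json
from collections import defaultdict

NAMED_TYPES = {"record", "enum", "fixed"}

def named_definitions(schema, namespace=None, found=None):
    """Collect the full name of every named type defined inside a parsed AVSC."""
    if found is None:
        found = []
    if isinstance(schema, list):                      # union: visit each branch
        for branch in schema:
            named_definitions(branch, namespace, found)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in NAMED_TYPES:
            name = schema["name"]
            if "." in name:                           # name is already a full name
                namespace, fullname = name.rsplit(".", 1)[0], name
            else:                                     # inherit the enclosing namespace
                namespace = schema.get("namespace", namespace)
                fullname = f"{namespace}.{name}" if namespace else name
            found.append(fullname)
            for field in schema.get("fields", []):
                named_definitions(field["type"], namespace, found)
        elif kind == "array":
            named_definitions(schema["items"], namespace, found)
        elif kind == "map":
            named_definitions(schema["values"], namespace, found)
        elif isinstance(kind, (dict, list)):          # nested type object
            named_definitions(kind, namespace, found)
    return found

def duplicated_types(schemas_by_file):
    """Return {full name: [files]} for types defined in more than one file."""
    owners = defaultdict(set)
    for filename, schema in schemas_by_file.items():
        for fullname in named_definitions(schema):
            owners[fullname].add(filename)
    return {name: sorted(files) for name, files in owners.items() if len(files) > 1}

# Hypothetical example: Msg.avsc inlines T1, which also exists as its own file.
msg = json.loads("""
{"type": "record", "name": "Msg", "namespace": "com.example",
 "fields": [{"name": "body", "type": [
   {"type": "record", "name": "T1", "fields": [{"name": "id", "type": "long"}]},
   {"type": "record", "name": "T2", "fields": [{"name": "text", "type": "string"}]}
 ]}]}
""")
t1 = json.loads("""
{"type": "record", "name": "T1", "namespace": "com.example",
 "fields": [{"name": "id", "type": "long"}]}
""")
duplicates = duplicated_types({"Msg.avsc": msg, "T1.avsc": t1})
print(duplicates)
```

Every name reported this way is a candidate for being registered once and referenced from the other artifacts instead of being uploaded inline several times.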

I started looking for an API that would let me upload protocol files (AVDL or AVPR), but I found no such thing.

When I use a Kafka producer/consumer example and let it push the definitions on the go, I see that it registers the messages properly (i.e. it uses references), but the code that does this does a lot of work: it figures out all those references and uploads them correctly to the registry.
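The reference-resolution work described above can be approximated offline. The following is a rough sketch, not the serializer's actual algorithm: it walks one self-contained AVSC, pulls every nested named type out into its own definition, and leaves only the full name behind. The function name, the sample schema and the `com.example` namespace are made up for illustration, and corner cases such as types in the null namespace are ignored.

```python
import json

NAMED_TYPES = {"record", "enum", "fixed"}

def split_schema(schema):
    """Return (top_level, extracted): nested named types become name references."""
    extracted = {}

    def fullname_of(node, namespace):
        name = node["name"]
        if "." in name:                               # already a full name
            return name, name.rsplit(".", 1)[0]
        namespace = node.get("namespace", namespace)  # inherit enclosing namespace
        return (f"{namespace}.{name}" if namespace else name), namespace

    def visit(node, namespace, is_root=False):
        if isinstance(node, list):                    # union
            return [visit(branch, namespace) for branch in node]
        if not isinstance(node, dict):                # primitive or existing reference
            return node
        kind = node.get("type")
        if kind in NAMED_TYPES:
            fullname, inner_ns = fullname_of(node, namespace)
            definition = dict(node)
            if kind == "record":
                definition["fields"] = [
                    {**field, "type": visit(field["type"], inner_ns)}
                    for field in node.get("fields", [])
                ]
            if is_root:
                return definition
            # Make the extracted definition standalone with an explicit namespace.
            definition["name"] = fullname.rsplit(".", 1)[-1]
            if inner_ns:
                definition["namespace"] = inner_ns
            extracted[fullname] = definition
            return fullname
        if kind == "array":
            return {**node, "items": visit(node["items"], namespace)}
        if kind == "map":
            return {**node, "values": visit(node["values"], namespace)}
        if isinstance(kind, (dict, list)):            # nested type object
            return {**node, "type": visit(kind, namespace)}
        return node

    return visit(schema, None, is_root=True), extracted

# Hypothetical inlined schema, similar in shape to what the Avro compiler emits.
inlined = json.loads("""
{"type": "record", "name": "Msg", "namespace": "com.example",
 "fields": [{"name": "body", "type": [
   {"type": "record", "name": "T1", "fields": [{"name": "id", "type": "long"}]},
   {"type": "record", "name": "T2", "fields": [{"name": "text", "type": "string"}]}
 ]}]}
""")
top_level, extracted = split_schema(inlined)
print(json.dumps(top_level, indent=2))
print(sorted(extracted))
```

Each extracted definition would then have to be registered as its own artifact and attached to the top-level schema as a reference; that registry-specific upload step is deliberately left out here.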

The problem is I can't just let arbitrary Java code run during the build stage (or at least I don't want to). Ideally the registry should allow importing a protocol and then properly store all those definitions and references.

What can I do here other than grabbing the code from the Kafka serializer? Maybe there is an API I do not know about?

@apicurio-bot

apicurio-bot bot commented Nov 22, 2024

Thank you for reporting an issue!

Pinging @jsenko to respond or triage.

@EricWittmann
Member

What are your expectations for how your schemas are laid out locally? Do you have control over that such that you could maintain some extra metadata?

Another place to look for this type of thing is in our maven plugin. Especially this part:

https://github.com/Apicurio/apicurio-registry/blob/main/utils/maven-plugin/src/main/java/io/apicurio/registry/maven/RegisterRegistryMojo.java#L130-L138

What do you mean precisely when you say this?

> this image would preload all my schemas enforcing them

I'd like to better understand your use-case/goals.

There is not currently a way to send a bunch of related stuff to registry all at once and have it automatically figure out the details (with references and all that). It is something we've discussed, but not yet implemented. Could be an opportunity to collaborate on something like that if you are interested.

@lsegv
Author

lsegv commented Nov 25, 2024

Regarding "extra metadata": yes, we have full control over whatever we are doing with Avro. I assume you mean metadata in the message definitions themselves?

The use case is like this: in production we do not want to let different services push their message definitions at will, which means the schema registry has to be in its correct state (with all versions of schemas) when it is running. There are several reasons for that.

  1. We have 400+ message types, and devs don't want all that registration to happen at runtime; they would rather know that once the SR is started, all schemas are there, ready to be queried.

  2. We have build pipeline tasks that check the Avro schemas committed in the current branch against the "official" schema registry for the target environment; if a task detects a violation of the compatibility requirements, it fails the build with proper messages.

  3. Packing all specific versions of schemas into the custom docker image of the schema registry means we know exactly which messages were in use (and what the last version was) at the time things were running for that specific release.

It all basically boils down to us wanting deterministic behaviour from the registry. If we are rolling back, we can just grab an older image and know that a service that shouldn't have known about v4 of some message won't magically figure it out because it was pushed to the registry earlier. We can simply wipe the db, deploy the specific docker image and have the exact same behaviour.

In dev/qa we let devs go wild and push however they like; in prod services can't do that, and messages should be preloaded by the docker image at startup.

Would be happy to collaborate, i think all of my problems are solved if we can come up with a new rest endpoint that lets us import AVDL or AVPR files. I'll be glad to contribute as well, just need some pointers on how you'd approach this.

P.S. The unions not working was because we were importing all types directly and not using references; once I let Kafka clients register in an empty registry, all unions worked fine.

@jsenko
Member

jsenko commented Dec 9, 2024

We have been planning on adding gitops support for some time now. Would that help solve the issue? If the data has to be baked into an image, I can think of a variant of gitops with the "repository" baked into the image.

@EricWittmann
Member

Hi @lsegv thank you for the additional context. Much appreciated, and sorry for the delay in responding. We've been trying to get a release out the door. 😓

Admittedly we have not previously considered a use case that bundled the artifacts into a container image. But it's an interesting idea, for the reasons you've given - essentially release immutability.

Registry already has a feature that might be leveraged to achieve part of what you want to do:

https://github.com/Apicurio/apicurio-registry/blob/main/app/src/main/java/io/apicurio/registry/ImportLifecycleBean.java

The ImportLifecycleBean imports content into the registry on startup. That feature, combined with the new "read only" feature, would result in an immutable registry instance contained in a custom container image:

https://github.com/Apicurio/apicurio-registry/blob/main/app/src/main/java/io/apicurio/registry/storage/decorator/ReadOnlyRegistryStorageDecorator.java#L41

(Side note: the read-only property is dynamic, which means it can be set via ENV var but also can be changed at runtime via REST API call. However, the latter capability can also be configured, resulting in a registry that is read-only on startup and is unable to be changed).

Using this feature would require all of the artifacts to be bundled into a .zip file of the proper format (the format created by Registry during an "export" operation). Creating the .zip file then becomes the challenge.

@EricWittmann
Member

> Would be happy to collaborate, i think all of my problems are solved if we can come up with a new rest endpoint that lets us import AVDL or AVPR files. I'll be glad to contribute as well, just need some pointers on how you'd approach this.

Can you explain this a bit more (or maybe link to some required reading)? I confess to not being as much of an expert on Avro as I could be.

@jsenko
Member

jsenko commented Dec 9, 2024

Could you also please explain this part a bit more?

> then registry does no processing on top of them and does not recognise when 2 messages were imported reused, because the content of AVSC is literally inlined to be self contained.

@lsegv
Author

lsegv commented Dec 10, 2024

> We have been planning on adding gitops support for some time now. Would that help solve the issue? If the data has to be baked into an image, I can think of a variant of gitops with the "repository" baked into the image.

I'm not sure how the gitops solution you describe works but basically anything that would allow us to have deterministic images (effectively snapshots) of what message versions were at a specific point would be great.

@EricWittmann right now, as a POC, the docker image is built on top of SR 3.0.0; it copies over the schema files and uses a script to call the REST API and import all the schemas. But as I said, the problem is that those schemas are inlined; they are not using references (which is why I believe the unions do not work). And yes, once it has loaded all the schemas, it uses the REST API to close publishing over REST.

For the near future I'm going to simply use on-demand push of message types while we integrate SR into the existing mess. But in the long run I have to solve the problem of creating immutable versions of SR based on given .avdl files. The content in the SR has to be in the same shape as it would be when you run a real application (i.e. use references instead of completely independent messages that inline everything they reference).
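One practical consequence of storing schemas with references is ordering: a preload script would presumably need to register a referenced type before the schema that refers to it. Below is a small sketch of computing such an order with Python's standard `graphlib` module, assuming each definition refers to other types by full name. The type names and the `registration_order` helper are hypothetical, and the actual upload calls are not shown.

```python
from graphlib import TopologicalSorter

def referenced_names(node, known):
    """Yield every string in a schema tree that names one of the known types."""
    if isinstance(node, str):
        if node in known:
            yield node
    elif isinstance(node, list):                      # union or list of fields
        for item in node:
            yield from referenced_names(item, known)
    elif isinstance(node, dict):
        # Only follow keys that can hold a type; "name" and "symbols" are skipped.
        for key in ("type", "items", "values", "fields"):
            if key in node:
                yield from referenced_names(node[key], known)

def registration_order(schemas):
    """Order full names so that every referenced type precedes its users."""
    known = set(schemas)
    graph = {
        name: set(referenced_names(definition, known)) - {name}  # ignore self-reference
        for name, definition in schemas.items()
    }
    # Raises graphlib.CycleError if two types reference each other.
    return list(TopologicalSorter(graph).static_order())

# Hypothetical definitions that already use references instead of inlining.
schemas = {
    "com.example.Msg": {
        "type": "record", "name": "Msg", "namespace": "com.example",
        "fields": [{"name": "body", "type": ["com.example.T1", "com.example.T2"]}],
    },
    "com.example.T2": {
        "type": "record", "name": "T2", "namespace": "com.example",
        "fields": [{"name": "parent", "type": "com.example.T1"}],
    },
    "com.example.T1": {
        "type": "record", "name": "T1", "namespace": "com.example",
        "fields": [{"name": "id", "type": "long"}],
    },
}
order = registration_order(schemas)
print(order)
```

Because the order depends only on the schema files, the same inputs always produce the same registration sequence, which fits the goal of deterministic images.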

I'll take a look at ImportLifecycleBean see how it can be used.

@jsenko I'll show it in the next reply with some screenshots...

@lsegv
Author

lsegv commented Dec 10, 2024

lseg-avro-poc.zip

I'm attaching a POC zip; you should be able to run it easily with any Java 17 VM.
It has two modules:
schema (where all the Avro related stuff happens)
poc (an application that would use Avro and SR)

Once you do mvn package, look under /schema/target/avro and you will know what it does...

If you run services.yml (that's the simplest form, with an in-memory db and no extra stuff), it will use your official image with a blank state and let the application push message definitions on demand, in which case you will see the following in the SR UI once you launch the producer/consumer (run configs should be in the project). Make sure to uncomment whichever image you want to play with:

  apicurio-registry:
    image: apicurio/apicurio-registry:3.0.0
    #image: lseg-schema-registry:latest #this is the thing we build to preload schemas and set the compatibility rules

So running the blank Apicurio image + producer + consumer you will see this: the high-level message Msg is properly registered and uses references to T1, T2 etc., and both producer and consumer are happily working without any issues.

[Screenshot 1]

You also see that it is not inlining the definitions of messages such as T1 into MSG...

[Screenshot 2]

Now look at the Dockerfile I have there. It simply copies the AVSC definitions (because I couldn't feed SR anything else through the REST API; imo it should be able to accept AVDL or AVPR...), then it uses the load-schemas.sh script to populate the messages, set the compatibility rules and lock publishing. Yes, it's dirty, but it's a POC; technically some service could be doing something while this init is running (but I'll worry about that later).

Let's build it:

cd schema
docker build -t lseg-schema-registry .

Modify the services.yaml to have this:

  apicurio-registry:
#    image: apicurio/apicurio-registry:3.0.0
    image: lseg-schema-registry:latest

When you run again with the newly built image that preloads these types, you will see that it's no longer blank: even before the producer and consumer are launched, you will see the definitions and global rules set.

But now we use no references, and all messages are inlined (because that's literally what the Avro compiler spits out), and SR seems to do no reasoning on import. I'm not saying it should, but if not, then I should have the option to import it some other way, no?

[Screenshot 3]
[Screenshot 4]

Verify that the SR is in read only mode and launch producer and consumer again.

The custom image tries to set that configuration, but for some reason it does not work, or I am doing something wrong. As a last resort you can easily modify the producer to not push schemas with .kafkaProducerConfig(createArtifactsIfNotFound = false). But I suggest opening the UI before running them and manually flipping the option.

[Screenshot 6]

Now if you run the producer and consumer they will die, like so:

[Screenshot 7]

Note they will run just fine if I were to send only T1 or some specific type instead of the union, but if I simply preload the schemas as Avro spits them out I cannot use them; I am forced to let Apicurio push messages on demand in digested form with proper references.

The error message here does not seem very helpful, since I made 0 code changes and only changed how the message definitions are delivered to SR. My guess is this happens with unions because T1 on its own is a completely different message than T1 in MSG, but even then, the producer sent Msg.T1, not T1 directly, and the consumer is literally reading Msg.T1, so it should work imo...

@EricWittmann
Member

Thanks for all of this additional information and the examples and POC! This is very helpful to understand what you're trying to do. We'll discuss this as soon as we get a chance. 👍

@lsegv
Author

lsegv commented Dec 12, 2024

Btw the class LsegRecordIdStrategy is a hack I had to do to make sure all messages are declared with the same format, otherwise you'd see

MSG
com.lseg.avro.T1
com.lseg.avro.T2
com.lseg.avro.T3

One of the messages would not have the package.
I'll help if you have any questions or suggestions.
