Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 578: Find schema version by fingerprint of schema text. #579

Merged
merged 3 commits into from
Aug 26, 2019
Merged

Issue 578: Find schema version by fingerprint of schema text. #579

merged 3 commits into from
Aug 26, 2019

Conversation

joaopedroantonio
Copy link
Contributor

Adds a new REST endpoint to find a schema version by the fingerprint of
the schema text.

Adds a new REST endpoint to find a schema version by the fingerprint of
the schema text.
@raju-saravanan
Copy link
Collaborator

@joaopedroantonio Thanks for showing interest in the project. May I know the circumstances that you are trying to address with this PR?

@joaopedroantonio
Copy link
Contributor Author

@raju-saravanan Sure, I can give you some context. At DefinedCrowd we're improving our event sourcing infrastructure and as part of that initiative we want to add Schema Registry to store our schema definitions.

For critical services, we want to register new schema versions at deploy time and publish messages without having to contact the Schema Registry in runtime. To accomplish this, instead of using the schema version id in the message, we can use the schema text fingerprint.

This has a couple downsides, the increase in the message size and the (remote) possibility of having a hash collision but it's a trade off we're willing to make for increased resiliency and easier scaling in critical parts of our architecture.

@raju-saravanan
Copy link
Collaborator

@joaopedroantonio So will you be exposing new ser des protocol in SerDesProtocolHandlerRegistry to use fingerprint to serialize and deserialze message ? We may have a scenario where we have two schema version with same fingerprint esp when two schemas are canonically same but have different default values for certain fields in the schema, in that case with the current implementation - which picks up a first schema from the result set - deserailized message might not be the one a use expects. We can resolve this by sorting the results set and pick up with schema version with latest timestamp. Will you be ok with this ? @satishd and @michaelandrepearce can you guys share your opinions on this ?

@joaopedroantonio
Copy link
Contributor Author

@raju-saravanan I wasn't thinking about implementing a new protocol, as we'll be using the Schema Registry in our scope solely to store and manage the schemas. My suggestion is that we keep this PR simply for exposing the REST endpoint and I'll create a new one to implement the new Protocol. What you think?

Regarding the scenario of having two schema version with the same fingerprint, I agree with returning the latest schema version rather than the first or we could even make this endpoint return all of them. I'm find with both options. What you think?

@raju-saravanan
Copy link
Collaborator

raju-saravanan commented Aug 7, 2019

@joaopedroantonio:
I have answered your question below:

  • Let's come back to implementing a new protocol if somebody else in the community needs it.
  • For two schemas with the same fingerprint let's return the schema version with the latest timestamp.

I have couple of more questions,

  • Just to be clear, you will not be using schema registry during serialization and deserialization of messages. The API is only for your reference/debugging purpose to see what is the corresponding schema version for a given fingerprint.
  • Are you planning on exposing this API on schema registry client ?
  • Can you test a write a test case for the API you have exposed ?

@joaopedroantonio
Copy link
Contributor Author

joaopedroantonio commented Aug 7, 2019

@joaopedroantonio:
I have answered your question below:

  • Let's come back to implementing a new protocol if somebody else in the community needs it.
  • For two schemas with the same fingerprint let's return the schema version with the latest timestamp.

I have couple of more questions,

  • Just to be clear, you will not be using schema registry during serialization and deserialization of messages. The API is only for your reference/debugging purpose to see what is the corresponding schema version for a given fingerprint.

Exactly, we'll handle serialization/deserizalization on our side (at least for now).

  • Are you planning on exposing this API on schema registry client ?

Wasn't planning to but I'll take a look into it and might end up implementing it for the sake of fun. I'll keep it on a separate PR though.

  • Can you test a write a test case for the API you have exposed ?

Yes, I'll do it.

I'll try to update this PR until the end of the week with the two changes we've discussed:

  1. Returns the schema version with the latest timestamp.
  2. Write a test case for the API exposed.

…rsion found.

- Add unit tests to validate find by fingerprint feature.
@joaopedroantonio
Copy link
Contributor Author

@raju-saravanan implemented the changes discussed previously, let me know your thoughts. :)

@@ -82,6 +82,15 @@ default SchemaVersionInfo getSchemaVersionInfo(String schemaName, String schemaT
*/
SchemaVersionInfo getSchemaVersionInfo(String schemaName, String schemaText, boolean disableCanonicalCheck) throws SchemaNotFoundException, InvalidSchemaException, SchemaBranchNotFoundException;

/**
* If there is a version of the schema with the given fingerprint then it returns the respective {@link SchemaVersionInfo}, else it returns null.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "else it returns null" can we replace it with "else it throws SchemaNotFoundException"?

LOG.warn(String.format("Multiple schemas found for the same fingerprint: %s", fingerprint));
}

return schemas.stream()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sorting the schemas in-memory, can we get sorted schemas from the database and pick the first one? You can have a look at this code which does this.

@raju-saravanan
Copy link
Collaborator

@joaopedroantonio : Left few more comments to address.

- Sort schemas by timestamp in database, instead of in memory.
- Improve javadoc.
@joaopedroantonio
Copy link
Contributor Author

joaopedroantonio commented Aug 21, 2019

Thank you for the comments @raju-saravanan, I think they're addressed. :)

Copy link
Collaborator

@raju-saravanan raju-saravanan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@raju-saravanan raju-saravanan merged commit 72b00ce into hortonworks:master Aug 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants