
feat: introduce JSON_SR format #4596

Merged (4 commits, Feb 26, 2020)
Conversation

@agavra (Contributor) commented Feb 20, 2020

Description

Introduce a new format to be able to read data produced by Confluent's JSON serializers, which prepend a magic byte and the schema ID to standard JSON. Note that we can't just use the standard deserializer because we require the mapper to use USE_BIG_DECIMAL_FOR_FLOATS to avoid deserializing decimals as doubles.
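
For orientation, here is a minimal, illustrative sketch (not the PR's actual code) of the wire format described above and of why the mapper needs USE_BIG_DECIMAL_FOR_FLOATS; the constant name SIZE_OF_SR_PREFIX mirrors the one referenced later in the review:

    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.IOException;

    public final class JsonSrSketch {

      // Confluent wire format: 1 magic byte (0x00) + 4-byte schema ID, then plain JSON.
      private static final int SIZE_OF_SR_PREFIX = 1 + Integer.BYTES;

      // USE_BIG_DECIMAL_FOR_FLOATS makes Jackson parse floating-point literals as
      // BigDecimal, so decimal values are not silently widened to double.
      private static final ObjectMapper MAPPER = new ObjectMapper()
          .enable(DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS);

      public static JsonNode readJsonSr(final byte[] bytes) throws IOException {
        if (bytes.length < SIZE_OF_SR_PREFIX || bytes[0] != 0x00) {
          throw new IllegalArgumentException("Not a schema-registry framed JSON value");
        }
        // Skip the magic byte and schema ID; the remainder is standard JSON text.
        return MAPPER.readTree(MAPPER.getFactory()
            .createParser(bytes, SIZE_OF_SR_PREFIX, bytes.length - SIZE_OF_SR_PREFIX));
      }
    }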

Testing done

Unit testing and local testing.

NOTE: still needs support for print topic

ksql> CREATE STREAM json_sr (col1 VARCHAR, col2 INT) WITH (kafka_topic='json_sr', value_format='JSON_SR', partitions=1);

 Message
----------------
 Stream created
----------------
ksql> INSERT INTO json_sr (col1, col2) VALUES ('foo', 1);
ksql> PRINT json_sr FROM BEGINNING;
Key format: UNDEFINED
Value format: KAFKA (STRING)
rowtime: 2/19/20 3:33:07 PM PST, key: <null>, value: {"COL1":"foo","COL2":1}
+----------------------+----------------------+----------------------+----------------------+
|ROWTIME               |ROWKEY                |COL1                  |COL2                  |
+----------------------+----------------------+----------------------+----------------------+
|1582155187492         |null                  |foo                   |1                     |

Reviewer checklist

  • Ensure docs are updated if necessary (e.g. if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@hjafarpour (Contributor) commented:

@agavra it seems that the incorrect value format is shown in the output. We have JSON_SR, but KAFKA (STRING) is shown:

ksql> PRINT json_sr FROM BEGINNING;
Key format: UNDEFINED
Value format: KAFKA (STRING)
rowtime: 2/19/20 3:33:07 PM PST, key: <null>, value: {"COL1":"foo","COL2":1}
+----------------------+----------------------+----------------------+----------------------+
|ROWTIME               |ROWKEY                |COL1                  |COL2                  |
+----------------------+----------------------+----------------------+----------------------+
|1582155187492         |null                  |foo                   |1                     |

@agavra (Contributor, Author) commented Feb 20, 2020

@hjafarpour - yes, see the note right above 😂 "NOTE: still needs support for print topic"

@agavra force-pushed the json_schema branch 3 times, most recently from 479c786 to 1a8cc29 on February 20, 2020 22:30
@agavra marked this pull request as ready for review February 20, 2020 23:24
@agavra requested a review from a team as a code owner February 20, 2020 23:24
* using the schema registry format (first byte magic byte, then
* four bytes for the schemaID).
*/
public static InputStream asStandardJson(@Nonnull final byte[] jsonWithMagic) {
Contributor:

Why does this need to return an InputStream? It seems that all the callers really just want the bytes.

Would it be simpler (and more efficient) just to slice the byte[] and return another one with the prefix removed?

Also, I know this InputStream won't block, but as we move to a more reactive / non-blocking model in the server we should avoid blocking constructs such as input and output streams and favour buffers.

Contributor:

I think Almog is trying to avoid the array copy for every message when deserializing. This could be premature optimisation, but I'm guessing it will hurt performance as we already know serialization is a large cost to us.

An alternative to copying the data into a new buffer would be to build a parser over the original buffer with a suitable offset, i.e. change a line like:

// Original code that deserialized the whole byte array:
MAPPER.readTree(bytes);

to

// Build a parser with the whole array and appropriate offset into that array:
// This avoids the array copy at the cost of slightly more complex code.
final int offset = isJsonSchema ? JsonSerdeUtils.SIZE_OF_SR_PREFIX : 0;
MAPPER.readTree(MAPPER.getFactory().createParser(bytes, offset, bytes.length - offset));

@purplefox (Contributor) commented Feb 25, 2020:

Copying small arrays is super fast in Java. I would bet that it would be a lot faster than constructing an InputStream around it and reading bytes one by one.

Contributor:

Also if you use ByteBuffer or Vert.x Buffers you can avoid the copy altogether as slice just references the original array with different offset and length.
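
For reference, a minimal sketch (illustrative, not code from this PR) of the zero-copy slicing described above:

    import java.nio.ByteBuffer;

    final class SliceSketch {

      // Returns a view over the same backing array, skipping the SR prefix
      // (1 magic byte + 4-byte schema ID); no bytes are copied.
      static ByteBuffer stripSrPrefix(final byte[] jsonWithMagic, final int srPrefixSize) {
        return ByteBuffer
            .wrap(jsonWithMagic, srPrefixSize, jsonWithMagic.length - srPrefixSize)
            .slice();
      }
    }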

@agavra (Contributor, Author):

Replaced with the following, inside the method:

    return mapper.readValue(
        jsonWithMagic,
        SIZE_OF_SR_PREFIX,
        jsonWithMagic.length - SIZE_OF_SR_PREFIX,
        clazz
    );

I believe this should work.
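
For context, a sketch of how that call might sit in a complete helper (assumed shape, not the merged code); Jackson's ObjectMapper.readValue(byte[], offset, len, Class) parses the sub-range directly, so the SR prefix is skipped without copying the payload:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.IOException;

    final class ReadValueSketch {

      private static final int SIZE_OF_SR_PREFIX = 5; // 1 magic byte + 4-byte schema ID

      static <T> T readJsonSr(
          final ObjectMapper mapper,
          final byte[] jsonWithMagic,
          final Class<T> clazz
      ) throws IOException {
        return mapper.readValue(
            jsonWithMagic,
            SIZE_OF_SR_PREFIX,
            jsonWithMagic.length - SIZE_OF_SR_PREFIX,
            clazz
        );
      }
    }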

@big-andy-coates (Contributor) left a review comment:

Thanks @agavra

I've got to admit, I really dislike the fact we've got magic bytes in our JSON data :(

So we've now got the JSON, which should just be pure JSON text, and JSON_SR which is JSON data prefixed with SR data. This makes sense, even if it's fecking ugly!

However, we've other formats that utilise the schema registry and I think we should be using consistent naming. i.e. AVRO should really be AVRO_SR and PROTOBUF should be PROTOBUF_SR, as they are both prefixed data. This consistent naming will make things much easier for the user to understand. In the future we may want to support pure AVRO or PROTOBUF without SR integration. We should make such a change now, as it will only get harder later. cc @derekjn, @MichaelDrogalis for a Product perspective on this.

Other thoughts on this JSON_SR format:

  • Intermediate and sink topics will inherit this schema, meaning schemas will be added to the schema registry for them. For Avro we clean these up when things are deleted. Is this happening for the other schema-registry-enabled formats?
  • There don't seem to be any doc updates for this yet. Are they coming later?

Also, have we considered any other approaches to handling the SR integration, rather than appending _SR to our formats? Thinking out loud, we could have a WITH property that controlled the integration?

Don't we still also encode the schema Id in AVRO_SCHEMA_ID in the WITH clause? We should also change this to SCHEMA_ID as part of this work.


@agavra (Contributor, Author) commented Feb 25, 2020

Thanks for the review @big-andy-coates

> However, we've other formats that utilise the schema registry and I think we should be using consistent naming. i.e. AVRO should really be AVRO_SR and PROTOBUF should be PROTOBUF_SR, as they are both prefixed data. This consistent naming will make things much easier for the user to understand. In the future we may want to support pure AVRO or PROTOBUF without SR integration. We should make such a change now, as it will only get harder later. cc @derekjn, @MichaelDrogalis for a Product perspective on this.

The idea is that this format is "temporary". When schema registry implements support for headers-based serialization instead of magic bytes we will remove JSON_SR and have all formats be just JSON. At that point, we will also be able to support vanilla AVRO and vanilla PROTOBUF without introducing new formats.

I understand what you're getting at, but I don't think renaming the formats today has an ROI that justifies the backwards-incompatible change.

> Intermediate and sink topics will inherit this schema, meaning schemas will be added to the schema registry for them. For Avro we clean these up when things are deleted. Is this happening for the other schema-registry-enabled formats?

Good call, I'll double check this.

> There don't seem to be any doc updates for this yet. Are they coming later?

Stay tuned :)

> Also, have we considered any other approaches to handling the SR integration, rather than appending _SR to our formats? Thinking out loud, we could have a WITH property that controlled the integration?

I think they are different formats, and as such, should be treated as different formats. A WITH clause would (in my opinion) be more confusing but I'm happy to open that up to product folk (@derekjn)

> Don't we still also encode the schema Id in AVRO_SCHEMA_ID in the WITH clause? We should also change this to SCHEMA_ID as part of this work.

Already opened a ticket for this #4556

@derekjn (Contributor) commented Feb 26, 2020

> I've got to admit, I really dislike the fact we've got magic bytes in our JSON data :(

@big-andy-coates big +1 from me :(

> I think they are different formats, and as such, should be treated as different formats. A WITH clause would (in my opinion) be more confusing but I'm happy to open that up to product folk (@derekjn)

I'm in agreement with @agavra that these should be treated as different formats, especially if this specific format is an interim solution until magic bytes are moved to the message header. Adding a WITH parameter is an interesting approach here but I don't feel that it would be worth it given that this basically amounts to an interim workaround. It may also lead users to believe that other established value formats could be augmented using a WITH modifier.

I'm in favor of keeping this as simple and isolated as possible by just introducing the single format JSON_SR.

@agavra (Contributor, Author) commented Feb 26, 2020

> Intermediate and sink topics will inherit this schema, meaning schemas will be added to the schema registry for them. For Avro we clean these up when things are deleted. Is this happening for the other schema-registry-enabled formats?

@big-andy-coates I double checked this, and they are cleaned up (it's independent of whether or not it's an AVRO topic)

  private String target = "?";

  public KsqlJsonDeserializer(
      final PersistenceSchema physicalSchema,
      final boolean isJsonSchema
  ) {
    this.physicalSchema = JsonSerdeUtils.validateSchema(physicalSchema);
Member:

Null check here?
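
(For illustration only, a sketch of the kind of guard being asked about, reusing the names from the snippet above; this is not the merged code:)

    import java.util.Objects;

    // Sketch: fail fast with an explicit null check before validating the schema.
    public KsqlJsonDeserializer(
        final PersistenceSchema physicalSchema,
        final boolean isJsonSchema
    ) {
      this.physicalSchema = JsonSerdeUtils.validateSchema(
          Objects.requireNonNull(physicalSchema, "physicalSchema"));
      // ... remaining assignments as in the original constructor
    }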

@agavra (Author):

I've got another PR following up this one to fix loose ends and will make sure to add it there! (I want to get this in since the build is green)

Thanks for the review @vpapavas!

@vpapavas (Member) left a review comment:

Thank you @agavra! LGTM since you plan to add documentation and fix the print topic.

@agavra (Contributor, Author) commented Feb 26, 2020

@big-andy-coates, I'm going to go ahead and merge this because it seems most of your points of contention are about product considerations which I hope @derekjn cleared up.

Given we still have some time before code freeze/release, if you feel strongly that this isn't the right way to do it let's sync up offline and come to an agreement on the best way forward and I'll address it in a future PR.

@agavra merged commit daa04d2 into confluentinc:5.5.x Feb 26, 2020
@agavra deleted the json_schema branch February 26, 2020 20:57
@agavra mentioned this pull request Feb 26, 2020