From 0a893192f819d6bee0e38f5a96c1b8ba7ce37f77 Mon Sep 17 00:00:00 2001 From: Zara Lim Date: Tue, 6 Jul 2021 10:30:33 -0700 Subject: [PATCH 1/6] docs: klip-52: bytes data type support --- design-proposals/README.md | 2 +- .../klip-52-bytes-data-type-support.md | 156 ++++++++++++++++++ 2 files changed, 157 insertions(+), 1 deletion(-) create mode 100644 design-proposals/klip-52-bytes-data-type-support.md diff --git a/design-proposals/README.md b/design-proposals/README.md index 3c40ab6dbd0f..15f823446684 100644 --- a/design-proposals/README.md +++ b/design-proposals/README.md @@ -92,4 +92,4 @@ Next KLIP number: **53** | KLIP-49: Add source stream/table semantic | Proposal | | | | | KLIP-50: Partition and offset in ksqlDB | Proposal | 0.23.0 | | [Discussion](https://github.com/confluentinc/ksql/pull/7505) | | [KLIP-51: ksqlDB .NET LINQ provider](klip-51-ksqldb .NET LINQ provider.md) | Proposal | | | [Discussion](https://github.com/confluentinc/ksql/pull/6883) | -| KLIP-52: BYTES data type support | Proposal | 0.21.0 | | | +| [KLIP-52: BYTES data type support](klip-52-bytes-data-type-support.md) | Proposal | 0.21.0 | | | diff --git a/design-proposals/klip-52-bytes-data-type-support.md b/design-proposals/klip-52-bytes-data-type-support.md new file mode 100644 index 000000000000..78c424535f08 --- /dev/null +++ b/design-proposals/klip-52-bytes-data-type-support.md @@ -0,0 +1,156 @@ +# KLIP 46 - BYTES Data Type Support + +**Author**: Zara Lim (@jzaralim) | +**Release Target**: 0.21 | +**Status**: _In Discussion_ | +**Discussion**: + +**tl;dr:** _Add support for the BYTES data type. This will allow users to work with BLOBs of data that don't fit into any other data type._ + +## Motivation and background + +Currently, ksqlDB can only handle a set of primitive types and combinations of them. +A BYTES data type would allow users to work with data that does not fit into any of +the primitive types such as images, as well as BLOB/binary data from other databases. 
+ +## What is in scope +* Add BYTES type to KSQL +* Support BYTES comparisons +* Support BYTES usage in STRUCT, MAP and ARRAY +* Serialization and de-serialization of BYTES to Avro, JSON, Protobuf and Delimited formats +* Adding/updating UDFs to support the BYTES type +* Casting between BYTES and STRING + +## What is not in scope +* Fixed sized BYTES (`BYTES(3)` representing 3 bytes, for example) - This is supported by Kafka Connect by adding the `connect.fixed.size` +key in a bytes schema, but this will not be included in this KLIP. + +## Public APIS + +### BYTES + +The BYTES data type will store an array of raw bytes of an unspecified length. The maximum size of +the array is limited by the maximum size of a Kafka message, as well as possibly by the value format being used. +The syntax is as follows: + +```roomsql +CREATE STREAM stream_name (b BYTES, COL2 STRING) AS ... +CREATE TABLE table_name (col1 STRUCT) AS ... +``` + +By default, BYTES will be displayed in console as HEX strings, where each byte is represented by two characters. +For example, the byte array `[91, 67]` will be displayed as: + +```roomsql +> SELECT b from STREAM; +'0x5B43' +``` + +Users can also represent BYTES as HEX strings, for example + +```roomsql +> INSERT INTO STREAM VALUES ('0x5b43', 'string value'); +``` + +The input and output formats can be configured using a new property, `ksql.bytes.format`. +The accepted encodings are `hex`, `utf8`, `ascii`, and `base64`. + +### UDF + +The following UDFs will be added: + +* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format. +The allowed encoders are the same as the ones allowed in the existing `encode` function. +* `decode(bytes, inputEncoding, outputEncoding)` - this will convert a BYTES value in the specified encoding format to a STRING in the specified encoding format. 
+ +We will also update some of the existing STRING functions to accept BYTES as a parameter. In general, if a function works on ASCII characters for a STRING parameter, +then it will work on bytes for a BYTES parameter. + +* `len(bytes)` - This will return the length of the stored ByteArray. +* `concat(bytes...)` - Concatenate an arbitrary number of byte fields +* `l/rpad(bytes, target_length, padding_bytes)` - pads input BYTES beginning from the left/right with the specified padding BYTES until the target length is reached. +* `replace(bytes, old_bytes, new_bytes)` - returns the given BYTES value with all occurrences of `old_bytes` replaced with `new_bytes` +* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter +* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them +* `substring(bytes, to, from)` - returns the section of the BYTES from the byte at position `to` to `from` + +## Design +### Serialization/Deserialization + +BYTES will be handled by [`java.nio.ByteBuffer`](https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html) within ksqlDB. +The underlying Kafka Connect type is the primitive `bytes` type. + +#### Avro + +`bytes` is a primitive Avro type. When converting to/from Connect data, the Avro converter ksqlDB +uses converts byte arrays to ByteBuffer. + +#### Protobuf + +`bytes` is a primitive Protobuf type. The maximum number of bytes in a byte array is 2^32. +When converting to/from Connect data, the Protobuf converter ksqlDB uses converts byte arrays to ByteBuffer. + +#### JSON/Delimited + +Byte arrays will be stored in JSON and CSV files as [Base64 MIME](https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html#mime) encoded binary values. +This is because ksqlDB and Schema Registry both use Jackson to serialize and deserialize JSON, +and Jackson serializes binaries to Base64 strings. 
+ +The ksqlDB JSON and delimited deserializers will be updated to convert Base64 strings to ByteBuffer. + +### Casting + +Casting BYTES to STRING will convert the BYTES value to a STRING value encoded by the format specified +by `ksql.bytes.format`. Casting STRING to BYTES will convert the STRING to a BYTES value decoded by the +format specified by `ksql.bytes.format`. + +Some other alternatives are: +* Use UTF-8 for all casts, and throw if it fails (BigQuery does this) +* Not support casting. This would make BYTES the only data type that cannot be cast to STRING. + +### Comparisons + +Comparisons will only be allowed between two BYTES. They will be compared lexicographically by +unsigned 8-bit values. For example, the following comparisons evaluate to `TRUE`: + +``` +[10, 11] > [10] +[12] > [10, 11] +``` + +## Test plan + +There will need to be tests for the following: +* Integration with Kafka Connect and Schema Registry +* All serialization formats +* Different types of byte data +* QTTs with all of the new and updated UDFs + +## LOEs and Delivery Milestones + +The implementation can both be broken up as follows: +* Adding the BYTES type to ksqlDB - 2 days +* Serialization/deserialization - 4 days +* Documentation - 2 days +* Add to Connect integration test - 1 day +* Casting - 2 days +* Comparisons - 2 days +* Adding UDFs + documentation - 1 week +* Buffer time and manual testing - 3 days + +## Documentation Updates + +* Add and update UDFs to `docs/developer-guide/ksqldb-reference/scalar-functions.md` +* Serialization/deserialization information in `docs/reference/serialization.md` +* Section on casting in `docs/developer-guide/ksqldb-reference/type-coercion.md` +* Detailed description of `BYTES` in `docs/reference/sql/data-types.md` +* New section in `docs/developer-guide/ksqldb-reference/operations.md` for comparisons + +## Compatibility Implications + +If a user issues a command that includes the BYTES type, then previous versions of KSQL will not 
+recognize the BYTES type, and the server will enter a DEGRADED state. + +## Security Implications + +None \ No newline at end of file From fcd52b844b13e82813c1ffa97a9ac9c6b8f1dd87 Mon Sep 17 00:00:00 2001 From: Zara Lim Date: Tue, 6 Jul 2021 11:03:18 -0700 Subject: [PATCH 2/6] update with discussion links --- design-proposals/README.md | 2 +- design-proposals/klip-52-bytes-data-type-support.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/design-proposals/README.md b/design-proposals/README.md index 15f823446684..92b15d9ff625 100644 --- a/design-proposals/README.md +++ b/design-proposals/README.md @@ -92,4 +92,4 @@ Next KLIP number: **53** | KLIP-49: Add source stream/table semantic | Proposal | | | | | KLIP-50: Partition and offset in ksqlDB | Proposal | 0.23.0 | | [Discussion](https://github.com/confluentinc/ksql/pull/7505) | | [KLIP-51: ksqlDB .NET LINQ provider](klip-51-ksqldb .NET LINQ provider.md) | Proposal | | | [Discussion](https://github.com/confluentinc/ksql/pull/6883) | -| [KLIP-52: BYTES data type support](klip-52-bytes-data-type-support.md) | Proposal | 0.21.0 | | | +| [KLIP-52: BYTES data type support](klip-52-bytes-data-type-support.md) | Proposal | 0.21.0 | | [Discussion](https://github.com/confluentinc/ksql/pull/7764) | diff --git a/design-proposals/klip-52-bytes-data-type-support.md b/design-proposals/klip-52-bytes-data-type-support.md index 78c424535f08..d4d3eba0408d 100644 --- a/design-proposals/klip-52-bytes-data-type-support.md +++ b/design-proposals/klip-52-bytes-data-type-support.md @@ -3,7 +3,7 @@ **Author**: Zara Lim (@jzaralim) | **Release Target**: 0.21 | **Status**: _In Discussion_ | -**Discussion**: +**Discussion**: https://github.com/confluentinc/ksql/pull/7764 **tl;dr:** _Add support for the BYTES data type. 
This will allow users to work with BLOBs of data that don't fit into any other data type._ From cc62b42ec3d3783101a9f9e6c7cdd1afe013cb05 Mon Sep 17 00:00:00 2001 From: Zara Lim Date: Wed, 7 Jul 2021 16:33:38 -0700 Subject: [PATCH 3/6] Update klip-52-bytes-data-type-support.md --- .../klip-52-bytes-data-type-support.md | 28 +++++-------------- 1 file changed, 7 insertions(+), 21 deletions(-) diff --git a/design-proposals/klip-52-bytes-data-type-support.md b/design-proposals/klip-52-bytes-data-type-support.md index d4d3eba0408d..959e8693f406 100644 --- a/design-proposals/klip-52-bytes-data-type-support.md +++ b/design-proposals/klip-52-bytes-data-type-support.md @@ -19,7 +19,6 @@ the primitive types such as images, as well as BLOB/binary data from other datab * Support BYTES usage in STRUCT, MAP and ARRAY * Serialization and de-serialization of BYTES to Avro, JSON, Protobuf and Delimited formats * Adding/updating UDFs to support the BYTES type -* Casting between BYTES and STRING ## What is not in scope * Fixed sized BYTES (`BYTES(3)` representing 3 bytes, for example) - This is supported by Kafka Connect by adding the `connect.fixed.size` @@ -46,22 +45,16 @@ For example, the byte array `[91, 67]` will be displayed as: '0x5B43' ``` -Users can also represent BYTES as HEX strings, for example - -```roomsql -> INSERT INTO STREAM VALUES ('0x5b43', 'string value'); -``` - -The input and output formats can be configured using a new property, `ksql.bytes.format`. -The accepted encodings are `hex`, `utf8`, `ascii`, and `base64`. +Implicit conversions to BYTES will not be supported. ### UDF The following UDFs will be added: -* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format. -The allowed encoders are the same as the ones allowed in the existing `encode` function. 
-* `decode(bytes, inputEncoding, outputEncoding)` - this will convert a BYTES value in the specified encoding format to a STRING in the specified encoding format. +* `to_bytes(string, encoding)` - this will convert a STRING value to BYTES in the specified encoding. +The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`. +* `from_bytes(bytes, encoding)` - this will convert a BYTES value to STRING in the specified encoding. +The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`. We will also update some of the existing STRING functions to accept BYTES as a parameter. In general, if a function works on ASCII characters for a STRING parameter, then it will work on bytes for a BYTES parameter. @@ -100,13 +93,7 @@ The ksqlDB JSON and delimited deserializers will be updated to convert Base64 st ### Casting -Casting BYTES to STRING will convert the BYTES value to a STRING value encoded by the format specified -by `ksql.bytes.format`. Casting STRING to BYTES will convert the STRING to a BYTES value decoded by the -format specified by `ksql.bytes.format`. - -Some other alternatives are: -* Use UTF-8 for all casts, and throw if it fails (BigQuery does this) -* Not support casting. This would make BYTES the only data type that cannot be cast to STRING. +Casting between BYTES and other data types will not be supported. Users can use `to_bytes` and `from_bytes` if they would like to convert to/from STRING. ### Comparisons @@ -133,7 +120,6 @@ The implementation can both be broken up as follows: * Serialization/deserialization - 4 days * Documentation - 2 days * Add to Connect integration test - 1 day -* Casting - 2 days * Comparisons - 2 days * Adding UDFs + documentation - 1 week * Buffer time and manual testing - 3 days @@ -153,4 +139,4 @@ recognize the BYTES type, and the server will enter a DEGRADED state. 
## Security Implications -None \ No newline at end of file +None From 20ba7ea1df62ed9c4fbb502e62c4d815f7971850 Mon Sep 17 00:00:00 2001 From: Zara Lim Date: Thu, 8 Jul 2021 14:13:00 -0700 Subject: [PATCH 4/6] Update klip-52-bytes-data-type-support.md --- design-proposals/klip-52-bytes-data-type-support.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/design-proposals/klip-52-bytes-data-type-support.md b/design-proposals/klip-52-bytes-data-type-support.md index 959e8693f406..3f890eb29d63 100644 --- a/design-proposals/klip-52-bytes-data-type-support.md +++ b/design-proposals/klip-52-bytes-data-type-support.md @@ -1,4 +1,4 @@ -# KLIP 46 - BYTES Data Type Support +# KLIP 52 - BYTES Data Type Support **Author**: Zara Lim (@jzaralim) | **Release Target**: 0.21 | @@ -51,7 +51,7 @@ Implicit conversions to BYTES will not be supported. The following UDFs will be added: -* `to_bytes(string, encoding)` - this will convert a STRING value to BYTES in the specified encoding. +* `to_bytes(string, encoding)` - this will convert a STRING value in the specified encoding to BYTES. The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`. * `from_bytes(bytes, encoding)` - this will convert a BYTES value to STRING in the specified encoding. The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`. From 883d934b8c223d6e9da95ad87a71fa8aeaabbb72 Mon Sep 17 00:00:00 2001 From: Zara Lim Date: Fri, 9 Jul 2021 10:32:29 -0700 Subject: [PATCH 5/6] Update klip-52-bytes-data-type-support.md --- design-proposals/klip-52-bytes-data-type-support.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/design-proposals/klip-52-bytes-data-type-support.md b/design-proposals/klip-52-bytes-data-type-support.md index 3f890eb29d63..a28460d3f301 100644 --- a/design-proposals/klip-52-bytes-data-type-support.md +++ b/design-proposals/klip-52-bytes-data-type-support.md @@ -29,7 +29,7 @@ key in a bytes schema, but this will not be included in this KLIP. 
### BYTES The BYTES data type will store an array of raw bytes of an unspecified length. The maximum size of -the array is limited by the maximum size of a Kafka message, as well as possibly by the value format being used. +the array is limited by the maximum size of a Kafka message, as well as possibly by the serialization format being used. The syntax is as follows: ```roomsql @@ -37,14 +37,16 @@ CREATE STREAM stream_name (b BYTES, COL2 STRING) AS ... CREATE TABLE table_name (col1 STRUCT) AS ... ``` -By default, BYTES will be displayed in console as HEX strings, where each byte is represented by two characters. +By default, BYTES will be displayed in the CLI as HEX strings, where each byte is represented by two characters. For example, the byte array `[91, 67]` will be displayed as: ```roomsql -> SELECT b from STREAM; -'0x5B43' +ksql> SELECT b from STREAM; +0x5B43 ``` +API response objects will store BYTES data as base64 strings. + Implicit conversions to BYTES will not be supported. ### UDF @@ -118,6 +120,7 @@ There will need to be tests for the following: The implementation can both be broken up as follows: * Adding the BYTES type to ksqlDB - 2 days * Serialization/deserialization - 4 days +* Add BYTES to the Java client - 2 days * Documentation - 2 days * Add to Connect integration test - 1 day * Comparisons - 2 days From 7e6299cfe0653e84425f2254f19fe67dc6d8fbec Mon Sep 17 00:00:00 2001 From: Zara Lim Date: Fri, 9 Jul 2021 10:37:15 -0700 Subject: [PATCH 6/6] Update klip-52-bytes-data-type-support.md --- design-proposals/klip-52-bytes-data-type-support.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/design-proposals/klip-52-bytes-data-type-support.md b/design-proposals/klip-52-bytes-data-type-support.md index a28460d3f301..52884b713bd4 100644 --- a/design-proposals/klip-52-bytes-data-type-support.md +++ b/design-proposals/klip-52-bytes-data-type-support.md @@ -45,7 +45,9 @@ ksql> SELECT b from STREAM; 0x5B43 ``` -API response objects 
will store BYTES data as base64 strings. +API response objects will store BYTES data as base64 strings. The Java client's `Row` class will include a new function, +`getBytes` that returns the value of a column as a `ByteBuffer` object. It will expect the raw value to be a Base64 string, +and if it's not then the function will throw an error. Implicit conversions to BYTES will not be supported.
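
The `getBytes` behaviour described in the last commit (expect a Base64 string, return a `ByteBuffer`, throw otherwise) and the CLI's hex display could look roughly like the sketch below. This is only an illustration under stated assumptions: `BytesColumnSketch`, `decodeBase64Column` and `toHexDisplay` are hypothetical names that are not part of the patch or of the ksqlDB Java client, and the sketch assumes the basic (non-MIME) Base64 alphabet for API responses.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

public class BytesColumnSketch {

  // Hypothetical stand-in for the proposed Row.getBytes: the raw column value
  // is assumed to arrive as a Base64 string and is decoded into a ByteBuffer.
  public static ByteBuffer decodeBase64Column(Object rawValue) {
    if (!(rawValue instanceof String)) {
      throw new IllegalArgumentException("Expected a Base64 string but got: " + rawValue);
    }
    // Base64.Decoder#decode throws IllegalArgumentException on invalid input,
    // which matches the "throw an error" behaviour described in the proposal.
    return ByteBuffer.wrap(Base64.getDecoder().decode((String) rawValue));
  }

  // Renders bytes the way the proposal describes the CLI output:
  // a "0x" prefix followed by two uppercase hex characters per byte.
  public static String toHexDisplay(ByteBuffer bytes) {
    StringBuilder sb = new StringBuilder("0x");
    ByteBuffer view = bytes.duplicate(); // leave the caller's position untouched
    while (view.hasRemaining()) {
      sb.append(String.format("%02X", view.get() & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // "W0M=" is the Base64 encoding of the byte array [91, 67] used in the proposal.
    ByteBuffer b = decodeBase64Column("W0M=");
    System.out.println(toHexDisplay(b)); // prints 0x5B43
  }
}
```

The basic decoder is used here because it rejects malformed input; the MIME decoder mentioned for JSON/Delimited storage silently skips characters outside the Base64 alphabet, so it would not surface the error the proposal calls for.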