docs: klip-52: bytes data type support #7764

jzaralim · 2021-07-06T18:02:22Z

Description

Proposal for adding BYTES type to ksqlDB.

https://github.com/confluentinc/ksql/blob/fcd52b844b13e82813c1ffa97a9ac9c6b8f1dd87/design-proposals/klip-52-bytes-data-type-support.md

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

spena · 2021-07-07T20:16:24Z

design-proposals/klip-52-bytes-data-type-support.md

+* Casting between BYTES and STRING
+
+## What is not in scope
+* Fixed sized BYTES (`BYTES(3)` representing 3 bytes, for example) - This is supported by Kafka Connect by adding the `connect.fixed.size`


Why will not be supported? Doesn't fit on any use case?

Two reasons:

The purpose of this KLIP is to support raw data that doesn't fit into other types, and to make it easier to handle data from other sources. BYTES can do both of these on its own.

Fixed size byte arrays are not supported in Protobuf.

Ok. I don't know a use case for fixed bytes yet. But it's ok to not implement it for now. We don't have fixed strings either.

spena · 2021-07-07T20:35:54Z

design-proposals/klip-52-bytes-data-type-support.md

+Users can also represent BYTES as HEX strings, for example
+
+```roomsql
+> INSERT INTO STREAM VALUES ('0x5b43', 'string value');


I'm not sure about this. '0x5b43' may be seen as a String. Mysql, for instance:

insert into t1(a) values ('0x61'); mysql> select * from t1; +------------+ | a | +------------+ | 0x30783631 | +------------+

The 0x61 is converted from string to bytes. Could users also expect this conversion? Could we provide conversion functions instead of doing implicit conversions here?

I think that if users are seeing HEX outputs, then its reasonable to expect HEX strings to work as inputs (Postgres does this). That being said, not doing implicit conversions is also reasonable. Let's go forward with no implicit conversions + no casting.

Removed implicit conversions from KLIP

spena · 2021-07-07T20:37:07Z

design-proposals/klip-52-bytes-data-type-support.md

+> INSERT INTO STREAM VALUES ('0x5b43', 'string value');
+```
+
+The input and output formats can be configured using a new property, `ksql.bytes.format`.


What about letting users use functions instead of setting configs? It will be easier to identify how the data is displayed by seeing the schema with describe ... when using functions.

That works. The only downside is, the output format will always be a HEX string, though the user can use a conversion function if they want to see another output format.

spena · 2021-07-07T20:38:04Z

design-proposals/klip-52-bytes-data-type-support.md

+```
+
+The input and output formats can be configured using a new property, `ksql.bytes.format`.
+The accepted encodings are `hex`, `utf8`, `ascii`, and `base64`.


We should support functions for different encodings instead of configs. IMHO this is going to let users use different formats in different queries.

Do you mean something like to_bytes and from_bytes? Or did you have something else in mind?

Removed property from KLIP

spena · 2021-07-07T20:39:21Z

design-proposals/klip-52-bytes-data-type-support.md

+
+The following UDFs will be added:
+
+* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format.


Why do we need to specify the output encoding if the result is only bytes? to_bytes('my name', 'utf8') should convert UTF-8 String to bytes, right?

The existing encode function accepts these two parameters, so it can do things like convert a HEX string to UTF-8. Right now to_bytes basically does encode and then converts the result to bytes. Yeah just having one encoding parameter makes more sense.

Updated KLIP to remove outputEncoding

spena · 2021-07-07T20:40:18Z

design-proposals/klip-52-bytes-data-type-support.md

+
+* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format.
+The allowed encoders are the same as the ones allowed in the existing `encode` function.
+* `decode(bytes, inputEncoding, outputEncoding)` - this will convert a BYTES value in the specified encoding format to a STRING in the specified encoding format.


Same question about the output encoding. Btw, I think we should have similar function names, either to_bytes/from_bytes or encode/decode. I prefer to_bytes/from_bytes as it gives better understanding of the conversion.

Agreed, from_bytes makes more sense.

Updated KLIP with new function name and removed outputEncoding

spena · 2021-07-07T20:43:26Z

design-proposals/klip-52-bytes-data-type-support.md

+`bytes` is a primitive Protobuf type. The maximum number of bytes in a byte array is 2<sup>32</sup>.
+When converting to/from Connect data, the Avro converter ksqlDB uses converts byte arrays to ByteBuffer.


Should we have a maximum length for BYTES now that we have a max. in protobuf? I assume this limit will be handled by Protobuf, but what would happen when I try to insert 2^32+1 bytes into a protobuf topic? should it fail or truncated?

The size limit for a string in Protobuf is also 2^32. That's more than 4GB. I suspect that at that size, the ksqlDB code itself would run into problems before Protobuf serialization is a concern.

Right. Seems kafka maximum message size is 1GB? Is there a maximum size allowed in Connect?

I'm not sure if what the maximum message size is, but it's configurable with the default being 1 MB. Same with Connect.

spena · 2021-07-07T20:45:38Z

design-proposals/klip-52-bytes-data-type-support.md

+
+Some other alternatives are:
+* Use UTF-8 for all casts, and throw if it fails (BigQuery does this)
+* Not support casting. This would make BYTES the only data type that cannot be cast to STRING.


I vote for not supporting casting. Instead, users can use conversion functions with the specified format. Easier than handling configs.

That sounds good to me.

Updated KLIP to remove support for casting

vcrfxia

Exciting stuff @jzaralim ! A couple questions inline.

design-proposals/klip-52-bytes-data-type-support.md

vcrfxia · 2021-07-08T16:54:10Z

design-proposals/klip-52-bytes-data-type-support.md

+* `from_bytes(bytes, encoding)` - this will convert a BYTES value to STRING in the specified encoding.
+The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`.
+
+We will also update some of the existing STRING functions to accept BYTES as a parameter. In general, if a function works on ASCII characters for a STRING parameter,


Will these functions assume a particular encoding for the bytes, or will the functions all need to take a parameter to specify the encoding? If the latter, I wonder if it's (conceptually) simpler to just let users nest the from_bytes() function instead.

All DBs that support bytes/binary types also support functions to use strings or bytes as parameters. Using from_bytes() everytime they want to use a string function will have a different behavior, for instance:

Say we have raw = [91, 67], then:
len(raw) = 2, but len(from_bytes(raw, 'hex')) = 6 because from_bytes(raw, 'hex') = '0x5B43'

Yup, so basically they just operate on raw bytes, so no encoding is needed.

vcrfxia · 2021-07-08T16:55:29Z

design-proposals/klip-52-bytes-data-type-support.md

+* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter
+* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them
+* `substring(bytes, to, from)` - returns the section of the BYTES from the byte at position `to` to `from`
+


What about functions for converting bytes to other data types such as integers or other types with standard bytes encodings? I see below that CAST() will not be supported for this.

Most other systems don't support this. If needed, the user could cast the int to a string and then call to_bytes.

spena · 2021-07-08T17:06:06Z

design-proposals/klip-52-bytes-data-type-support.md

+* Casting between BYTES and STRING
+
+## What is not in scope
+* Fixed sized BYTES (`BYTES(3)` representing 3 bytes, for example) - This is supported by Kafka Connect by adding the `connect.fixed.size`


Ok. I don't know a use case for fixed bytes yet. But it's ok to not implement it for now. We don't have fixed strings either.

spena · 2021-07-08T17:08:28Z

design-proposals/klip-52-bytes-data-type-support.md

+
+```roomsql
+> SELECT b from STREAM;
+'0x5B43'


Is it going to be displayed with single quotes?

I don't have a strong opinion on this. Maybe not, since it's not a string.

Vote for no wrapping in single quotes.

spena · 2021-07-08T17:11:36Z

design-proposals/klip-52-bytes-data-type-support.md

+
+The following UDFs will be added:
+
+* `to_bytes(string, encoding)` - this will convert a STRING value to BYTES in the specified encoding.


This is convert a STRING value in the specified encoding to BYTES, right?

Yup. Will edit this to make it clearer.

spena · 2021-07-08T17:18:43Z

design-proposals/klip-52-bytes-data-type-support.md

+* `from_bytes(bytes, encoding)` - this will convert a BYTES value to STRING in the specified encoding.
+The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`.
+
+We will also update some of the existing STRING functions to accept BYTES as a parameter. In general, if a function works on ASCII characters for a STRING parameter,


All DBs that support bytes/binary types also support functions to use strings or bytes as parameters. Using from_bytes() everytime they want to use a string function will have a different behavior, for instance:

Say we have raw = [91, 67], then:
len(raw) = 2, but len(from_bytes(raw, 'hex')) = 6 because from_bytes(raw, 'hex') = '0x5B43'

spena · 2021-07-08T17:22:04Z

design-proposals/klip-52-bytes-data-type-support.md

+`bytes` is a primitive Protobuf type. The maximum number of bytes in a byte array is 2<sup>32</sup>.
+When converting to/from Connect data, the Avro converter ksqlDB uses converts byte arrays to ByteBuffer.


Right. Seems kafka maximum message size is 1GB? Is there a maximum size allowed in Connect?

spena · 2021-07-08T17:31:22Z

design-proposals/klip-52-bytes-data-type-support.md

+
+#### JSON/Delimited
+
+Byte arrays will be stored in JSON and CSV files as [Base64 MIME](https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html#mime) encoded binary values.


Is base64 how connect also serializes bytes types?

I wonder if we should have (not on this PR) an encoding method as part of the column type, i.e. create stream s1 (b bytes encoding 'hex'). It might even work with strings when we want them utf-16 or utf-8. Just throwing ideas here.

For JSON yes. For Delimited, it's configurable, but the default is UTF_8. I'd still rather have them be the same though.

Yeah I think that would be nice to have if the all the SR serializer encodings become configurable.

We could make our own serializers for this. Anyway, it is a nice suggestion for the future.

colinhicks

This is great @jzaralim! Had some questions/suggestions below ...

design-proposals/klip-52-bytes-data-type-support.md

colinhicks · 2021-07-09T14:44:07Z

design-proposals/klip-52-bytes-data-type-support.md

+CREATE TABLE table_name (col1 STRUCT<field BYTES>) AS ...
+```
+
+By default, BYTES will be displayed in console as HEX strings, where each byte is represented by two characters.


By default, BYTES will be displayed in console as HEX strings

Does "console" refer strictly to the ksqlDB CLI? If so, the example above should use the ksql> prompt.

Are we choosing hex here for consistency with other databases?

What about the REST API? With JSON, it's typical to send byte arrays as base64 strings.

Should the Java client automatically convert their wire representation into ByteBuffer objects?

👍 just saw later in the KLIP that JSON/delimited will use base64. That covers the REST question above.

Yeah it's the CLI. The two common CLI representations in other databases are hex and UTF-8. IMO hex is easier to use with because sometimes UTF-8 representations show weird characters.

The Java client should keep base64 strings as is until a user wants it converted to ByteBuffer (if they call getBytes on the response object.)

Added some clarifications to the KLIP on how BYTES will be handled by the REST API/Java client.

colinhicks · 2021-07-09T14:49:19Z

design-proposals/klip-52-bytes-data-type-support.md

+* `replace(bytes, old_bytes, new_bytes)` - returns the given BYTES value with all occurrences of `old_bytes` with `new_bytes`
+* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter
+* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them


Are there particular use-cases for using bytes with these functions?

Not that I can think of, but supporting binary for string functions is common across almost all databases and there are people on Stack Overflow who use them.

colinhicks · 2021-07-09T14:50:49Z

design-proposals/klip-52-bytes-data-type-support.md

+* `replace(bytes, old_bytes, new_bytes)` - returns the given BYTES value with all occurrences of `old_bytes` with `new_bytes`
+* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter
+* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them
+* `substring(bytes, to, from)` - returns the section of the BYTES from the byte at position `to` to `from`


Substring with bytes feels awkward. I would expect the function to be called slice. Can we make the existing slice function that takes a collection argument work on bytes also?

A lot of databases support substring for binary/blob, so that might actually be expected.

In general, it seems like binary array types are seen as closer to text types than actual arrays.

spena

Thanks @jzaralim. All looks good for me. Perhaps you want to make sure @colinhicks agrees on the proposal too.

* docs: klip-52: bytes data type support * update with discussion links * Update klip-52-bytes-data-type-support.md * Update klip-52-bytes-data-type-support.md * Update klip-52-bytes-data-type-support.md * Update klip-52-bytes-data-type-support.md Co-authored-by: Zara Lim <[email protected]>

docs: klip-52: bytes data type support

0a89319

jzaralim requested a review from a team as a code owner July 6, 2021 18:02

update with discussion links

fcd52b8

spena reviewed Jul 7, 2021

View reviewed changes

Update klip-52-bytes-data-type-support.md

cc62b42

vcrfxia reviewed Jul 8, 2021

View reviewed changes

spena reviewed Jul 8, 2021

View reviewed changes

Update klip-52-bytes-data-type-support.md

20ba7ea

colinhicks reviewed Jul 9, 2021

View reviewed changes

spena approved these changes Jul 9, 2021

View reviewed changes

Zara Lim added 2 commits July 9, 2021 10:32

Update klip-52-bytes-data-type-support.md

883d934

Update klip-52-bytes-data-type-support.md

7e6299c

jzaralim mentioned this pull request Jul 9, 2021

feat: add BYTES type to ksqlDB #7778

Merged

2 tasks

colinhicks approved these changes Jul 12, 2021

View reviewed changes

jzaralim merged commit 692a8bd into confluentinc:master Jul 12, 2021

jzaralim deleted the klip-52 branch July 12, 2021 19:50

JimGalasyn mentioned this pull request Jul 27, 2021

docs: cherry-pick: klip-52: bytes data type support (DOCS-9233) #7892

Merged

2 tasks


		The following UDFs will be added:

		* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format.

		`bytes` is a primitive Protobuf type. The maximum number of bytes in a byte array is 2<sup>32</sup>.
		When converting to/from Connect data, the Avro converter ksqlDB uses converts byte arrays to ByteBuffer.


		The following UDFs will be added:

		* `to_bytes(string, encoding)` - this will convert a STRING value to BYTES in the specified encoding.


		#### JSON/Delimited

		Byte arrays will be stored in JSON and CSV files as [Base64 MIME](https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html#mime) encoded binary values.

docs: klip-52: bytes data type support #7764

docs: klip-52: bytes data type support #7764

Conversation

jzaralim commented Jul 6, 2021 • edited Loading

Description

Reviewer checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vcrfxia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

colinhicks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spena left a comment

Choose a reason for hiding this comment

jzaralim commented Jul 6, 2021 •

edited

Loading