-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: klip-52: bytes data type support #7764
Conversation
* Casting between BYTES and STRING | ||
|
||
## What is not in scope | ||
* Fixed sized BYTES (`BYTES(3)` representing 3 bytes, for example) - This is supported by Kafka Connect by adding the `connect.fixed.size` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why will not be supported? Doesn't fit on any use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two reasons:
- The purpose of this KLIP is to support raw data that doesn't fit into other types, and to make it easier to handle data from other sources. BYTES can do both of these on its own.
- Fixed size byte arrays are not supported in Protobuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I don't know a use case for fixed bytes yet. But it's ok to not implement it for now. We don't have fixed strings either.
Users can also represent BYTES as HEX strings, for example | ||
|
||
```roomsql | ||
> INSERT INTO STREAM VALUES ('0x5b43', 'string value'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this. '0x5b43'
may be seen as a String. Mysql, for instance:
insert into t1(a) values ('0x61');
mysql> select * from t1;
+------------+
| a |
+------------+
| 0x30783631 |
+------------+
The 0x61
is converted from string to bytes. Could users also expect this conversion? Could we provide conversion functions instead of doing implicit conversions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if users are seeing HEX outputs, then its reasonable to expect HEX strings to work as inputs (Postgres does this). That being said, not doing implicit conversions is also reasonable. Let's go forward with no implicit conversions + no casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed implicit conversions from KLIP
> INSERT INTO STREAM VALUES ('0x5b43', 'string value'); | ||
``` | ||
|
||
The input and output formats can be configured using a new property, `ksql.bytes.format`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about letting users use functions instead of setting configs? It will be easier to identify how the data is displayed by seeing the schema with describe ...
when using functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works. The only downside is, the output format will always be a HEX string, though the user can use a conversion function if they want to see another output format.
``` | ||
|
||
The input and output formats can be configured using a new property, `ksql.bytes.format`. | ||
The accepted encodings are `hex`, `utf8`, `ascii`, and `base64`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should support functions for different encodings instead of configs. IMHO this is going to let users use different formats in different queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean something like to_bytes
and from_bytes
? Or did you have something else in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed property from KLIP
|
||
The following UDFs will be added: | ||
|
||
* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to specify the output encoding if the result is only bytes? to_bytes('my name', 'utf8')
should convert UTF-8 String to bytes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing encode
function accepts these two parameters, so it can do things like convert a HEX string to UTF-8. Right now to_bytes
basically does encode
and then converts the result to bytes. Yeah just having one encoding parameter makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated KLIP to remove outputEncoding
|
||
* `to_bytes(string, inputEncoding, outputEncoding)` - this will convert a STRING value in the specified encoding format to a BYTES in the specified encoding format. | ||
The allowed encoders are the same as the ones allowed in the existing `encode` function. | ||
* `decode(bytes, inputEncoding, outputEncoding)` - this will convert a BYTES value in the specified encoding format to a STRING in the specified encoding format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about the output encoding. Btw, I think we should have similar function names, either to_bytes/from_bytes
or encode/decode
. I prefer to_bytes/from_bytes
as it gives better understanding of the conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, from_bytes
makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated KLIP with new function name and removed outputEncoding
`bytes` is a primitive Protobuf type. The maximum number of bytes in a byte array is 2<sup>32</sup>. | ||
When converting to/from Connect data, the Avro converter ksqlDB uses converts byte arrays to ByteBuffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a maximum length for BYTES now that we have a max. in protobuf? I assume this limit will be handled by Protobuf, but what would happen when I try to insert 2^32+1 bytes into a protobuf topic? should it fail or truncated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size limit for a string in Protobuf is also 2^32. That's more than 4GB. I suspect that at that size, the ksqlDB code itself would run into problems before Protobuf serialization is a concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Seems kafka maximum message size is 1GB? Is there a maximum size allowed in Connect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if what the maximum message size is, but it's configurable with the default being 1 MB. Same with Connect.
|
||
Some other alternatives are: | ||
* Use UTF-8 for all casts, and throw if it fails (BigQuery does this) | ||
* Not support casting. This would make BYTES the only data type that cannot be cast to STRING. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for not supporting casting. Instead, users can use conversion functions with the specified format. Easier than handling configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated KLIP to remove support for casting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting stuff @jzaralim ! A couple questions inline.
* `from_bytes(bytes, encoding)` - this will convert a BYTES value to STRING in the specified encoding. | ||
The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`. | ||
|
||
We will also update some of the existing STRING functions to accept BYTES as a parameter. In general, if a function works on ASCII characters for a STRING parameter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will these functions assume a particular encoding for the bytes, or will the functions all need to take a parameter to specify the encoding? If the latter, I wonder if it's (conceptually) simpler to just let users nest the from_bytes()
function instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All DBs that support bytes/binary types also support functions to use strings or bytes as parameters. Using from_bytes()
everytime they want to use a string function will have a different behavior, for instance:
Say we have raw = [91, 67], then:
len(raw) = 2
, but len(from_bytes(raw, 'hex')) = 6
because from_bytes(raw, 'hex') = '0x5B43'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, so basically they just operate on raw bytes, so no encoding is needed.
* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter | ||
* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them | ||
* `substring(bytes, to, from)` - returns the section of the BYTES from the byte at position `to` to `from` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about functions for converting bytes to other data types such as integers or other types with standard bytes encodings? I see below that CAST()
will not be supported for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most other systems don't support this. If needed, the user could cast the int to a string and then call to_bytes
.
* Casting between BYTES and STRING | ||
|
||
## What is not in scope | ||
* Fixed sized BYTES (`BYTES(3)` representing 3 bytes, for example) - This is supported by Kafka Connect by adding the `connect.fixed.size` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I don't know a use case for fixed bytes yet. But it's ok to not implement it for now. We don't have fixed strings either.
|
||
```roomsql | ||
> SELECT b from STREAM; | ||
'0x5B43' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it going to be displayed with single quotes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on this. Maybe not, since it's not a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vote for no wrapping in single quotes.
|
||
The following UDFs will be added: | ||
|
||
* `to_bytes(string, encoding)` - this will convert a STRING value to BYTES in the specified encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is convert a STRING value in the specified encoding to BYTES, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Will edit this to make it clearer.
* `from_bytes(bytes, encoding)` - this will convert a BYTES value to STRING in the specified encoding. | ||
The accepted encoders are `hex`, `utf8`, `ascii`, and `base64`. | ||
|
||
We will also update some of the existing STRING functions to accept BYTES as a parameter. In general, if a function works on ASCII characters for a STRING parameter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All DBs that support bytes/binary types also support functions to use strings or bytes as parameters. Using from_bytes()
everytime they want to use a string function will have a different behavior, for instance:
Say we have raw = [91, 67], then:
len(raw) = 2
, but len(from_bytes(raw, 'hex')) = 6
because from_bytes(raw, 'hex') = '0x5B43'
`bytes` is a primitive Protobuf type. The maximum number of bytes in a byte array is 2<sup>32</sup>. | ||
When converting to/from Connect data, the Avro converter ksqlDB uses converts byte arrays to ByteBuffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Seems kafka maximum message size is 1GB? Is there a maximum size allowed in Connect?
|
||
#### JSON/Delimited | ||
|
||
Byte arrays will be stored in JSON and CSV files as [Base64 MIME](https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html#mime) encoded binary values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is base64 how connect also serializes bytes types?
I wonder if we should have (not on this PR) an encoding method as part of the column type, i.e. create stream s1 (b bytes encoding 'hex')
. It might even work with strings when we want them utf-16 or utf-8. Just throwing ideas here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For JSON yes. For Delimited, it's configurable, but the default is UTF_8. I'd still rather have them be the same though.
Yeah I think that would be nice to have if the all the SR serializer encodings become configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make our own serializers for this. Anyway, it is a nice suggestion for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great @jzaralim! Had some questions/suggestions below ...
CREATE TABLE table_name (col1 STRUCT<field BYTES>) AS ... | ||
``` | ||
|
||
By default, BYTES will be displayed in console as HEX strings, where each byte is represented by two characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, BYTES will be displayed in console as HEX strings
- Does "console" refer strictly to the ksqlDB CLI? If so, the example above should use the
ksql>
prompt. - Are we choosing hex here for consistency with other databases?
- What about the REST API? With JSON, it's typical to send byte arrays as base64 strings.
- Should the Java client automatically convert their wire representation into ByteBuffer objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 just saw later in the KLIP that JSON/delimited will use base64. That covers the REST question above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yeah it's the CLI. The two common CLI representations in other databases are hex and UTF-8. IMO hex is easier to use with because sometimes UTF-8 representations show weird characters.
- The Java client should keep base64 strings as is until a user wants it converted to ByteBuffer (if they call
getBytes
on the response object.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some clarifications to the KLIP on how BYTES will be handled by the REST API/Java client.
* `replace(bytes, old_bytes, new_bytes)` - returns the given BYTES value with all occurrences of `old_bytes` with `new_bytes` | ||
* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter | ||
* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there particular use-cases for using bytes with these functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I can think of, but supporting binary for string functions is common across almost all databases and there are people on Stack Overflow who use them.
* `replace(bytes, old_bytes, new_bytes)` - returns the given BYTES value with all occurrences of `old_bytes` with `new_bytes` | ||
* `split(bytes, delimiter)` - splits a BYTES value into an array of BYTES based on a delimiter | ||
* `splittomap(bytes, entryDelimiter, kvDelimiter)` - splits a BYTES value into key-value pairs based on a delimiter and creates a MAP from them | ||
* `substring(bytes, to, from)` - returns the section of the BYTES from the byte at position `to` to `from` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substring with bytes feels awkward. I would expect the function to be called slice
. Can we make the existing slice
function that takes a collection argument work on bytes also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of databases support substring
for binary/blob, so that might actually be expected.
In general, it seems like binary array types are seen as closer to text types than actual arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jzaralim. All looks good for me. Perhaps you want to make sure @colinhicks agrees on the proposal too.
* docs: klip-52: bytes data type support * update with discussion links * Update klip-52-bytes-data-type-support.md * Update klip-52-bytes-data-type-support.md * Update klip-52-bytes-data-type-support.md * Update klip-52-bytes-data-type-support.md Co-authored-by: Zara Lim <[email protected]>
Description
Proposal for adding BYTES type to ksqlDB.
https://github.com/confluentinc/ksql/blob/fcd52b844b13e82813c1ffa97a9ac9c6b8f1dd87/design-proposals/klip-52-bytes-data-type-support.md
Reviewer checklist