Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: klip-52: bytes data type support #7764
docs: klip-52: bytes data type support #7764
Changes from 2 commits
0a89319
fcd52b8
cc62b42
20ba7ea
883d934
7e6299c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why will not be supported? Doesn't fit on any use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two reasons:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I don't know a use case for fixed bytes yet. But it's ok to not implement it for now. We don't have fixed strings either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ksql>
prompt.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 just saw later in the KLIP that JSON/delimited will use base64. That covers the REST question above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getBytes
on the response object.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some clarifications to the KLIP on how BYTES will be handled by the REST API/Java client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it going to be displayed with single quotes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on this. Maybe not, since it's not a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vote for no wrapping in single quotes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this.
'0x5b43'
may be seen as a String. Mysql, for instance:The
0x61
is converted from string to bytes. Could users also expect this conversion? Could we provide conversion functions instead of doing implicit conversions here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if users are seeing HEX outputs, then its reasonable to expect HEX strings to work as inputs (Postgres does this). That being said, not doing implicit conversions is also reasonable. Let's go forward with no implicit conversions + no casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed implicit conversions from KLIP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about letting users use functions instead of setting configs? It will be easier to identify how the data is displayed by seeing the schema with
describe ...
when using functions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works. The only downside is, the output format will always be a HEX string, though the user can use a conversion function if they want to see another output format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should support functions for different encodings instead of configs. IMHO this is going to let users use different formats in different queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean something like
to_bytes
andfrom_bytes
? Or did you have something else in mind?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed property from KLIP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to specify the output encoding if the result is only bytes?
to_bytes('my name', 'utf8')
should convert UTF-8 String to bytes, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing
encode
function accepts these two parameters, so it can do things like convert a HEX string to UTF-8. Right nowto_bytes
basically doesencode
and then converts the result to bytes. Yeah just having one encoding parameter makes more sense.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated KLIP to remove outputEncoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about the output encoding. Btw, I think we should have similar function names, either
to_bytes/from_bytes
orencode/decode
. I preferto_bytes/from_bytes
as it gives better understanding of the conversion.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed,
from_bytes
makes more sense.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated KLIP with new function name and removed outputEncoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will these functions assume a particular encoding for the bytes, or will the functions all need to take a parameter to specify the encoding? If the latter, I wonder if it's (conceptually) simpler to just let users nest the
from_bytes()
function instead.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All DBs that support bytes/binary types also support functions to use strings or bytes as parameters. Using
from_bytes()
everytime they want to use a string function will have a different behavior, for instance:Say we have raw = [91, 67], then:
len(raw) = 2
, butlen(from_bytes(raw, 'hex')) = 6
becausefrom_bytes(raw, 'hex') = '0x5B43'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, so basically they just operate on raw bytes, so no encoding is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there particular use-cases for using bytes with these functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I can think of, but supporting binary for string functions is common across almost all databases and there are people on Stack Overflow who use them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substring with bytes feels awkward. I would expect the function to be called
slice
. Can we make the existingslice
function that takes a collection argument work on bytes also?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of databases support
substring
for binary/blob, so that might actually be expected.In general, it seems like binary array types are seen as closer to text types than actual arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about functions for converting bytes to other data types such as integers or other types with standard bytes encodings? I see below that
CAST()
will not be supported for this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most other systems don't support this. If needed, the user could cast the int to a string and then call
to_bytes
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a maximum length for BYTES now that we have a max. in protobuf? I assume this limit will be handled by Protobuf, but what would happen when I try to insert 2^32+1 bytes into a protobuf topic? should it fail or truncated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size limit for a string in Protobuf is also 2^32. That's more than 4GB. I suspect that at that size, the ksqlDB code itself would run into problems before Protobuf serialization is a concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Seems kafka maximum message size is 1GB? Is there a maximum size allowed in Connect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if what the maximum message size is, but it's configurable with the default being 1 MB. Same with Connect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is base64 how connect also serializes bytes types?
I wonder if we should have (not on this PR) an encoding method as part of the column type, i.e.
create stream s1 (b bytes encoding 'hex')
. It might even work with strings when we want them utf-16 or utf-8. Just throwing ideas here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For JSON yes. For Delimited, it's configurable, but the default is UTF_8. I'd still rather have them be the same though.
Yeah I think that would be nice to have if the all the SR serializer encodings become configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make our own serializers for this. Anyway, it is a nice suggestion for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for not supporting casting. Instead, users can use conversion functions with the specified format. Easier than handling configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated KLIP to remove support for casting