-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adds support for 0x, X'...', x'...' type hex strings in udf:encode #6118
Conversation
It looks like @cprasad1 hasn't signed our Contributor License Agreement, yet.
You can read and sign our full Contributor License Agreement here. Once you've signed reply with Appreciation of efforts, clabot |
[clabot:check] |
@confluentinc It looks like @cprasad1 just signed our Contributor License Agreement. 👍 Always at your service, clabot |
@@ -1,3 +1,4 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an accidental change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this change from the PR please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @cprasad1 ! Congratz on your first PR!
Before merging we need to make sure that both all checks are green. We are still waiting for jenkins but there seems to be something wrong with the PR title as well (though it's not obvious to me what).
@@ -41,6 +41,15 @@ public void shouldEncodeHexToAscii() { | |||
assertThat(udf.encode("31202b2031203d2031", "hex", "ascii"), is("1 + 1 = 1")); | |||
assertThat(udf.encode("ce95cebbcebbceacceb4ceb1", "hex", "ascii"), is("������������")); | |||
assertThat(udf.encode("c39c6265726d656e736368", "hex", "ascii"), is("��bermensch")); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the string is 0x0x
? Or 0X
? Or X'
without a closing quote? Are there any characters that are not allowed with these formats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming that you are worried about illegal hex literals (eg: 'p', 'x', '*' etc) in these formats. The function would throw an exception and we would get nulls in the end which was the default behavior before this was implemented. That is because this feature piggybacks on the pre-existing hex decoders that existed before these changes. I am adding a few more tests to demonstrate these corrupted formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I noticed that the pre-existing encode udf couldn't deal with Hex of odd lengths (It would throw an exception). Now, it supports odd lengths by padding a "0"
to the left in the 0x...
format but would throw an exception with X'...'
and x'...'
formats which is consistent with https://dev.mysql.com/doc/refman/8.0/en/hexadecimal-literals.html. But I am not sure if we should support hex of odd lengths before without any of these special characters around them. It would just throw an exception like it used to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more tests to demonstrate that behavior
Not sure if this is part of a larger set of changes, but with this UDF, you already give the hint of what the string is encoded as, e.g. encode("4578616d706C6521", 'hex', 'ascii'). I thought that supporting a literal means that anywhere that you can use a normal ASCII string today, you could instead write 0x4578616d706C6521, without the need to explicitly call encode since the hint is in the "0x." I would assume that the internal representation is UTF8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Accidentally closed this. @agavra do we want to support the Hex literals via the |
I think this PR is a valuable improvement to the UDF, but the ticket (#5544) specifically refers to the ability to (as @AlanConfluent thought) write |
My bad, when I asked @agavra about the ticket he pointed me to the |
I haven't reviewed the PR, but I agree that the feature is valuable! Green light from my perspective (I trust your code review @vpapavas 😉 ). |
Yes, that makes sense. The same logic can potentially just be applied in a different place and it's fine to have that as part of the UDF as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
updated the PR description. Will merge as soon as checks are completed |
Description
Earlier, the encode udf would not recognize hex strings of types belong to {0x, X'...', x'...'} while converting to ascii, utf8 and base 64 and would give nulls. Now it recognizes them and converts them accordingly from hex to other formats. Behavior is similar to https://dev.mysql.com/doc/refman/8.0/en/hexadecimal-literals.html
Testing done
Unit + query-validation-tests + manual
Reviewer checklist