The `substring` kernel panics when chars > U+007F #1478
Comments
@alamb please review. Thank you!
Hi @HaoYang670 -- I suggest
I am not sure if the code already does 3 or not. The challenge with proper Unicode support, from my perspective, is that it will likely be slower and require a new dependency (to identify the Unicode graphemes). There is a Unicode-aware implementation of `substr` in the datafusion repo, I believe contributed by @ovr. Another possibility is to add an optional feature flag to arrow-rs for "unicode" string support and base the behavior on that flag. But that sounds a little overcomplicated.
In the
Being able to calculate Unicode characters / graphemes without bringing in a new dependency would be great.
Great, I will create an issue!
Describe the bug
The `substring` kernel only works on chars that are encoded as 1 byte in UTF-8. If the string contains a char that requires more than 1 byte, the function will panic.
To Reproduce
Steps to reproduce the behavior:
Give a string `"E=mc²"`, start index = `-1`, length = `None`. The expected result is `"²"`.
However, I got:
The reason is that the char `²` is encoded as `0xC2 0xB2` in UTF-8. When we try to get the last char in the string, what we really get is the byte sequence `[0xB2]`, which is invalid UTF-8.
Expected behavior
I think there are three ways to fix the bug:
1. (easy) Update the doc of the `substring` function to explain that we only support 1-byte UTF-8 chars. Also explain that `start` and `length` are counted in bytes.
2. (a little difficult) Check in the `substring` function that the string array only contains 1-byte UTF-8 chars (the highest-order bit of every byte is 0).
3. (difficult, and the API will be changed) Slice based on characters, not bytes.
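Options 2 and 3 can be sketched with the standard library alone. This is only an illustration on plain `&str` values, not the arrow-rs kernel: the names `all_ascii` and `substring_by_chars` are hypothetical, the negative-`start` handling is assumed from the example above, and "character" here means a Unicode scalar value (`char`), not a grapheme cluster.

```rust
/// Option 2 (hypothetical helper): true only if every string is pure ASCII,
/// i.e. every byte has its highest-order bit clear, so byte offsets and
/// character offsets coincide.
fn all_ascii(values: &[&str]) -> bool {
    values.iter().all(|s| s.is_ascii())
}

/// Option 3 (hypothetical helper): substring counted in `char`s, not bytes.
/// Assumption: a negative `start` counts back from the end of the string,
/// and `length = None` means "up to the end of the string".
fn substring_by_chars(s: &str, start: i64, length: Option<u64>) -> &str {
    let char_count = s.chars().count() as i64;

    // Resolve `start` to a char index clamped to [0, char_count].
    let resolved = if start >= 0 {
        start.min(char_count)
    } else {
        (char_count + start).max(0)
    };
    let start_char = resolved as usize;

    // Map a char index to the byte offset where that char begins;
    // an index at or past the end maps to `s.len()`.
    let byte_offset = |char_idx: usize| {
        s.char_indices()
            .nth(char_idx)
            .map_or(s.len(), |(byte_idx, _)| byte_idx)
    };

    let begin = byte_offset(start_char);
    let end = match length {
        Some(len) => byte_offset(start_char.saturating_add(len as usize)),
        None => s.len(),
    };

    // Both offsets fall on char boundaries, so this slice cannot panic.
    &s[begin..end]
}
```

With these helpers, `substring_by_chars("E=mc²", -1, None)` yields `"²"`, while `all_ascii(&["E=mc²"])` is `false`, which is the condition option 2 would have to detect before falling back to byte offsets. Walking `char_indices` makes each lookup linear in the string length, which is in line with the concern that a Unicode-aware path will likely be slower.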