-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine _utf8 and _binary kernels #2969
Comments
I couldn't check the code yet, but UTF-8 comparison is different from byte comparison because of the normalization (or the lack of it), right? Also gt / lt is locale specific? |
We don't provide locale aware string comparison, in part because there isn't Rust ecosystem support for it. We solely provide byte-based ordering, same as the standard Ord |
Btw I didn't find native rust locale tools a year ago, but this now looks ok?! https://github.com/unicode-org/icu4x |
Much like locale aware sorting, the same is true of normalization. There isn't mature ecosystem support, yet, nor a motivated contributor, and so we don't currently support it. Is there a particular use-case that motivates your asking about this? I was under the perhaps naive impression that most DBs were moving away from locale aware string handling - postgres supports it but specifically advises against using it as it dramatically hurts performance, not to mention all the normal reproducibility pain inherent to locales... |
For the normalization: I had issues before and I remembered Utf8 is not simply a byte array. For localization: similar, I'm speaking several languages and I was surprised by the "assumption" that byte order is always the same. I wasn't sure they were considered and skipped or this didn't come up at all. I agree it's a big chunk of work and the performance is always worse. It's not essential, just wanted to raise if you are making design decisions, these questions can help making an informed decision instead of a lucky one :) I'm all good with proceeding, nothing actionable from my side |
Thanks, yeah I thought you were referring to unicode normalisation, which is its own wondrous thing as there are redundant codings for the same text. As you say not all byte arrays are valid UTF-8 we must and do perform validation of this at construction time. |
* Add GenericByteBuilder (#2969) * RAT * Apply suggestions from code review Co-authored-by: Liang-Chi Hsieh <[email protected]> Co-authored-by: Liang-Chi Hsieh <[email protected]>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
With #2947 we can now write kernels that are generic over both byte arrays and string arrays. We have a large number of kernels that with duplicate implementations for both, e.g. gt_eq_dyn_binary and gt_eq_dyn_utf8.
Describe the solution you'd like
We should create a new unified kernel, e.g. gt_eq_dyn_bytes, and make the specialized kernels just call through to this.
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: