Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine _utf8 and _binary kernels #2969

Closed
tustvold opened this issue Oct 28, 2022 · 7 comments
Closed

Combine _utf8 and _binary kernels #2969

tustvold opened this issue Oct 28, 2022 · 7 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

With #2947 we can now write kernels that are generic over both byte arrays and string arrays. We have a large number of kernels that with duplicate implementations for both, e.g. gt_eq_dyn_binary and gt_eq_dyn_utf8.

Describe the solution you'd like

We should create a new unified kernel, e.g. gt_eq_dyn_bytes, and make the specialized kernels just call through to this.

Describe alternatives you've considered

Additional context

@tustvold tustvold added good first issue Good for newcomers enhancement Any new improvement worthy of a entry in the changelog help wanted labels Oct 28, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 28, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 28, 2022
tustvold added a commit that referenced this issue Nov 1, 2022
* Generalize filter byte array (#2969)

* Fix doc

* Update comment
tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 15, 2022
@alippai
Copy link
Contributor

alippai commented Nov 16, 2022

I couldn't check the code yet, but UTF-8 comparison is different from byte comparison because of the normalization (or the lack of it), right? Also gt / lt is locale specific?

@tustvold
Copy link
Contributor Author

We don't provide locale aware string comparison, in part because there isn't Rust ecosystem support for it. We solely provide byte-based ordering, same as the standard Ord

@alippai
Copy link
Contributor

alippai commented Nov 17, 2022

I agree that the locale based sorting can be out of scope for now. What do you think regarding the normalization?
image

@alippai
Copy link
Contributor

alippai commented Nov 17, 2022

Btw I didn't find native rust locale tools a year ago, but this now looks ok?! https://github.com/unicode-org/icu4x

@tustvold
Copy link
Contributor Author

Much like locale aware sorting, the same is true of normalization. There isn't mature ecosystem support, yet, nor a motivated contributor, and so we don't currently support it.

Is there a particular use-case that motivates your asking about this? I was under the perhaps naive impression that most DBs were moving away from locale aware string handling - postgres supports it but specifically advises against using it as it dramatically hurts performance, not to mention all the normal reproducibility pain inherent to locales...

@alippai
Copy link
Contributor

alippai commented Nov 18, 2022

For the normalization: I had issues before and I remembered Utf8 is not simply a byte array.

For localization: similar, I'm speaking several languages and I was surprised by the "assumption" that byte order is always the same.

I wasn't sure they were considered and skipped or this didn't come up at all. I agree it's a big chunk of work and the performance is always worse. It's not essential, just wanted to raise if you are making design decisions, these questions can help making an informed decision instead of a lucky one :)

I'm all good with proceeding, nothing actionable from my side

@tustvold
Copy link
Contributor Author

Thanks, yeah I thought you were referring to unicode normalisation, which is its own wondrous thing as there are redundant codings for the same text. As you say not all byte arrays are valid UTF-8 we must and do perform validation of this at construction time.

tustvold added a commit that referenced this issue Nov 21, 2022
* Add GenericByteBuilder (#2969)

* RAT

* Apply suggestions from code review

Co-authored-by: Liang-Chi Hsieh <[email protected]>

Co-authored-by: Liang-Chi Hsieh <[email protected]>
@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Jun 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted
Projects
None yet
Development

No branches or pull requests

2 participants