Combine _utf8 and _binary kernels #2969

tustvold · 2022-10-28T19:28:01Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

With #2947 we can now write kernels that are generic over both byte arrays and string arrays. We have a large number of kernels with duplicate implementations for both, e.g. gt_eq_dyn_binary and gt_eq_dyn_utf8.

Describe the solution you'd like

We should create a new unified kernel, e.g. gt_eq_dyn_bytes, and make the specialized kernels just call through to this.
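A minimal sketch of the call-through pattern, using only the standard library. It does not use the arrow-rs array types or reproduce the API added in #2947: the `AsByteSlice` trait, the slice-of-`Option` inputs, and the names `gt_eq_bytes`, `gt_eq_utf8` and `gt_eq_binary` are all hypothetical stand-ins chosen for illustration.

```rust
/// Hypothetical trait: any value type that can be viewed as a byte slice.
pub trait AsByteSlice {
    fn as_byte_slice(&self) -> &[u8];
}

impl AsByteSlice for str {
    fn as_byte_slice(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl AsByteSlice for [u8] {
    fn as_byte_slice(&self) -> &[u8] {
        self
    }
}

/// Unified kernel (hypothetical): element-wise `left >= right` using
/// byte-wise lexicographic ordering. A null on either side yields a null.
pub fn gt_eq_bytes<T: AsByteSlice + ?Sized>(
    left: &[Option<&T>],
    right: &[Option<&T>],
) -> Result<Vec<Option<bool>>, String> {
    if left.len() != right.len() {
        return Err("cannot compare inputs of different lengths".to_string());
    }
    Ok(left
        .iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(l.as_byte_slice() >= r.as_byte_slice()),
            _ => None,
        })
        .collect())
}

/// Specialized string kernel: just calls through to the unified one.
pub fn gt_eq_utf8(
    left: &[Option<&str>],
    right: &[Option<&str>],
) -> Result<Vec<Option<bool>>, String> {
    gt_eq_bytes(left, right)
}

/// Specialized binary kernel: just calls through to the unified one.
pub fn gt_eq_binary(
    left: &[Option<&[u8]>],
    right: &[Option<&[u8]>],
) -> Result<Vec<Option<bool>>, String> {
    gt_eq_bytes(left, right)
}
```

The comparison logic lives in one place, and the two specialized functions keep their existing signatures so callers are unaffected.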

Describe alternatives you've considered

Additional context


alippai · 2022-11-16T20:07:38Z

I couldn't check the code yet, but UTF-8 comparison is different from byte comparison because of the normalization (or the lack of it), right? Also gt / lt is locale specific?

tustvold · 2022-11-16T21:10:09Z

We don't provide locale-aware string comparison, in part because there isn't Rust ecosystem support for it. We solely provide byte-based ordering, the same as the standard Ord implementation for str.
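To illustrate what byte-based ordering means in practice, here is a small standard-library sketch. The helper names `cmp_as_bytes` and `sort_bytewise` are hypothetical and are not part of arrow-rs.

```rust
use std::cmp::Ordering;

/// Compare two strings by their UTF-8 bytes. This gives the same result
/// as `str`'s own `Ord`, which is also byte-wise lexicographic.
pub fn cmp_as_bytes(a: &str, b: &str) -> Ordering {
    a.as_bytes().cmp(b.as_bytes())
}

/// Sort strings by byte value, with no locale or case awareness.
pub fn sort_bytewise(values: &mut [&str]) {
    values.sort_unstable_by(|a, b| cmp_as_bytes(a, b));
}
```

Under this ordering "Zebra" sorts before "apple" (0x5A is less than 0x61), and "éclair" sorts after "zebra" because the first byte of "é" in UTF-8 is 0xC3. A locale-aware collation would typically order these differently.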

alippai · 2022-11-17T03:40:51Z

I agree that the locale based sorting can be out of scope for now. What do you think regarding the normalization?

alippai · 2022-11-17T03:48:34Z

Btw I didn't find native rust locale tools a year ago, but this now looks ok?! https://github.com/unicode-org/icu4x

tustvold · 2022-11-17T15:57:42Z

Much like locale aware sorting, the same is true of normalization. There isn't mature ecosystem support, yet, nor a motivated contributor, and so we don't currently support it.

Is there a particular use-case that motivates your asking about this? I was under the perhaps naive impression that most DBs were moving away from locale aware string handling - postgres supports it but specifically advises against using it as it dramatically hurts performance, not to mention all the normal reproducibility pain inherent to locales...

alippai · 2022-11-18T02:10:43Z

For the normalization: I had issues before and I remembered Utf8 is not simply a byte array.

For localization: similarly, I speak several languages and I was surprised by the "assumption" that byte order is always the same.

I wasn't sure whether these were considered and skipped, or whether they didn't come up at all. I agree it's a big chunk of work and the performance is always worse. It's not essential, I just wanted to raise it: if you are making design decisions, these questions can help you make an informed decision instead of a lucky one :)

I'm all good with proceeding, nothing actionable from my side

tustvold · 2022-11-18T07:31:59Z

Thanks, yeah I thought you were referring to unicode normalisation, which is its own wondrous thing as there are redundant codings for the same text. As you say, not all byte arrays are valid UTF-8, so we must, and do, validate this at construction time.
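Both points can be shown with the standard library alone. This is a sketch, not arrow-rs code: the constants and the `same_bytes` and `validate` helpers are hypothetical names used for illustration.

```rust
use std::str::Utf8Error;

/// "é" as a single precomposed code point (U+00E9).
pub const PRECOMPOSED: &str = "\u{00e9}";
/// "é" as "e" followed by a combining acute accent (U+0301).
pub const DECOMPOSED: &str = "e\u{0301}";

/// Byte-wise equality, with no Unicode normalisation applied.
pub fn same_bytes(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

/// Check that a byte slice is valid UTF-8 before treating it as a string.
pub fn validate(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}
```

The two constants render as the same text but differ in their bytes, so a byte-based kernel treats them as different values. Separately, `validate(&[0xff, 0xfe])` returns an error because those bytes are not valid UTF-8.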


tustvold added the good first issue, enhancement, and help wanted labels Oct 28, 2022

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 28, 2022

Combine take_utf8 and take_binary (apache#2969)

abe8ba5

tustvold mentioned this issue Oct 28, 2022

Combine take_utf8 and take_binary (#2969) #2970

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 28, 2022

Generalize filter byte array (apache#2969)

3c15bd8

tustvold mentioned this issue Oct 28, 2022

Specialize filter kernel for binary arrays (#2969) #2971

Merged

tustvold added a commit that referenced this issue Oct 28, 2022

Combine take_utf8 and take_binary (#2969) (#2970)

dbe518c

tustvold mentioned this issue Oct 29, 2022

Specialize interleave for byte arrays (#2864) #2975

Merged

tustvold added a commit that referenced this issue Nov 1, 2022

Specialize filter kernel for binary arrays (#2969) (#2971)

62e878e

* Generalize filter byte array (#2969)
* Fix doc
* Update comment

tustvold added a commit to tustvold/arrow-rs that referenced this issue Nov 15, 2022

Add GenericByteBuilder (apache#2969)

75a044d

tustvold mentioned this issue Nov 15, 2022

Add GenericByteBuilder (#2969) #3122

Merged

tustvold added a commit that referenced this issue Nov 21, 2022

Add GenericByteBuilder (#2969) (#3122)

b3dbe70

* Add GenericByteBuilder (#2969)
* RAT
* Apply suggestions from code review

Co-authored-by: Liang-Chi Hsieh

jackwener mentioned this issue Jun 1, 2023

Combine _utf8 and _binary kernels #4334

Closed

tustvold closed this as not planned Jun 1, 2023
