Skip to content

Commit

Permalink
ARROW-9268: [C++] add string_is{alpnum,alpha...,upper} kernels
Browse files Browse the repository at this point in the history
Quite a few issues showed up:
 * utf8proc doesn't store and expose the information if a codepoint is of a Numeric type, thus we cannot implement isdigit/isnumeric (and also isalnum) correctly for the unicode versions.
 * utf8proc doesn't store and expose extra information about casing (from DerivedCoreProperties.txt ), such as that a Roman letter Ⅿ, is a digit, but also an upper case letter. For isupper, I have added that information (since they are just in a few blocks). For islower, this is quite a list. We could have some way to encoded this, but that could be a maintenance burden.
 * It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General category Letter Other). This has an effect on isalpha, isprintable, isupper, islower.

These issues showed up when writing the Python test to compareagainst CPython. The test lists all the issues and all the codepoints that give issues.

I wonder what the best way forward is? Ideally, libutf8proc would implement this.

Would this be a reason to not merge this PR, or can we live with this (I'm ok with it as it is, minus the performance).

# Performance:
```
IsAlphaAscii_median     13929427 ns     13925663 ns            3 bytes_per_second=1.11116G/s items_per_second=75.2981M/s
IsAlphaUnicode_median   35342242 ns     35338347 ns            3 bytes_per_second=448.378M/s items_per_second=29.6725M/s
```

I think the performance is reasonable for ascii, but I think we can do a lookup table for the call to `utf8proc_category(..)`, which should speed up the unicode version.

# Naming
I've used the `_ascii` and `_unicode` suffixes, instead of `_utf8`, since for semantic, we don't care about the encoding. It should not matter how the string is encoded (utf8/16/32).

Closes #7656 from maartenbreddels/ARROW-9268

Lead-authored-by: Maarten A. Breddels <[email protected]>
Co-authored-by: Wes McKinney <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
  • Loading branch information
maartenbreddels and wesm committed Jul 12, 2020
1 parent 3e940dc commit a5914d5
Show file tree
Hide file tree
Showing 7 changed files with 826 additions and 3 deletions.
Loading

0 comments on commit a5914d5

Please sign in to comment.