Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
ARROW-9268: [C++] add string_is{alpnum,alpha...,upper} kernels
Quite a few issues showed up: * utf8proc doesn't store and expose the information if a codepoint is of a Numeric type, thus we cannot implement isdigit/isnumeric (and also isalnum) correctly for the unicode versions. * utf8proc doesn't store and expose extra information about casing (from DerivedCoreProperties.txt ), such as that a Roman letter Ⅿ, is a digit, but also an upper case letter. For isupper, I have added that information (since they are just in a few blocks). For islower, this is quite a list. We could have some way to encoded this, but that could be a maintenance burden. * It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General category Letter Other). This has an effect on isalpha, isprintable, isupper, islower. These issues showed up when writing the Python test to compareagainst CPython. The test lists all the issues and all the codepoints that give issues. I wonder what the best way forward is? Ideally, libutf8proc would implement this. Would this be a reason to not merge this PR, or can we live with this (I'm ok with it as it is, minus the performance). # Performance: ``` IsAlphaAscii_median 13929427 ns 13925663 ns 3 bytes_per_second=1.11116G/s items_per_second=75.2981M/s IsAlphaUnicode_median 35342242 ns 35338347 ns 3 bytes_per_second=448.378M/s items_per_second=29.6725M/s ``` I think the performance is reasonable for ascii, but I think we can do a lookup table for the call to `utf8proc_category(..)`, which should speed up the unicode version. # Naming I've used the `_ascii` and `_unicode` suffixes, instead of `_utf8`, since for semantic, we don't care about the encoding. It should not matter how the string is encoded (utf8/16/32). Closes #7656 from maartenbreddels/ARROW-9268 Lead-authored-by: Maarten A. Breddels <[email protected]> Co-authored-by: Wes McKinney <[email protected]> Signed-off-by: Wes McKinney <[email protected]>
- Loading branch information