
ARROW-9268: [C++] add string_is{alpnum,alpha...,upper} kernels #7656

Closed
wants to merge 28 commits

Conversation

maartenbreddels
Contributor

Quite a few issues showed up:

  • utf8proc doesn't store or expose whether a codepoint has a Numeric type, so we cannot implement isdigit/isnumeric (and therefore isalnum) correctly for the unicode versions.
  • utf8proc doesn't store or expose the extra casing information from DerivedCoreProperties.txt, such as the fact that the Roman numeral Ⅿ is a number but also an uppercase letter. For isupper, I have added that information (the affected codepoints sit in just a few blocks; see the sketch after this list). For islower, the list is quite long. We could find some way to encode it, but that could be a maintenance burden.
  • It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General Category Letter, Other). This affects isalpha, isprintable, isupper, and islower.
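
A minimal sketch of the kind of isupper special-casing meant above. The ranges are illustrative (taken from DerivedCoreProperties.txt), not necessarily the exact table in this PR:

// Some codepoints carry the Uppercase property from DerivedCoreProperties.txt
// without being in General Category Lu. Two such blocks, for illustration:
inline bool IsUppercaseNonLu(char32_t cp) {
  return (cp >= 0x2160 && cp <= 0x216F) ||  // Nl: ROMAN NUMERAL ONE .. ONE THOUSAND (Ⅰ..Ⅿ)
         (cp >= 0x24B6 && cp <= 0x24CF);    // So: CIRCLED LATIN CAPITAL LETTER A..Z
}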

These issues showed up when writing the Python test to compare against CPython. The test lists all the issues and all the codepoints that are affected.

I wonder what the best way forward is? Ideally, libutf8proc would implement this.

Would this be a reason not to merge this PR, or can we live with this? (I'm ok with it as it is, minus the performance.)

Performance:

IsAlphaAscii_median     13929427 ns     13925663 ns            3 bytes_per_second=1.11116G/s items_per_second=75.2981M/s
IsAlphaUnicode_median   35342242 ns     35338347 ns            3 bytes_per_second=448.378M/s items_per_second=29.6725M/s

I think the performance is reasonable for ascii, but we could use a lookup table for the call to utf8proc_category(..), which should speed up the unicode version.
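
For reference, a sketch of what that lookup table could look like (illustrative names, not the PR's code): precompute the utf8proc category for every codepoint once, so the per-character hot path becomes an array load.

#include <utf8proc.h>

#include <cstdint>
#include <vector>

// One byte per codepoint over the whole Unicode range (~1.1 MiB),
// built once at startup.
std::vector<uint8_t> MakeCategoryTable() {
  std::vector<uint8_t> table(0x110000);
  for (utf8proc_int32_t cp = 0; cp < 0x110000; ++cp) {
    table[cp] = static_cast<uint8_t>(utf8proc_category(cp));
  }
  return table;
}

// Hot path: an array load and compare instead of a library call.
inline bool IsAlphaCodepoint(const std::vector<uint8_t>& table, uint32_t cp) {
  switch (table[cp]) {
    case UTF8PROC_CATEGORY_LU:
    case UTF8PROC_CATEGORY_LL:
    case UTF8PROC_CATEGORY_LT:
    case UTF8PROC_CATEGORY_LM:
    case UTF8PROC_CATEGORY_LO:
      return true;
    default:
      return false;
  }
}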

Naming

I've used the _ascii and _unicode suffixes instead of _utf8, since semantically we don't care about the encoding. It should not matter how the string is encoded (utf8/16/32).

@wesm
Member

wesm commented Jul 7, 2020

I haven't looked at the details yet, but could we use a normalized naming convention for the functions?

  • ascii_$FUNC, for functions that expect ASCII input
  • utf8_$FUNC, for functions that expect UTF8 input
  • binary_$FUNC, for functions that work on any binary data

@maartenbreddels
Contributor Author

I wanted to focus on the other issues first, but I propose this naming scheme instead:
<arrow type>_<func>_<variant>, since all you care about is that the function operates on a string, and all the ascii variants work with all utf8 strings. Also, I don't know if there will be utf16/utf32 support in the future; in that case, this naming scheme makes more sense. I'm ok with changing it back, or opening a PR to make this consistent (string_lower_ascii, string_lower_unicode, ... etc).

@wesm
Member

wesm commented Jul 7, 2020

OK, I would also like to solicit @xhochy's and @pitrou's preferences in the function naming. I don't want to spend too much time on it, but as long as we have a consistent naming scheme that accommodates the anticipated use cases, that sounds fine to me.

@xhochy
Member

xhochy commented Jul 7, 2020

I'm indifferent to utf8_ vs string_ (sometimes our codebase is too), but other than that I fully agree with @wesm's naming scheme.

@wesm
Member

wesm commented Jul 7, 2020

I think that having e.g. string_lower_utf8 and string_lower_ascii is fine, too.

@pitrou
Member

pitrou commented Jul 7, 2020

We should be consistent. We already have "utf8_lower" and "utf8_upper".

@maartenbreddels
Contributor Author

Are there plans to support utf16/32? Seeing the code as it is now, it would be trivial to add.

@xhochy
Member

xhochy commented Jul 7, 2020

Are there plans to support utf16/32? Seeing the code as it is now, it would be trivial to add.

This would be a longer discussion. Personally I would vote "no" to limit the scope / dimensionality of Arrow's memory model for now.

@maartenbreddels
Contributor Author

If it is not off the table, I propose at least the utf8 -> unicode renaming, to be future-compatible.

@pitrou
Member

pitrou commented Jul 7, 2020

"utf8" is used everywhere, please keep it like this.

@pitrou
Member

pitrou commented Jul 7, 2020

Are there plans to support utf16/32?

I think that should be out of scope. There's no Arrow data type for utf16 and utf32.

@maartenbreddels
Contributor Author

I am not talking about the C++ code, only the kernel name. UTF8 is just the encoding; Unicode refers to the semantics of the kernel. But if utf16/32 is excluded, I guess we can keep the old naming scheme.

@pitrou
Member

pitrou commented Jul 7, 2020

I am talking about the kernel name as well. "unicode" is non-descriptive.

@stevengj

stevengj commented Jul 8, 2020

It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General category Letter Other).

U+08BE was defined in Unicode 13, and category Lo is correct for that character. It sounds like you may be looking at obsolete Unicode tables?

utf8proc doesn't store and expose the information if a codepoint is of a Numeric type

Can't you use the Unicode category (N*) for this? That's what Julia does.
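
For reference, that suggestion maps onto utf8proc's category enum roughly like this (a sketch, not the PR's code):

#include <utf8proc.h>

// Numeric per the general category: Nd (decimal digit),
// Nl (letter number, e.g. Roman numerals), No (other number).
inline bool IsNumericByCategory(utf8proc_int32_t cp) {
  switch (utf8proc_category(cp)) {
    case UTF8PROC_CATEGORY_ND:
    case UTF8PROC_CATEGORY_NL:
    case UTF8PROC_CATEGORY_NO:
      return true;
    default:
      return false;
  }
}

Note that this still misses codepoints whose category is Lo but which carry a Numeric_Type (such as 柒, discussed further down), so it does not fully match CPython's str.isnumeric().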

@@ -183,6 +188,53 @@ struct UTF8Transform {
}
};

template <typename StringType, typename Derived>
struct BinaryToBoolean {
Member

Do you think you could also make BinaryContainsExact use this struct?

Contributor Author

Good that you ask: would you say my approach is more or less readable? I prefer the CRTP pattern, but it needs some changes to suit your needs (it has to use non-statics, since it needs to store the pattern/table). If you prefer my approach, I can refactor your code when I start on startswith or the regex version of your kernel.

Member

It's on the same level of readability. There is the complication that BinaryContainsExact needs options, but that will be the same with startswith or the regex version (or the case-insensitive one).
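
For context, the CRTP shape under discussion looks roughly like this (a simplified sketch with illustrative names; Arrow's real kernels also handle nulls, offsets, and output bitmaps):

#include <string_view>

template <typename Derived>
struct BinaryToBoolean {
  // The base drives the loop; Derived supplies the per-value predicate.
  // No virtual dispatch: Derived::Predicate resolves at compile time.
  template <typename InputIt, typename OutputIt>
  static void Exec(InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) {
      *out++ = Derived::Predicate(*first);
    }
  }
};

struct IsAlphaAscii : BinaryToBoolean<IsAlphaAscii> {
  static bool Predicate(std::string_view s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
      if (!((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))) return false;
    }
    return true;
  }
};

A stateful variant (e.g. BinaryContainsExact, which must store its pattern) would make Predicate a non-static member, as mentioned above.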

TYPED_TEST(TestStringKernels, IsAlphaNumericUnicode) {
// U+08BE (utf8: \xE0\xA2\xBE) is undefined, but utf8proc thinks it is
// UTF8PROC_CATEGORY_LO
this->CheckUnary("string_isalnum_unicode", "[\"ⱭɽⱤoW123\", null, \"Ɑ2\", \"!\", \"\"]",
Member

I would expect tests with non-alphanumeric characters, e.g. do we get true or false for "some text"?

Contributor Author

added

@@ -95,6 +95,56 @@ def test_binary_contains_exact():
assert expected.equals(result)


# utf8proc claims 0x8be is UTF8PROC_CATEGORY_LO
utf8proc_issue_isalpha = {0x8be, 0x8bf, 0x8c0, 0x8c1, 0x8c2, 0x8c3, 0x8c4, 0x8c5, 0x8c6, 0x8c7, 0xd04, 0xe86, 0xe89, 0xe8c, 0xe8e, 0xe8f, 0xe90, 0xe91, 0xe92, 0xe93, 0xe98, 0xea0, 0xea8, 0xea9, 0xeac, 0x1cf2, 0x1cf3, 0x1cfa, 0x31bb, 0x31bc, 0x31bd, 0x31be, 0x31bf, 0x4db6, 0x4db7, 0x4db8, 0x4db9, 0x4dba, 0x4dbb, 0x4dbc, 0x4dbd, 0x4dbe, 0x4dbf, 0x9ff0, 0x9ff1, 0x9ff2, 0x9ff3, 0x9ff4, 0x9ff5, 0x9ff6, 0x9ff7, 0x9ff8, 0x9ff9, 0x9ffa, 0x9ffb, 0x9ffc, 0xa7ba, 0xa7bb, 0xa7bc, 0xa7bd, 0xa7be, 0xa7bf, 0xa7c2, 0xa7c3, 0xa7c4, 0xa7c5, 0xa7c6, 0xa7c7, 0xa7c8, 0xa7c9, 0xa7ca, 0xa7f5, 0xa7f6, 0xab66, 0xab67, 0xab68, 0xab69, 0x10e80, 0x10e81, 0x10e82, 0x10e83, 0x10e84, 0x10e85, 0x10e86, 0x10e87, 0x10e88, 0x10e89,
Member

Hmm... what is the point of this exactly? We don't want to test or document utf8proc's behaviour exhaustively.

Contributor Author

I've just pushed changes on this. We'd like to see how the Python and Arrow implementations match since this is going to be used as an alternative for the Python implementation. It runs really fast, so performance is not an issue.

variant = 'ascii' if ascii else 'unicode'
arrow_name = f'string_{function_name}_{variant}'
py_name = function_name
for i in range(128 if ascii else 0x11000):
Member

We really don't want to test the entire unicode range in pure Python.
Just do a couple examples like in C++, IMHO (possibly different examples).

Also, I would expect some tests of the behaviour of "ascii" functions on non-Ascii text.

Contributor Author

It runs fast, so performance is not an issue. As it is now, all non-ascii values are passed through. We didn't formally agree on this, but it is how we started (with ascii_lower), and it seems like a good idea.

Member

As it is now, all non-ascii values are 'pass through'

Is it tested for somewhere?

Contributor Author

Yes, many of the C++ ascii tests pass in utf8 data to test if it simply gets passed through.

@pitrou
Member

pitrou commented Jul 8, 2020

As for the performance, I think 400 MB/s is perfectly acceptable. I don't know in which kind of situation these functions would be performance-critical.

@maartenbreddels
Contributor Author

U+08BE was defined in Unicode 13, and category Lo is correct for that character. It sounds like you may be looking at obsolete Unicode tables?

Thanks for that. As I replied in the utf8proc issue, I didn't expect the Unicode data to change that fast (I guess Python 3.7 doesn't support Unicode 13; that information is actually difficult to find).

Can't you use the Unicode category (N*) for this? That's what Julia does.

That's how I implemented it now. For instance, https://graphemica.com/%E6%9F%92 has a numeric value of 7 (it's an example from the Unicode spec v13, section 4.6: http://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf). Python lists it as numeric (assert '柒'.isnumeric() == True), but its General Category is 'Other Letter'.

I didn't open an issue, because I'm not sure where this information is, I have difficulty mapping between the spec, what Python does and the Unicode data files.

And to be honest, I don't fully understand it, and it's a small list:
㐅, 㒃, 㠪, 㭍, 一, 七, 万, 三, 九, 二, 五, 亖, 亿, 什, 仟, 仨, 伍, 佰, 億, 兆, 兩, 八, 六, 十, 千, 卄, 卅, 卌, 叁, 参, 參, 叄, 四, 壱, 壹, 幺, 廾, 廿, 弌, 弍, 弎, 弐, 拾, 捌, 柒, 漆, 玖, 百, 肆, 萬, 貮, 貳, 贰, 阡, 陆, 陌, 陸, 零, 參, 拾, 兩, 零, 六, 陸, 什,
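
If we wanted to match CPython for these, one option would be a small hand-maintained set (a sketch; only a few of the codepoints above are shown):

#include <unordered_set>

// Lo codepoints that carry a Numeric_Type, which utf8proc's category
// alone cannot identify. The full list would come from the Unicode
// data files (DerivedNumericType.txt).
inline bool IsNumericIdeograph(char32_t cp) {
  static const std::unordered_set<char32_t> kNumericLo = {
      0x4E00,  // 一 (one)
      0x4E8C,  // 二 (two)
      0x4E09,  // 三 (three)
      0x67D2,  // 柒 (seven, financial form)
  };
  return kNumericLo.count(cp) != 0;
}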

@pitrou
Member

pitrou commented Jul 8, 2020

The Python unicodedata module gives you the Unicode spec version:

Python 3.7.5 (default, Nov  7 2019, 10:50:52) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.unidata_version
'11.0.0'
Python 3.8.0+ (heads/3.8:fe934e1d03, Nov  2 2019, 16:21:38) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.unidata_version
'12.1.0'

@pitrou
Member

pitrou commented Jul 8, 2020

Also, for the record, the Python str methods are implemented in Objects/unicodeobject.c, for example:
https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L12051-L12094

@maartenbreddels
Contributor Author

Thanks, I didn't know about unicodedata.unidata_version. Yeah, I've taken a look at that, because sometimes the code is clearer than the documentation.

@pitrou
Member

pitrou commented Jul 8, 2020

Also, the predicates are defined here:
https://github.com/python/cpython/blob/master/Objects/unicodectype.c


# Python claims these are not alpha, not sure why; they are in
# gc='Other Letter': https://graphemica.com/%E1%B3%B2
unknown_issue_isalpha = {0x1cf2, 0x1cf3}
Contributor Author

Puzzled by this:

unicodedata.category(chr(0x1cf2)) == 'Mc'

Seems like a CPython bug?

@wesm
Member

wesm commented Jul 12, 2020

The ASAN/UBSAN test failure is caused by outdated utf8proc (Ubuntu 18.04 has 2.1.0 which has test failures), so I'm switching it to use the BUNDLED version.

I'll refactor the usages of BinaryToBoolean to not use CRTP and confirm that there is no performance difference.

@wesm
Member

wesm commented Jul 12, 2020

I simplified many of the templates to not require an Arrow type parameter and to do less inlining (which doesn't seem to make much of a difference performance-wise, and it inflates binary sizes).

I made a couple of other tweaks to the BinaryToBoolean helper; benchmarks are within tolerance except IsAlphaNumericUnicode, which seems to have gotten ~20% faster:

               benchmark         baseline        contender  change %             counters
3  IsAlphaNumericUnicode  639.335 MiB/sec  785.883 MiB/sec    22.922   {'iterations': 29}
0              Utf8Lower  447.866 MiB/sec  467.099 MiB/sec     4.294   {'iterations': 20}
5              Utf8Upper  460.699 MiB/sec  467.486 MiB/sec     1.473   {'iterations': 20}
1    BinaryContainsExact  581.487 MiB/sec  564.348 MiB/sec    -2.947   {'iterations': 26}
2    IsAlphaNumericAscii  904.964 MiB/sec  876.430 MiB/sec    -3.153   {'iterations': 42}
6             AsciiLower   10.095 GiB/sec    9.711 GiB/sec    -3.802  {'iterations': 440}
4             AsciiUpper    9.972 GiB/sec    9.453 GiB/sec    -5.206  {'iterations': 443}

@wesm
Member

wesm commented Jul 12, 2020

ah I think I figured out the mystery of the failed compilation -- it's caused by the missing utf8proc.

@wesm
Member

wesm commented Jul 12, 2020

+1. Here are the benchmarks with gcc-8, comparing the last commit with 3739e66, which is right before I started making changes:

               benchmark         baseline        contender  change %             counters
6  IsAlphaNumericUnicode  644.115 MiB/sec  829.133 MiB/sec    28.724   {'iterations': 29}
2              Utf8Lower  453.035 MiB/sec  478.970 MiB/sec     5.725   {'iterations': 20}
4              Utf8Upper  462.962 MiB/sec  488.924 MiB/sec     5.608   {'iterations': 21}
3             AsciiUpper    9.913 GiB/sec   10.098 GiB/sec     1.864  {'iterations': 446}
0             AsciiLower    9.956 GiB/sec    9.951 GiB/sec    -0.050  {'iterations': 435}
5    BinaryContainsExact  582.241 MiB/sec  570.008 MiB/sec    -2.101   {'iterations': 26}
1    IsAlphaNumericAscii  977.104 MiB/sec  920.929 MiB/sec    -5.749   {'iterations': 43}

I think +/- 5% is basically noise on these, so merging. Thanks @maartenbreddels!

@maartenbreddels
Contributor Author

ah I think I figured out the mystery of the failed compilation -- it's caused by the missing utf8proc.

Of course, now the error msg makes sense. Thanks for finishing this!

FYI: for me the benchmarks are about the same (maybe ~5% slower for IsAlphaNumericUnicode; AMD Ryzen / gcc 9.3).

@maartenbreddels deleted the ARROW-9268 branch July 13, 2020 10:30
@pitrou
Member

pitrou commented Jul 13, 2020

Ah, I hadn't noticed this was merged. This will need documenting in docs/source/cpp/compute.rst.

@wesm
Member

wesm commented Jul 13, 2020

Yes, indeed.
