
ARROW-9268: [C++] add string_is{alpnum,alpha...,upper} kernels #7656

Closed
wants to merge 28 commits

Conversation

maartenbreddels
Contributor

Quite a few issues showed up:

  • utf8proc doesn't store or expose whether a codepoint has a Numeric type, so we cannot implement isdigit/isnumeric (and therefore isalnum) correctly for the unicode versions.
  • utf8proc doesn't store or expose the extra casing information from DerivedCoreProperties.txt, such as the fact that the Roman numeral Ⅿ is a number but also an uppercase letter. For isupper, I have added that information (the affected codepoints sit in just a few blocks; see the sketch after this list). For islower, the list is quite long. We could find some way to encode it, but that could be a maintenance burden.
  • It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General Category Letter, Other). This affects isalpha, isprintable, isupper, and islower.
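
A minimal sketch of the kind of isupper special-casing meant above. The ranges are illustrative (taken from DerivedCoreProperties.txt), not necessarily the exact table in this PR:

// Some codepoints carry the Uppercase property from DerivedCoreProperties.txt
// without being in General Category Lu. Two such blocks, for illustration:
inline bool IsUppercaseNonLu(char32_t cp) {
  return (cp >= 0x2160 && cp <= 0x216F) ||  // Nl: ROMAN NUMERAL ONE .. ONE THOUSAND (Ⅰ..Ⅿ)
         (cp >= 0x24B6 && cp <= 0x24CF);    // So: CIRCLED LATIN CAPITAL LETTER A..Z
}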

These issues showed up when writing the Python test to compare against CPython. The test lists all the issues and all the codepoints that are affected.

I wonder what the best way forward is? Ideally, libutf8proc would implement this.

Would this be a reason not to merge this PR, or can we live with this? (I'm ok with it as it is, minus the performance.)

Performance:

IsAlphaAscii_median     13929427 ns     13925663 ns            3 bytes_per_second=1.11116G/s items_per_second=75.2981M/s
IsAlphaUnicode_median   35342242 ns     35338347 ns            3 bytes_per_second=448.378M/s items_per_second=29.6725M/s

I think the performance is reasonable for ascii, but we could use a lookup table for the call to utf8proc_category(..), which should speed up the unicode version.
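
For reference, a sketch of what that lookup table could look like (illustrative names, not the PR's code): precompute the utf8proc category for every codepoint once, so the per-character hot path becomes an array load.

#include <utf8proc.h>

#include <cstdint>
#include <vector>

// One byte per codepoint over the whole Unicode range (~1.1 MiB),
// built once at startup.
std::vector<uint8_t> MakeCategoryTable() {
  std::vector<uint8_t> table(0x110000);
  for (utf8proc_int32_t cp = 0; cp < 0x110000; ++cp) {
    table[cp] = static_cast<uint8_t>(utf8proc_category(cp));
  }
  return table;
}

// Hot path: an array load and compare instead of a library call.
inline bool IsAlphaCodepoint(const std::vector<uint8_t>& table, uint32_t cp) {
  switch (table[cp]) {
    case UTF8PROC_CATEGORY_LU:
    case UTF8PROC_CATEGORY_LL:
    case UTF8PROC_CATEGORY_LT:
    case UTF8PROC_CATEGORY_LM:
    case UTF8PROC_CATEGORY_LO:
      return true;
    default:
      return false;
  }
}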

Naming

I've used the _ascii and _unicode suffixes instead of _utf8, since semantically we don't care about the encoding. It should not matter how the string is encoded (utf8/16/32).

@wesm
Member

wesm commented Jul 7, 2020

I haven't looked at the details yet, but could we use a normalized naming convention for the functions?

  • ascii_$FUNC, for functions that expect ASCII input
  • utf8_$FUNC, for functions that expect UTF8 input
  • binary_$FUNC, for functions that work on any binary data

@maartenbreddels
Contributor Author

I wanted to focus on the other issues first, but I propose this naming scheme instead:
<arrow type>_<func>_<variant>, since all you care about is that the function operates on a string, and all the ascii variants work with all utf8 strings. Also, I don't know if there will be utf16/utf32 support in the future; in that case, this naming scheme makes more sense. I'm ok with changing it back, or opening a PR to make this consistent (string_lower_ascii, string_lower_unicode, ... etc).

@wesm
Member

wesm commented Jul 7, 2020

OK, I would also like to solicit @xhochy's and @pitrou's preferences in the function naming. I don't want to spend too much time on it, but as long as we have a consistent naming scheme that accommodates the anticipated use cases, that sounds fine to me.

@xhochy
Member

xhochy commented Jul 7, 2020

I'm indifferent to utf8_ vs string_ (sometimes our codebase is too), but other than that I fully agree with @wesm's naming scheme.

@wesm
Member

wesm commented Jul 7, 2020

I think that having e.g. string_lower_utf8 and string_lower_ascii is fine, too.

@pitrou
Member

pitrou commented Jul 7, 2020

We should be consistent. We already have "utf8_lower" and "utf8_upper".

@maartenbreddels
Contributor Author

Are there plans to support utf16/32? Seeing the code as it is now, it would be trivial to add.

@xhochy
Member

xhochy commented Jul 7, 2020

Are there plans to support utf16/32? Seeing the code as it is now, it would be trivial to add.

This would be a longer discussion. Personally I would vote "no" to limit the scope / dimensionality of Arrow's memory model for now.

@maartenbreddels
Contributor Author

If it is not off the table, I propose at least the utf8 -> unicode renaming, to be future-compatible.

@pitrou
Member

pitrou commented Jul 7, 2020

"utf8" is used everywhere, please keep it like this.

@pitrou
Member

pitrou commented Jul 7, 2020

Are there plans to support utf16/32?

I think that should be out of scope. There's no Arrow data type for utf16 and utf32.

@maartenbreddels
Contributor Author

I am not talking about the C++ code, only the kernel name. UTF8 is just the encoding; Unicode refers to the semantics of the kernel. But if utf16/32 is excluded, I guess we can keep the old naming scheme.

@pitrou
Member

pitrou commented Jul 7, 2020

I am talking about the kernel name as well. "unicode" is non-descriptive.

@stevengj

stevengj commented Jul 8, 2020

It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General category Letter Other).

U+08BE was defined in Unicode 13, and category Lo is correct for that character. It sounds like you may be looking at obsolete Unicode tables?

utf8proc doesn't store and expose the information if a codepoint is of a Numeric type

Can't you use the Unicode category (N*) for this? That's what Julia does.
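
For reference, that suggestion maps onto utf8proc's category enum roughly like this (a sketch, not the PR's code):

#include <utf8proc.h>

// Numeric per the general category: Nd (decimal digit),
// Nl (letter number, e.g. Roman numerals), No (other number).
inline bool IsNumericByCategory(utf8proc_int32_t cp) {
  switch (utf8proc_category(cp)) {
    case UTF8PROC_CATEGORY_ND:
    case UTF8PROC_CATEGORY_NL:
    case UTF8PROC_CATEGORY_NO:
      return true;
    default:
      return false;
  }
}

Note that this still misses codepoints whose category is Lo but which carry a Numeric_Type (such as 柒, discussed further down), so it does not fully match CPython's str.isnumeric().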

@@ -183,6 +188,53 @@ struct UTF8Transform {
}
};

template <typename StringType, typename Derived>
struct BinaryToBoolean {
Member

Do you think you could also make BinaryContainsExact use this struct?

Contributor Author

Good that you ask: would you say my approach is more or less readable? I prefer the CRTP pattern, but it needs some changes to suit your needs (it has to use non-statics, since it needs to store the pattern/table). If you prefer my approach, I can refactor your code when I start on startswith or the regex version of your kernel.

Member

It's on the same level of readability. There is the complication that BinaryContainsExact needs options, but that will be the same with startswith or the regex version (or the case-insensitive one).
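
For context, the CRTP shape under discussion looks roughly like this (a simplified sketch with illustrative names; Arrow's real kernels also handle nulls, offsets, and output bitmaps):

#include <string_view>

template <typename Derived>
struct BinaryToBoolean {
  // The base drives the loop; Derived supplies the per-value predicate.
  // No virtual dispatch: Derived::Predicate resolves at compile time.
  template <typename InputIt, typename OutputIt>
  static void Exec(InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) {
      *out++ = Derived::Predicate(*first);
    }
  }
};

struct IsAlphaAscii : BinaryToBoolean<IsAlphaAscii> {
  static bool Predicate(std::string_view s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
      if (!((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))) return false;
    }
    return true;
  }
};

A stateful variant (e.g. BinaryContainsExact, which must store its pattern) would make Predicate a non-static member, as mentioned above.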

TYPED_TEST(TestStringKernels, IsAlphaNumericUnicode) {
// U+08BE (utf8: \xE0\xA2\xBE) is undefined, but utf8proc thinks it is
// UTF8PROC_CATEGORY_LO
this->CheckUnary("string_isalnum_unicode", "[\"ⱭɽⱤoW123\", null, \"Ɑ2\", \"!\", \"\"]",
Member

I would expect tests with non-alphanumeric characters, e.g. do we get true or false for "some text"?

Contributor Author

added

@@ -95,6 +95,56 @@ def test_binary_contains_exact():
assert expected.equals(result)


# utf8proc claims 0x8be is UTF8PROC_CATEGORY_LO
utf8proc_issue_isalpha = {0x8be, 0x8bf, 0x8c0, 0x8c1, 0x8c2, 0x8c3, 0x8c4, 0x8c5, 0x8c6, 0x8c7, 0xd04, 0xe86, 0xe89, 0xe8c, 0xe8e, 0xe8f, 0xe90, 0xe91, 0xe92, 0xe93, 0xe98, 0xea0, 0xea8, 0xea9, 0xeac, 0x1cf2, 0x1cf3, 0x1cfa, 0x31bb, 0x31bc, 0x31bd, 0x31be, 0x31bf, 0x4db6, 0x4db7, 0x4db8, 0x4db9, 0x4dba, 0x4dbb, 0x4dbc, 0x4dbd, 0x4dbe, 0x4dbf, 0x9ff0, 0x9ff1, 0x9ff2, 0x9ff3, 0x9ff4, 0x9ff5, 0x9ff6, 0x9ff7, 0x9ff8, 0x9ff9, 0x9ffa, 0x9ffb, 0x9ffc, 0xa7ba, 0xa7bb, 0xa7bc, 0xa7bd, 0xa7be, 0xa7bf, 0xa7c2, 0xa7c3, 0xa7c4, 0xa7c5, 0xa7c6, 0xa7c7, 0xa7c8, 0xa7c9, 0xa7ca, 0xa7f5, 0xa7f6, 0xab66, 0xab67, 0xab68, 0xab69, 0x10e80, 0x10e81, 0x10e82, 0x10e83, 0x10e84, 0x10e85, 0x10e86, 0x10e87, 0x10e88, 0x10e89,
Member

Hmm... what is the point of this exactly? We don't want to test or document utf8proc's behaviour exhaustively.

Contributor Author

I've just pushed changes on this. We'd like to see how the Python and Arrow implementations match since this is going to be used as an alternative for the Python implementation. It runs really fast, so performance is not an issue.

variant = 'ascii' if ascii else 'unicode'
arrow_name = f'string_{function_name}_{variant}'
py_name = function_name
for i in range(128 if ascii else 0x11000):
Member

We really don't want to test the entire unicode range in pure Python.
Just do a couple examples like in C++, IMHO (possibly different examples).

Also, I would expect some tests of the behaviour of "ascii" functions on non-Ascii text.

Contributor Author

It runs fast, so performance is not an issue. As it is now, all non-ascii values are passed through. We didn't formally agree on this, but it is how we started (with ascii_lower), and it seems like a good idea.

Member

As it is now, all non-ascii values are 'pass through'

Is it tested for somewhere?

Contributor Author

Yes, many of the C++ ascii tests pass in utf8 data to test if it simply gets passed through.

@pitrou
Member

pitrou commented Jul 8, 2020

As for the performance, I think 400 MB/s is perfectly acceptable. I don't know in which kind of situation these functions would be performance-critical.

@maartenbreddels
Contributor Author

U+08BE was defined in Unicode 13, and category Lo is correct for that character. It sounds like you may be looking at obsolete Unicode tables?

Thanks for that. As I replied in the utf8proc issue, I didn't expect the Unicode data to change that fast (I guess Python 3.7 doesn't support Unicode 13; that information is actually difficult to find).

Can't you use the Unicode category (N*) for this? That's what Julia does.

That's how I implemented it now. For instance, https://graphemica.com/%E6%9F%92 has a numeric value of 7 (it's an example from the Unicode spec v13, section 4.6: http://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf). Python lists it as numeric (assert '柒'.isnumeric() == True), but its General Category is 'Other Letter'.

I didn't open an issue, because I'm not sure where this information is, I have difficulty mapping between the spec, what Python does and the Unicode data files.

And to be honest, I don't fully understand it, and it's a small list:
㐅, 㒃, 㠪, 㭍, 一, 七, 万, 三, 九, 二, 五, 亖, 亿, 什, 仟, 仨, 伍, 佰, 億, 兆, 兩, 八, 六, 十, 千, 卄, 卅, 卌, 叁, 参, 參, 叄, 四, 壱, 壹, 幺, 廾, 廿, 弌, 弍, 弎, 弐, 拾, 捌, 柒, 漆, 玖, 百, 肆, 萬, 貮, 貳, 贰, 阡, 陆, 陌, 陸, 零, 參, 拾, 兩, 零, 六, 陸, 什,
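
If we wanted to match CPython for these, one option would be a small hand-maintained set (a sketch; only a few of the codepoints above are shown):

#include <unordered_set>

// Lo codepoints that carry a Numeric_Type, which utf8proc's category
// alone cannot identify. The full list would come from the Unicode
// data files (DerivedNumericType.txt).
inline bool IsNumericIdeograph(char32_t cp) {
  static const std::unordered_set<char32_t> kNumericLo = {
      0x4E00,  // 一 (one)
      0x4E8C,  // 二 (two)
      0x4E09,  // 三 (three)
      0x67D2,  // 柒 (seven, financial form)
  };
  return kNumericLo.count(cp) != 0;
}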

@pitrou
Member

pitrou commented Jul 8, 2020

The Python unicodedata module gives you the Unicode spec version:

Python 3.7.5 (default, Nov  7 2019, 10:50:52) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.unidata_version
'11.0.0'
Python 3.8.0+ (heads/3.8:fe934e1d03, Nov  2 2019, 16:21:38) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.unidata_version
'12.1.0'

@pitrou
Member

pitrou commented Jul 8, 2020

Also, for the record, the Python str methods are implemented in Objects/unicodeobject.c, for example:
https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L12051-L12094

@maartenbreddels
Contributor Author

Thanks, I didn't know about unicodedata.unidata_version. Yeah, I've taken a look at that, because sometimes the code is clearer than the documentation.

@pitrou
Member

pitrou commented Jul 8, 2020

Also, the predicates are defined here:
https://github.com/python/cpython/blob/master/Objects/unicodectype.c


# Python claims these are not alpha, not sure why; they are in
# gc='Other Letter': https://graphemica.com/%E1%B3%B2
unknown_issue_isalpha = {0x1cf2, 0x1cf3}
Contributor Author

Puzzled by this:

unicodedata.category(chr(0x1cf2)) == 'Mc'

Seems like a CPython bug?

@wesm
Member

wesm commented Jul 12, 2020

The ASAN/UBSAN test failure is caused by outdated utf8proc (Ubuntu 18.04 has 2.1.0 which has test failures), so I'm switching it to use the BUNDLED version.

I'll refactor the usages of BinaryToBoolean to not use CRTP and confirm that there is no performance difference.

@wesm
Member

wesm commented Jul 12, 2020

I simplified many of the templates to not require an Arrow type parameter and to do less inlining (which doesn't seem to make much of a difference performance-wise, and it inflates binary sizes).

I made a couple of other tweaks to the BinaryToBoolean helper; benchmarks are within tolerance except IsAlphaNumericUnicode, which seems to have gotten ~20% faster:

               benchmark         baseline        contender  change %             counters
3  IsAlphaNumericUnicode  639.335 MiB/sec  785.883 MiB/sec    22.922   {'iterations': 29}
0              Utf8Lower  447.866 MiB/sec  467.099 MiB/sec     4.294   {'iterations': 20}
5              Utf8Upper  460.699 MiB/sec  467.486 MiB/sec     1.473   {'iterations': 20}
1    BinaryContainsExact  581.487 MiB/sec  564.348 MiB/sec    -2.947   {'iterations': 26}
2    IsAlphaNumericAscii  904.964 MiB/sec  876.430 MiB/sec    -3.153   {'iterations': 42}
6             AsciiLower   10.095 GiB/sec    9.711 GiB/sec    -3.802  {'iterations': 440}
4             AsciiUpper    9.972 GiB/sec    9.453 GiB/sec    -5.206  {'iterations': 443}

@wesm
Member

wesm commented Jul 12, 2020

ah I think I figured out the mystery of the failed compilation -- it's caused by the missing utf8proc.

@wesm
Member

wesm commented Jul 12, 2020

+1. Here are the benchmarks with gcc-8, comparing the last commit with 3739e66, which is right before I started making changes:

               benchmark         baseline        contender  change %             counters
6  IsAlphaNumericUnicode  644.115 MiB/sec  829.133 MiB/sec    28.724   {'iterations': 29}
2              Utf8Lower  453.035 MiB/sec  478.970 MiB/sec     5.725   {'iterations': 20}
4              Utf8Upper  462.962 MiB/sec  488.924 MiB/sec     5.608   {'iterations': 21}
3             AsciiUpper    9.913 GiB/sec   10.098 GiB/sec     1.864  {'iterations': 446}
0             AsciiLower    9.956 GiB/sec    9.951 GiB/sec    -0.050  {'iterations': 435}
5    BinaryContainsExact  582.241 MiB/sec  570.008 MiB/sec    -2.101   {'iterations': 26}
1    IsAlphaNumericAscii  977.104 MiB/sec  920.929 MiB/sec    -5.749   {'iterations': 43}

I think +/- 5% is basically noise on these, so merging. Thanks @maartenbreddels!

@maartenbreddels
Contributor Author

ah I think I figured out the mystery of the failed compilation -- it's caused by the missing utf8proc.

Of course, now the error msg makes sense. Thanks for finishing this!

FYI: for me the benchmarks are about the same (maybe ~5% slower for IsAlphaNumericUnicode; AMD Ryzen / gcc 9.3).

@maartenbreddels deleted the ARROW-9268 branch July 13, 2020 10:30
@pitrou
Member

pitrou commented Jul 13, 2020

Ah, I hadn't noticed this was merged. This will need documenting in docs/source/cpp/compute.rst.

@wesm
Member

wesm commented Jul 13, 2020

Yes, indeed.
