ARROW-9268: [C++] add string_is{alnum,alpha...,upper} kernels #7656
Conversation
I haven't looked at the details yet, but could we use a normalized naming convention for the functions?
I wanted to focus on the other issues first, but I propose the following naming change:
I'm indifferent to
I think that having e.g.
We should be consistent. We already have "utf8_lower" and "utf8_upper".
Are there plans to support utf16/32? Seeing the code as it is now, it would be trivial to add.
This would be a longer discussion. Personally I would vote "no", to limit the scope / dimensionality of Arrow's memory model for now.
If it is not off the table, I propose at least a utf8 -> unicode renaming, to be future-compatible.
"utf8" is used everywhere, please keep it like this.
I think that should be out of scope. There's no Arrow data type for utf16 or utf32.
I am not talking about the C++ code, only the kernel name. UTF8 is just the encoding; Unicode refers to the semantics of the kernel. But if utf16/32 is excluded, I guess we can keep the old naming scheme.
I am talking about the kernel name as well. "unicode" is non-descriptive.
U+08BE was defined in Unicode 13, and category Lo is correct for that character. It sounds like you may be looking at obsolete Unicode tables?
Can't you use the Unicode category (N*) for this? That's what Julia does.
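For reference, a minimal sketch of the category-based check being suggested, using utf8proc's general-category API (the function name `IsNumericCodepoint` is illustrative, not code from this PR):

```cpp
#include <utf8proc.h>

// Treat a code point as numeric when its Unicode general category is in the
// N* group: Nd (decimal digit), Nl (letter number), No (other number).
static bool IsNumericCodepoint(utf8proc_int32_t codepoint) {
  switch (utf8proc_category(codepoint)) {
    case UTF8PROC_CATEGORY_ND:
    case UTF8PROC_CATEGORY_NL:
    case UTF8PROC_CATEGORY_NO:
      return true;
    default:
      return false;
  }
}
```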
@@ -183,6 +188,53 @@ struct UTF8Transform {
  }
};

template <typename StringType, typename Derived>
struct BinaryToBoolean {
Do you think you could also make BinaryContainsExact use this struct?
Good that you ask: would you say my approach is more, or less, readable? I prefer the CRTP pattern, but it needs some changes to suit your needs (it needs to use non-statics, since it needs to store the pattern/table). If you prefer my approach I can refactor your code when I start on startswith or the regex version of your kernel.
It's on the same level of readability. There is the complication that BinaryContainsExact needs options, but that will be the same with startswith or the regex version (or the case-insensitive one).
TYPED_TEST(TestStringKernels, IsAlphaNumericUnicode) {
  // U+08BE (utf8: \xE0\xA2\xBE) is undefined, but utf8proc thinks it is
  // UTF8PROC_CATEGORY_LO
  this->CheckUnary("string_isalnum_unicode", "[\"ⱭɽⱤoW123\", null, \"Ɑ2\", \"!\", \"\"]",
I would expect tests with non-alphanumeric characters, e.g. do we get true or false for "some text"?
added
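For illustration only, negative cases in the style of the test above might look like this (assuming Arrow's CheckUnary test helper and Python-like semantics, where a string with a space, or the empty string, is not alphanumeric; these are not necessarily the exact cases that were added):

```cpp
this->CheckUnary("string_isalnum_unicode",
                 "[\"some text\", \"sometext\", \"123\", \"\"]", boolean(),
                 "[false, true, true, false]");
```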
python/pyarrow/tests/test_compute.py (Outdated)
@@ -95,6 +95,56 @@ def test_binary_contains_exact():
    assert expected.equals(result)

# utf8proc claims 0x8be is UTF8PROC_CATEGORY_LO
utf8proc_issue_isalpha = {0x8be, 0x8bf, 0x8c0, 0x8c1, 0x8c2, 0x8c3, 0x8c4, 0x8c5, 0x8c6, 0x8c7, 0xd04, 0xe86, 0xe89, 0xe8c, 0xe8e, 0xe8f, 0xe90, 0xe91, 0xe92, 0xe93, 0xe98, 0xea0, 0xea8, 0xea9, 0xeac, 0x1cf2, 0x1cf3, 0x1cfa, 0x31bb, 0x31bc, 0x31bd, 0x31be, 0x31bf, 0x4db6, 0x4db7, 0x4db8, 0x4db9, 0x4dba, 0x4dbb, 0x4dbc, 0x4dbd, 0x4dbe, 0x4dbf, 0x9ff0, 0x9ff1, 0x9ff2, 0x9ff3, 0x9ff4, 0x9ff5, 0x9ff6, 0x9ff7, 0x9ff8, 0x9ff9, 0x9ffa, 0x9ffb, 0x9ffc, 0xa7ba, 0xa7bb, 0xa7bc, 0xa7bd, 0xa7be, 0xa7bf, 0xa7c2, 0xa7c3, 0xa7c4, 0xa7c5, 0xa7c6, 0xa7c7, 0xa7c8, 0xa7c9, 0xa7ca, 0xa7f5, 0xa7f6, 0xab66, 0xab67, 0xab68, 0xab69, 0x10e80, 0x10e81, 0x10e82, 0x10e83, 0x10e84, 0x10e85, 0x10e86, 0x10e87, 0x10e88, 0x10e89,
Hmm... what is the point of this exactly? We don't want to test or document utf8proc's behaviour exhaustively.
I've just pushed changes on this. We'd like to see how the Python and Arrow implementations match since this is going to be used as an alternative for the Python implementation. It runs really fast, so performance is not an issue.
variant = 'ascii' if ascii else 'unicode'
arrow_name = f'string_{function_name}_{variant}'
py_name = function_name
for i in range(128 if ascii else 0x11000):
We really don't want to test the entire unicode range in pure Python.
Just do a couple examples like in C++, IMHO (possibly different examples).
Also, I would expect some tests of the behaviour of "ascii" functions on non-ASCII text.
It runs fast, so performance is not an issue. As it is now, all non-ascii values are 'pass through'. We didn't formally agree on this, but this is how we started (with ascii_lower) and it seems like a good idea.
As it is now, all non-ascii values are 'pass through'
Is it tested for somewhere?
Yes, many of the C++ ascii tests pass in utf8 data to test if it simply gets passed through.
As for the performance, I think 400 MB/s is perfectly acceptable. I don't know in which kind of situation these functions would be performance-critical.
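To make the "pass through" convention concrete, here is a hypothetical per-byte transform in the spirit of ascii_lower (a sketch, not the PR's actual code): only ASCII uppercase letters are mapped, so the bytes of any multi-byte UTF-8 sequence survive unchanged.

```cpp
#include <cstdint>

static inline uint8_t AsciiLowerByte(uint8_t b) {
  // Only A-Z is mapped; any other byte, including UTF-8 continuation bytes
  // (0x80-0xBF) and lead bytes (0xC2-0xF4), passes through untouched.
  return (b >= 'A' && b <= 'Z') ? static_cast<uint8_t>(b + ('a' - 'A')) : b;
}
```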
Thanks for that. As I replied in the issue on utf8proc, I didn't expect the Unicode data to change that fast (I guess Python 3.7 doesn't support Unicode 13, information that is actually difficult to find).
That's how I implemented it now. For instance, https://graphemica.com/%E6%9F%92 has a numeric value of 7 (it's an example from the Unicode spec v13, section 4.6, http://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf ), and Python lists this as numeric. I didn't open an issue, because I'm not sure where this information is defined; I have difficulty mapping between the spec, what Python does, and the Unicode data files. And to be honest, I don't fully understand it, and it's a small list:
Also, for the record, the Python unicodedata module exposes this information.
Thanks, I didn't know about unicodedata. Yeah, I've taken a look at that, because sometimes the code is clearer than the documentation.
Also, the predicates are defined here:
# Python claims these are not alpha, not sure why; they are in
# gc='Other Letter': https://graphemica.com/%E1%B3%B2
unknown_issue_isalpha = {0x1cf2, 0x1cf3}
Puzzled by this:
unicodedata.category(chr(0x1cf2)) == 'Mc'
Seems like a CPython bug?
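For comparison, one can query utf8proc's view of the same code point with a small standalone check (per the discussion above, recent utf8proc releases report "Lo" here, while this CPython build reports "Mc"):

```cpp
#include <cstdio>
#include <utf8proc.h>

int main() {
  // Prints the two-letter general category utf8proc assigns to U+1CF2.
  std::printf("U+1CF2 -> %s\n", utf8proc_category_string(0x1CF2));
  return 0;
}
```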
…eral utf8 char if it does not recognize it?)
The ASAN/UBSAN test failure is caused by an outdated utf8proc (Ubuntu 18.04 has 2.1.0, which has test failures), so I'm switching it to use the BUNDLED version. I'll refactor the usages of
I simplified many of the templates to not require an Arrow type parameter and to do less inlining (which doesn't seem to make much of a difference performance-wise, and it inflates binary sizes). I made a couple of other tweaks to the BinaryToBoolean helper; benchmarks are within tolerance except IsAlphaNumericUnicode, which got 20% faster, it seems.
Ah, I think I figured out the mystery of the failed compilation: it's caused by the missing utf8proc.
+1. Here are the benchmarks with gcc-8 comparing the last commit with 3739e66, which is right before I started making changes.
I think +/- 5% is basically noise on these, so merging. Thanks @maartenbreddels!
Of course, now the error msg makes sense. Thanks for finishing this! FYI: for me the benchmarks are about the same (maybe 5% slower for IsAlphaNumericUnicode, AMD Ryzen / gcc 9.3).
Ah, I hadn't noticed this was merged. This will need documenting in docs/source/cpp/compute.rst.
Yes, indeed.
Quite a few issues showed up:
These issues showed up when writing the Python test to compare against CPython. The test lists all the issues and all the code points that give issues.
I wonder what the best way forward is? Ideally, libutf8proc would implement this.
Would this be a reason to not merge this PR, or can we live with this? (I'm OK with it as it is, minus the performance.)
Performance:
I think the performance is reasonable for ascii, but I think we can use a lookup table for the call to utf8proc_category(..), which should speed up the unicode version (see the sketch below).
Naming:
I've used the _ascii and _unicode suffixes, instead of _utf8, since semantically we don't care about the encoding. It should not matter how the string is encoded (utf8/16/32).
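A possible shape for that lookup table, sketched here as an assumption rather than as the eventual implementation: precompute one byte of general category per code point once, then replace each utf8proc_category() call with an array index. The letter-range test relies on utf8proc's enum layout, where Lu through Lo are the contiguous values 1 through 5.

```cpp
#include <array>
#include <cstdint>
#include <utf8proc.h>

// One byte per code point over the whole Unicode range (~1.1 MB), built once
// on first use.
static const std::array<uint8_t, 0x110000>& CategoryTable() {
  static const std::array<uint8_t, 0x110000>* table = [] {
    auto* t = new std::array<uint8_t, 0x110000>();
    for (utf8proc_int32_t cp = 0; cp < 0x110000; ++cp) {
      (*t)[cp] = static_cast<uint8_t>(utf8proc_category(cp));
    }
    return t;
  }();
  return *table;
}

// Caller must pass a valid code point (< 0x110000).
static inline bool IsAlphaCodepoint(uint32_t cp) {
  const uint8_t cat = CategoryTable()[cp];
  return cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_LO;
}
```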