Include case information from DerivedCoreProperties.txt #195

maartenbreddels · 2020-07-07T12:28:04Z

Hi,

The Roman numerals Ⅰ to Ⅿ and the circled letters Ⓐ to Ⓩ are upper case characters and should be listed as such in DerivedCoreProperties.txt, along with many more code points for lower case. Are there plans to include this information?

Regards,

Maarten

ref: apache/arrow#7656
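The claim in the opening comment can be checked from Python, whose semantics the linked Arrow work is trying to match. This is a minimal sketch that assumes CPython's `str.isupper()` reflects the derived Uppercase property for a single character; the helper name `describe` is made up for illustration. It shows that the general category alone does not mark these characters as letters, let alone upper case ones.

```python
import unicodedata


def describe(ch):
    """Return (code point, general category, str.isupper()) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isupper())


# First and last character of each range named in the issue: Ⅰ, Ⅿ, Ⓐ, Ⓩ
EXAMPLES = ["\u2160", "\u216f", "\u24b6", "\u24cf"]

for ch in EXAMPLES:
    # The Roman numerals are category Nl and the circled letters are So,
    # so a test for category Lu would miss all four.
    print(ch, *describe(ch))
```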

stevengj · 2020-07-08T03:19:14Z

A simple workaround seems to be isupper(c) = uppercase(c) == c && lowercase(c) != c, using the utf8proc uppercase and lowercase functions, and conversely for islower. Is there an application where this would not be sufficient?

(It would be interesting to check whether that rule is equivalent to checking the flag in DerivedCoreProperties.txt … it works for the examples you listed.)

It seems like anything that needed something fancier than that would need locale information etcetera too…
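The proposed rule, and the equivalence check suggested alongside it, can be sketched in Python. Two assumptions apply: Python's `str.upper()` and `str.lower()` stand in for the utf8proc case-mapping functions (Python applies full case mappings, so results may differ from utf8proc for some code points), and `str.isupper()` stands in for the flag in DerivedCoreProperties.txt. The function names are hypothetical.

```python
def rule_isupper(ch):
    # Upper case if upper-casing is a no-op and lower-casing changes it.
    return ch.upper() == ch and ch.lower() != ch


def rule_islower(ch):
    # The converse rule for lower case.
    return ch.lower() == ch and ch.upper() != ch


def upper_mismatches():
    """List characters where the rule and str.isupper() disagree."""
    return [
        chr(cp)
        for cp in range(0x110000)
        if rule_isupper(chr(cp)) != chr(cp).isupper()
    ]


# The examples from the issue satisfy the rule.
print(rule_isupper("\u2160"), rule_isupper("\u24b6"))
```

Calling `upper_mismatches()` scans every code point and returns those where the rule and the property disagree. Characters that carry a case property but have no case mapping at all are the likely candidates.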

maartenbreddels · 2020-07-08T08:03:19Z

I assumed it didn't work (from reading through the Unicode spec), but for upper case testing it actually does:

  return (HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_LU) ||
          ((static_cast<uint32_t>(utf8proc_toupper(codepoint)) == codepoint) &&
           (static_cast<uint32_t>(utf8proc_tolower(codepoint)) != codepoint))) &&
         !HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_LT);

I think that's mostly luck, though: the equivalent for lower case doesn't work. One example is U+00AA, which is listed as lower case but has no upper case version. https://graphemica.com/%C2%AA

I agree that locale information is a step too far. Do you think storing whether something is lower or upper case is out of scope for utf8proc?
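The U+00AA counterexample can be reproduced with a rough Python port of the check above. This is a sketch under assumptions: `unicodedata.category` replaces the `HasAnyUnicodeGeneralCategory` helper, Python's string case mapping replaces `utf8proc_toupper`/`utf8proc_tolower`, and `str.islower()` stands in for the derived Lowercase property. The names `combined_isupper` and `combined_islower` are invented here.

```python
import unicodedata


def combined_isupper(ch):
    # Category Lu, or the case-mapping rule, but never titlecase (Lt).
    cat = unicodedata.category(ch)
    rule = ch.upper() == ch and ch.lower() != ch
    return (cat == "Lu" or rule) and cat != "Lt"


def combined_islower(ch):
    # The mirrored check for lower case.
    cat = unicodedata.category(ch)
    rule = ch.lower() == ch and ch.upper() != ch
    return (cat == "Ll" or rule) and cat != "Lt"


FEMININE_ORDINAL = "\u00aa"

# U+00AA is category Lo and has no case mapping in either direction,
# so the mirrored check rejects it even though str.islower() accepts it.
print(
    unicodedata.category(FEMININE_ORDINAL),
    FEMININE_ORDINAL.islower(),
    combined_islower(FEMININE_ORDINAL),
)
```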

stevengj · 2020-07-08T14:04:08Z

What is the application of knowing whether U+00AA is lower case?

Before deciding whether to include something, it's helpful to know a practical application in order to understand the needed functionality.

stevengj · 2020-07-08T14:08:50Z

For example, if you're trying to detect proper nouns in text, then characters like U+00AA without an upper-case version do not seem to be a concern. (Moreover, no locale-unaware test for a proper noun seems likely to be completely reliable, so it's odd to worry about bizarre Unicode corner cases for noun detection if you aren't bothering with locales.)

It's certainly possible to add this information to utf8proc, but I'm reluctant to expand the data tables to include information that may be practically useless (compared to what we already provide).

maartenbreddels · 2020-07-08T14:53:50Z

I am translating Python semantics to Apache Arrow to get better performance: Pandas dataframes currently use Python for string manipulations such as str.islower().

So it's not that I have a specific use case in mind right now, but a lot of data scientists will rely on this code. Having no exceptions means having less to explain, and being conformant to a spec (if I am not mistaken) is a big plus.

stevengj · 2020-07-08T19:00:41Z

That's reasonable.

We can probably store this information without changing our data structure, just by storing sentinel values in the uppercase_seqindex and lowercase_seqindex fields…
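One way to picture the sentinel idea is the toy model below. It is purely illustrative and not taken from utf8proc: the sentinel values `NO_MAPPING` and `CASED_SELF`, the `Props` record, and the tiny tables are all assumptions. Only the two field names come from the comment above. The point is that a cased character without a counterpart, such as U+00AA, can be flagged through an existing index field instead of a new table column.

```python
from dataclasses import dataclass

NO_MAPPING = 0xFFFF  # assumed sentinel: no case mapping stored
CASED_SELF = 0xFFFE  # hypothetical sentinel: maps to itself, but is cased


@dataclass(frozen=True)
class Props:
    uppercase_seqindex: int = NO_MAPPING
    lowercase_seqindex: int = NO_MAPPING


SEQUENCES = ["A", "a"]  # toy sequence table indexed by the fields above

TABLE = {
    "A": Props(lowercase_seqindex=1),
    "a": Props(uppercase_seqindex=0),
    # Lower case by property, yet no upper case counterpart exists.
    "\u00aa": Props(uppercase_seqindex=CASED_SELF),
}


def props(ch):
    return TABLE.get(ch, Props())


def toupper(ch):
    idx = props(ch).uppercase_seqindex
    # Both sentinels mean "leave the character unchanged".
    return ch if idx in (NO_MAPPING, CASED_SELF) else SEQUENCES[idx]


def islower(ch):
    p = props(ch)
    # Already lower case, and marked as cased via the upper case field.
    return p.lowercase_seqindex == NO_MAPPING and p.uppercase_seqindex != NO_MAPPING


def isupper(ch):
    p = props(ch)
    return p.uppercase_seqindex == NO_MAPPING and p.lowercase_seqindex != NO_MAPPING
```

In this model `islower("\u00aa")` is true while `toupper("\u00aa")` still returns the character unchanged, and an uncased character such as `"1"` is neither lower nor upper case.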

maartenbreddels mentioned this issue Jul 10, 2020

ARROW-9268: [C++] add string_is{alpnum,alpha...,upper} kernels apache/arrow#7656

Closed

stevengj mentioned this issue Jul 10, 2020

add islower/isupper functions #196

Merged

stevengj closed this as completed in #196 Aug 25, 2020
