Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include case information from DerivedCoreProperties.txt #195

Closed
maartenbreddels opened this issue Jul 7, 2020 · 6 comments · Fixed by #196
Closed

Include case information from DerivedCoreProperties.txt #195

maartenbreddels opened this issue Jul 7, 2020 · 6 comments · Fixed by #196

Comments

@maartenbreddels
Copy link

Hi,

Roman letter Ⅰ to Ⅿ and circled letters Ⓐ to Ⓩ are upper case characters, which should be listed in DerivedCoreProperties.txt, and many more for lower case codepoints. Are there plans to include this information?

Regards,

Maarten

ref: apache/arrow#7656

@stevengj
Copy link
Member

stevengj commented Jul 8, 2020

A simple workaround seems like it would be isupper(c) = uppercase(c) == c && lowercase(c) != c , using the utf8proc uppercase and lowercase functions, and conversely for islower… is there an application where this would not be sufficient?

(It would be interesting to check whether that rule is equivalent to checking the flag in DerivedCoreProperties.txt … it works for the examples you listed.)

It seems like anything that needed something fancier than that would need locale information etcetera too…

@maartenbreddels
Copy link
Author

I assumed it didn't work (reading throught the unicode spec), but for upper case testing it does actually:

  return (HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_LU) ||
          ((static_cast<uint32_t>(utf8proc_toupper(codepoint)) == codepoint) &&
           (static_cast<uint32_t>(utf8proc_tolower(codepoint)) != codepoint))) &&
         !HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_LT);

I think that's more out of luck, the equivalent for lower casing doesn't work, one example is U+00AA, which is listed as lower case, but has no upper case version. https://graphemica.com/%C2%AA

I think locale information is a step too far indeed. Do you think storing if something is lower or upper case is out of scope for utf8proc?

@stevengj
Copy link
Member

stevengj commented Jul 8, 2020

What is the application of knowing whether U+00AA is lower case?

Before deciding whether to include something, it's helpful to know a practical application in order to understand the needed functionality.

@stevengj
Copy link
Member

stevengj commented Jul 8, 2020

For example, if you're trying to detect proper nouns in text, then characters like U+00AA without an upper-case version do not seem to be a concern. (Moreover, no locale-unaware test for a proper noun seems likely to be completely reliable, so it's odd to worry about bizarre Unicode corner cases for noun detection if you aren't bothering with locales.)

It's certainly possible to add this information to utf8proc, but I'm reluctant to expand the data tables to include information that may be practically useless (compared to what we already provide).

@maartenbreddels
Copy link
Author

I am translating Python semantics to Apache Arrow. Pandas dataframes use Python for string manipulations, to get better performance, like str.islower(). So it's not that I have a use case in mind now, but it would be easier to explain if we have no exceptions.

So, it's not that I have a use case in mind next, but a lot of data scientists will rely on this code, having to explain less, and being conformant to a spec (if I am not mistaken) is a big +.

@stevengj
Copy link
Member

stevengj commented Jul 8, 2020

That's reasonable.

We can probably store this information without changing our data structure just by storing sentinel values in the uppercase_seqindex and utf8proc_uint16_t lowercase_seqindex fields…

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants