-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include case information from DerivedCoreProperties.txt #195
Comments
A simple workaround seems like it would be (It would be interesting to check whether that rule is equivalent to checking the flag in It seems like anything that needed something fancier than that would need locale information etcetera too… |
I assumed it didn't work (reading throught the unicode spec), but for upper case testing it does actually: return (HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_LU) ||
((static_cast<uint32_t>(utf8proc_toupper(codepoint)) == codepoint) &&
(static_cast<uint32_t>(utf8proc_tolower(codepoint)) != codepoint))) &&
!HasAnyUnicodeGeneralCategory(codepoint, UTF8PROC_CATEGORY_LT); I think that's more out of luck, the equivalent for lower casing doesn't work, one example is U+00AA, which is listed as lower case, but has no upper case version. https://graphemica.com/%C2%AA I think locale information is a step too far indeed. Do you think storing if something is lower or upper case is out of scope for utf8proc? |
What is the application of knowing whether U+00AA is lower case? Before deciding whether to include something, it's helpful to know a practical application in order to understand the needed functionality. |
For example, if you're trying to detect proper nouns in text, then characters like U+00AA without an upper-case version do not seem to be a concern. (Moreover, no locale-unaware test for a proper noun seems likely to be completely reliable, so it's odd to worry about bizarre Unicode corner cases for noun detection if you aren't bothering with locales.) It's certainly possible to add this information to utf8proc, but I'm reluctant to expand the data tables to include information that may be practically useless (compared to what we already provide). |
I am translating Python semantics to Apache Arrow. Pandas dataframes use Python for string manipulations, to get better performance, like str.islower(). So it's not that I have a use case in mind now, but it would be easier to explain if we have no exceptions. So, it's not that I have a use case in mind next, but a lot of data scientists will rely on this code, having to explain less, and being conformant to a spec (if I am not mistaken) is a big +. |
That's reasonable. We can probably store this information without changing our data structure just by storing sentinel values in the |
Hi,
Roman letter Ⅰ to Ⅿ and circled letters Ⓐ to Ⓩ are upper case characters, which should be listed in DerivedCoreProperties.txt, and many more for lower case codepoints. Are there plans to include this information?
Regards,
Maarten
ref: apache/arrow#7656
The text was updated successfully, but these errors were encountered: