Need APIs for unicode case mapping and case folding #20674
Comments
Could we start designing the simple case folding API?
@iSazonov wouldn't the existing string normalization and casing APIs be enough in your case?
@tarekgh Which APIs are you suggesting approximate case folding?
We already have casing APIs on the TextInfo and String classes. I understand we use the underlying OS for that, and this is by design, as we want to avoid carrying globalization data wherever possible. What exactly is the scenario that is not working for you?
The scenario is case folding as specified in the Unicode specification. See also
I know that. I am asking what your scenario is: why do you need case folding, and why is case mapping not enough?
The scenario is case folding for the purposes of comparing identifiers in a case-insensitive programming language according to the rules described in the Unicode specification. Case mapping is not enough; that is explained in some detail in the Unicode specification. There is no case folding API today.
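To make the identifier-comparison scenario concrete, here is a minimal sketch of why lowercasing alone falls short. Everything in it is an assumption for illustration: `IdentifierFolding`, `FoldChar`, `Fold`, and `EqualsIgnoreCase` are hypothetical names, not .NET APIs, and the table holds only four entries copied from the "C" (common) mappings in the Unicode `CaseFolding.txt` data file. It works on UTF-16 code units, so it ignores supplementary-plane characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not a .NET API: approximates Unicode *simple* case folding
// for BMP characters so that identifiers can be compared case-insensitively.
public static class IdentifierFolding
{
    // A few "C" (common) entries from CaseFolding.txt where the folded form
    // differs from the simple lowercase form. A real implementation would carry
    // the full table; this excerpt is for illustration only.
    private static readonly Dictionary<char, char> FoldOverrides = new Dictionary<char, char>
    {
        ['\u00B5'] = '\u03BC', // MICRO SIGN -> GREEK SMALL LETTER MU
        ['\u017F'] = 's',      // LATIN SMALL LETTER LONG S -> LATIN SMALL LETTER S
        ['\u03C2'] = '\u03C3', // GREEK SMALL LETTER FINAL SIGMA -> GREEK SMALL LETTER SIGMA
        ['\u212A'] = 'k',      // KELVIN SIGN -> LATIN SMALL LETTER K
    };

    public static char FoldChar(char c)
    {
        // Assumption: outside the override table, invariant lowercasing stands in
        // for folding. That is not exact for every script (Cherokee, for example,
        // folds to its uppercase letters).
        return FoldOverrides.TryGetValue(c, out char folded) ? folded : char.ToLowerInvariant(c);
    }

    public static string Fold(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            sb.Append(FoldChar(c));
        }
        return sb.ToString();
    }

    // Compares two identifiers by folding both sides and comparing ordinally.
    public static bool EqualsIgnoreCase(string a, string b)
    {
        return string.Equals(Fold(a), Fold(b), StringComparison.Ordinal);
    }
}
```

With this table, `IdentifierFolding.EqualsIgnoreCase("\u017Ftop", "STOP")` compares `"stop"` with `"stop"`. Lowercasing both sides would leave U+017F untouched, because it is already a lowercase letter, so the two identifiers would not match.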
Thanks @gafter. Considering we haven't seen high demand for this functionality and we'd need to carry some Unicode data to support it, I would suggest implementing it in a NuGet package; it doesn't have to be in the core for now. If we see more demand, we can consider adding it to core. We can use corefxlab to host the code if needed.
Please see https://github.com/dotnet/corefx/issues/33047 and PowerShell/PowerShell#8120. In PowerShell/PowerShell#8120 I provide a plan. In short, we hope to get performance benefits in PowerShell, RegEx, and string comparisons in general. This week I prepared alpha code for simple case folding, and the results are encouraging.
See also https://github.com/dotnet/corefx/issues/41333, which is a proposal to make
If we did want to provide APIs for this, they'd probably look like the following:
namespace System.Globalization
{
    public static class CharUnicodeInfo
    {
        // performs "simple" culture-unaware mapping to uppercase using CharUnicodeInfo backing data
        public static char ToUpperInvariant(char ch);
        public static int ToUpperInvariant(int codePoint);
        public static Rune ToUpperInvariant(Rune rune);
        // performs "simple" culture-unaware mapping to lowercase using CharUnicodeInfo backing data
        public static char ToLowerInvariant(char ch);
        public static int ToLowerInvariant(int codePoint);
        public static Rune ToLowerInvariant(Rune rune);
        // performs "simple" culture-unaware mapping to titlecase using CharUnicodeInfo backing data
        public static char ToTitleInvariant(char ch);
        public static int ToTitleInvariant(int codePoint);
        public static Rune ToTitleInvariant(Rune rune);
        // performs "simple" case-fold mapping using CharUnicodeInfo backing data
        public static char ToCaseFold(char ch);
        public static int ToCaseFold(int codePoint);
        public static Rune ToCaseFold(Rune rune);
    }
}
The behavior of all of these APIs is that they'd use the data that's already present in the code-behind for
@GrabYourPitchforks would it make sense to move this functionality to the char and string classes (and support string case folding too)?
@tarekgh If we did https://github.com/dotnet/corefx/issues/41333, I think that would make sense. Otherwise we could end up with a mixture of ICU and NLS behaviors hanging off of
I believe in the near future we'll use ICU across all platforms, which will help avoid this mixture issue.
I don't understand. Unicode has no concept of an invariant mapping (i.e. one that fails to follow the evolving Unicode standard).
I'm using the term "invariant" to mean "simple" case mapping in a culture-agnostic fashion. This mapping is subject to change depending on which version of the Unicode Standard is pulled in by System.Private.CoreLib.dll. Check out my comment at https://github.com/dotnet/corefx/issues/41333#issuecomment-535274401 for more details on the proposed behavior of these APIs.
@gafter Follow-up question: Would it be acceptable for .NET Core to carry these tables as an internal implementation (e.g., we say ".NET 5's case folding APIs always use the Unicode 12.1 tables"), or do you require the ability to swap out the version of ICU which is used under the covers?
@GrabYourPitchforks I think that would work very well for us.
@tarekgh As a straw man, here's what it might look like to spread these APIs out on the classes where they naturally belong:
namespace System
{
    public struct char
    {
        public static char ToCaseFold(char c);
    }
    public static class MemoryExtensions
    {
        public static int ToCaseFold(this ReadOnlySpan<char> source, Span<char> destination);
    }
    public class string
    {
        public string ToCaseFold();
    }
}
namespace System.Globalization
{
    public static class CharUnicodeInfo
    {
        public static int ToCaseFold(int codePoint);
    }
    public class TextInfo
    {
        public static char ToCaseFold(char c);
        public static string ToCaseFold(string value);
    }
}
namespace System.Text
{
    public struct Rune
    {
        public static Rune ToCaseFold(Rune value);
    }
}
If we were to follow these patterns, then we'd probably end up with this behavior:
This follows the general concept of
Personally I think it's a little confusing to have multiple APIs that all share the same name but do slightly different things, but it does follow established convention. Thoughts?
Thanks, @GrabYourPitchforks, for your proposal. We talked offline; I just want to mention the idea I'd like to pursue: move the CaseFolding library from corefxlab to the runtime libraries repo, but instead of including it in the shipping SDK, ship it as a separate, independent NuGet package, so anyone who wants the case-folding feature just needs to depend on this package. The reason is that we would be carrying the case folding Unicode tables, which would increase the footprint of the product if we shipped them inside the SDK. This way, only apps and libraries that need the feature carry it. Here are some comments if we go with this plan:
Thanks for your thoughts on this issue.
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process. This process is part of our issue cleanup automation.
This issue will now be closed since it had been marked
@gafter commented on Mon May 11 2015
The Unicode function for determining the category of Unicode characters (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(char)) appears to reflect a fairly recent version of the Unicode standard. However, the functions for performing Unicode case mapping (culture-insensitive uppercase and lowercase) appear to reflect Unicode version 1.0. This mismatch was a severe impediment for implementing spec-compliant case folding in the VB compiler. There is also no support whatsoever for performing case folding according to the Unicode specification (simplified or otherwise). APIs in the platform to do this would help with any language that, like VB, wants to depend on the Unicode specification for the meaning of case-insensitive identifiers.
@jkotas commented on Mon May 11 2015
cc @ellismg
@ellismg commented on Mon May 11 2015
This is something I'd like to do. I think the open issue is how we support this on Windows (since I think that we can get Case Folding functionality from ICU).
In general, we've tried to move away from the framework itself shipping globalization data and used the OS provided APIs instead. Since Windows doesn't currently expose a way to do case folding, we'd have to figure out what to do.
@karelz commented on Fri Mar 17 2017
@tarekgh do we need a new API for this? If yes, we should move it to CoreFX. If not, let's remove "api addition" label.