Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Char.GetUnicodeCategory returns wrong category for certain Latin-1 characters #10990

Closed
GrabYourPitchforks opened this issue Aug 28, 2018 · 10 comments · Fixed by #41200
Closed
Labels
area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

In a nutshell, there are certain characters where CharUnicodeInfo.GetUnicodeCategory returns the correct value, but Char.GetUnicodeCategory returns the wrong value. One such character is U+00B6 PILCROW SIGN, where CharUnicodeInfo returns OtherPunctuation (which is correct) and where Char returns OtherSymbol (which is incorrect). This also affects the behavior of dependent methods like Char.IsPunctuation and Char.IsLower.

MSDN says this behavior is intentional to preserve back-compat, but it is extraordinarily confusing to have two methods with the same name have different behavior.

One solution would be to update Char.GetUnicodeCategory to stay in sync with CharUnicodeInfo.GetUnicodeCategory. This is a breaking change, but it's the type of breaking change that is normally allowed in side-by-side major version updates.

An alternative is to mark Char.GetUnicodeCategory, Char.IsPunctuation, Char.IsLower, etc. as obsolete and to direct users to call into CharUnicodeInfo instead. This preserves existing behavior and provides a migration story to get developers on to the APIs which provide correct results.

@jkotas
Copy link
Member

jkotas commented Aug 28, 2018

update Char.GetUnicodeCategory to stay in sync with CharUnicodeInfo.GetUnicodeCategory

+1

@Porges
Copy link

Porges commented Aug 29, 2018

Note that other things are also frozen to an old (pre-4.0) Unicode version for compatibility, like Regex:

Regex.IsMatch("\u00b6", @"\p{So}") // pilcrow
Regex.IsMatch("\u00ad", @"\p{Pd}") // soft-hyphen

@tarekgh
Copy link
Member

tarekgh commented Sep 5, 2018

We were avoiding updating the first 127 characters Unicode categories for compatibility reason because this can break things. And we always recommend using CharUnicodeInfo in general for that.

CharUnicodeInfo is not enough for the scenarios want to get the correct category?

@GrabYourPitchforks
Copy link
Member Author

And we always recommend using CharUnicodeInfo in general for that.

@tarekgh How would our developer audience know that? Yes, it says so on MSDN, but how would they even know that they'd need to consult MSDN on this particular issue? Hell, not even I knew this, and I'm a partial owner of this feature area!

If this API is only here for compat and all new code should be using the new API, this is the exact scenario that [Obsolete] was meant for. Otherwise we're just encouraging our developer audience to continue along writing new code against old, incorrect APIs.

@tarekgh
Copy link
Member

tarekgh commented Sep 6, 2018

@GrabYourPitchforks the point is we have the same thing in the full framework and the compatibility bar is very high there which I believe we cannot change that there. changing the behavior in .Net Core can be potentially introducing different behavior when the same code runs on the full framework. At the same time in my experience, we didn't get complains about that. I am not sure we are benefiting much of changing that now. anyone runs into issues because of that, can easily be corrected. I am not seeing a strong reason we need to change that now.

@jkotas
Copy link
Member

jkotas commented Sep 6, 2018

The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think.

The compatibility constrains of .NET Core are different from compatibility constrains of .NET Framework.

@GrabYourPitchforks
Copy link
Member Author

At the same time in my experience, we didn't get complains about that. I am not sure we are benefiting much of changing that now. anyone runs into issues because of that, can easily be corrected.

/me raises hand.

I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did.

@tarekgh
Copy link
Member

tarekgh commented Sep 6, 2018

The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think. The compatibility constrains of .NET Core are different from compatibility constrains of .NET Framework.

Just to clarify, we are talking here about a few characters in the ASCII range here which will never evolve again. we are not talking about the whole Unicode characters. We already update the categories for the rest of the Unicode characters. in another word CharUnicodeInfo.GetUnicodeCategory and Char properties are already identical.

This discrepancy in the ASCII range was introduced for a good reason too in the full framework.

I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did.

good and you know how to fix it now :-)

anyway, I'll not push back more than that. I am still not fully convinced we need to do that but if everyone else see we need to be aggressive changing that, I'll not resist more.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@stephentoub
Copy link
Member

I've run into this difference recently as well. It is quite confusing that the two methods return different results.

We're working on a major release, .NET 5. It is side-by-side. We are not and do not want to be bug-for-bug compatible. Why not just fix the methods to return the same results? Who are we helping by continuing to ship the discrepancy?

@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Jun 17, 2020
@tarekgh tarekgh modified the milestones: Future, 5.0.0 Jun 17, 2020
@tarekgh
Copy link
Member

tarekgh commented Jun 17, 2020

Moving it to 5.0 release as this will be the chance to have such breaking change.

@tarekgh tarekgh added the breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. label Jun 17, 2020
@ericstj ericstj added enhancement Product code improvement that does NOT require public API changes/additions and removed question Answer questions and provide assistance, not an issue with source code or documentation. labels Aug 17, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Globalization breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants