Char.GetUnicodeCategory returns wrong category for certain Latin-1 characters #10990

GrabYourPitchforks · 2018-08-28T00:05:24Z

In a nutshell, there are certain characters where CharUnicodeInfo.GetUnicodeCategory returns the correct value, but Char.GetUnicodeCategory returns the wrong value. One such character is U+00B6 PILCROW SIGN, where CharUnicodeInfo returns OtherPunctuation (which is correct) and where Char returns OtherSymbol (which is incorrect). This also affects the behavior of dependent methods like Char.IsPunctuation and Char.IsLower.
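The comparison described above can be reproduced with a minimal sketch like the one below. The type name `PilcrowCheck` and the helper `Agree` are illustrative names, not part of .NET. What it prints depends on the runtime it runs on, so no expected output is shown.

```csharp
using System;
using System.Globalization;

// Illustrative type name; only the char and CharUnicodeInfo calls are real .NET APIs.
public static class PilcrowCheck
{
    // True when both APIs report the same category for c on the current runtime.
    public static bool Agree(char c) =>
        char.GetUnicodeCategory(c) == CharUnicodeInfo.GetUnicodeCategory(c);

    public static void Main()
    {
        char pilcrow = '\u00B6'; // U+00B6 PILCROW SIGN

        Console.WriteLine($"Char.GetUnicodeCategory:            {char.GetUnicodeCategory(pilcrow)}");
        Console.WriteLine($"CharUnicodeInfo.GetUnicodeCategory: {CharUnicodeInfo.GetUnicodeCategory(pilcrow)}");

        // Dependent methods derive their answer from the Char-side category.
        Console.WriteLine($"Char.IsPunctuation: {char.IsPunctuation(pilcrow)}");
        Console.WriteLine($"Char.IsSymbol:      {char.IsSymbol(pilcrow)}");

        Console.WriteLine($"APIs agree: {Agree(pilcrow)}");
    }
}
```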

MSDN says this behavior is intentional to preserve back-compat, but it is extraordinarily confusing to have two methods with the same name have different behavior.

One solution would be to update Char.GetUnicodeCategory to stay in sync with CharUnicodeInfo.GetUnicodeCategory. This is a breaking change, but it's the type of breaking change that is normally allowed in side-by-side major version updates.

An alternative is to mark Char.GetUnicodeCategory, Char.IsPunctuation, Char.IsLower, etc. as obsolete and to direct users to call into CharUnicodeInfo instead. This preserves existing behavior and provides a migration story to get developers on to the APIs which provide correct results.

jkotas · 2018-08-28T00:24:15Z

update Char.GetUnicodeCategory to stay in sync with CharUnicodeInfo.GetUnicodeCategory

+1

Porges · 2018-08-29T09:59:51Z

Note that other things are also frozen to an old (pre-4.0) Unicode version for compatibility, like Regex:

Regex.IsMatch("\u00b6", @"\p{So}") // pilcrow
Regex.IsMatch("\u00ad", @"\p{Pd}") // soft-hyphen

tarekgh · 2018-09-05T19:54:50Z

We were avoiding updating the Unicode categories of the first 256 characters for compatibility reasons, because this can break things. And we always recommend using CharUnicodeInfo in general for that.

Is CharUnicodeInfo not enough for scenarios that need the correct category?

GrabYourPitchforks · 2018-09-06T00:38:13Z

And we always recommend using CharUnicodeInfo in general for that.

@tarekgh How would our developer audience know that? Yes, it says so on MSDN, but how would they even know that they'd need to consult MSDN on this particular issue? Hell, not even I knew this, and I'm a partial owner of this feature area!

If this API is only here for compat and all new code should be using the new API, this is the exact scenario that [Obsolete] was meant for. Otherwise we're just encouraging our developer audience to continue along writing new code against old, incorrect APIs.
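As a rough illustration of how the [Obsolete] option would look to callers, here is a sketch that uses a hypothetical wrapper type. `LegacyCategories` and `ObsoleteDemo` are made-up names, and the message text is an assumption rather than wording proposed in this thread.

```csharp
using System;
using System.Globalization;

// Hypothetical wrapper, not part of .NET: it stands in for a legacy API
// that keeps its old behavior but steers callers toward the replacement.
public static class LegacyCategories
{
    [Obsolete("Use CharUnicodeInfo.GetUnicodeCategory instead.")]
    public static UnicodeCategory GetUnicodeCategory(char c) =>
        char.GetUnicodeCategory(c);
}

public static class ObsoleteDemo
{
    public static void Main()
    {
        // Uncommenting the next line still compiles, but the compiler reports
        // warning CS0618 with the message from the attribute above.
        // UnicodeCategory legacy = LegacyCategories.GetUnicodeCategory('\u00B6');

        // The recommended call produces no warning.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u00B6'));
    }
}
```

Existing callers keep compiling and keep their current results; the warning is the only change they see.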

tarekgh · 2018-09-06T00:48:36Z

@GrabYourPitchforks The point is that the full framework has the same behavior, and the compatibility bar there is very high, so I believe we cannot change it there. Changing the behavior in .NET Core could make the same code behave differently when it runs on the full framework. At the same time, in my experience, we didn't get complaints about that. I am not sure we benefit much from changing that now. Anyone who runs into issues because of that can easily correct them. I am not seeing a strong reason to change that now.

jkotas · 2018-09-06T00:59:17Z

The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think.

The compatibility constraints of .NET Core are different from the compatibility constraints of .NET Framework.

GrabYourPitchforks · 2018-09-06T01:22:27Z

At the same time, in my experience, we didn't get complaints about that. I am not sure we benefit much from changing that now. Anyone who runs into issues because of that can easily correct them.

/me raises hand.

I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did.

tarekgh · 2018-09-06T15:36:15Z

The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think. The compatibility constraints of .NET Core are different from the compatibility constraints of .NET Framework.

Just to clarify, we are talking about a few characters in the Latin-1 range, which will never evolve again. We are not talking about the whole set of Unicode characters. We already update the categories for the rest of the Unicode characters. In other words, outside that range CharUnicodeInfo.GetUnicodeCategory and the Char methods already return identical results.

This discrepancy in the Latin-1 range was also introduced for a good reason in the full framework.
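To see which characters are affected on a given runtime, a scan along these lines may help. It is only a sketch: `Latin1CategoryScan` and `CountMismatches` are illustrative names, and the list it prints varies by runtime (it may be empty), so no expected output is shown.

```csharp
using System;
using System.Globalization;

// Illustrative type name; only the char and CharUnicodeInfo calls are real .NET APIs.
public static class Latin1CategoryScan
{
    // Counts code points in U+0000..U+00FF where the two APIs disagree,
    // printing each disagreement as it is found.
    public static int CountMismatches()
    {
        int mismatches = 0;
        for (int i = 0; i <= 0xFF; i++)
        {
            char c = (char)i;
            UnicodeCategory viaChar = char.GetUnicodeCategory(c);
            UnicodeCategory viaInfo = CharUnicodeInfo.GetUnicodeCategory(c);
            if (viaChar != viaInfo)
            {
                mismatches++;
                Console.WriteLine($"U+{i:X4}: Char={viaChar}, CharUnicodeInfo={viaInfo}");
            }
        }
        return mismatches;
    }

    public static void Main()
    {
        Console.WriteLine($"{CountMismatches()} mismatch(es) in U+0000..U+00FF");
    }
}
```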

I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did.

Good, and you know how to fix it now :-)

Anyway, I won't push back any further. I am still not fully convinced we need to do this, but if everyone else sees a need to be aggressive about changing it, I won't resist.

stephentoub · 2020-02-04T06:38:22Z

I've run into this difference recently as well. It is quite confusing that the two methods return different results.

We're working on a major release, .NET 5. It is side-by-side. We are not and do not want to be bug-for-bug compatible. Why not just fix the methods to return the same results? Who are we helping by continuing to ship the discrepancy?

tarekgh · 2020-06-17T01:10:11Z

Moving this to the 5.0 release, since that will be our chance to make such a breaking change.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

maryamariyan added the untriaged label Feb 26, 2020

tarekgh removed the untriaged label Jun 17, 2020

tarekgh modified the milestones: Future, 5.0.0 Jun 17, 2020

tarekgh added the breaking-change label Jun 17, 2020

ericstj added the enhancement label and removed the question label Aug 17, 2020

tarekgh mentioned this issue Aug 22, 2020

Fix Char.GetUnicodeCategory to returns correct results #41200

Merged

tarekgh closed this as completed in #41200 Aug 24, 2020

github-actions bot mentioned this issue Aug 24, 2020

[release/5.0] Fix Char.GetUnicodeCategory to returns correct results #41285

Merged

tarekgh mentioned this issue Aug 25, 2020

Fix \u00A7 char Unicode category #41343

Merged

ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
