-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Char.GetUnicodeCategory returns wrong category for certain Latin-1 characters #10990
Comments
+1 |
Note that other things are also frozen to an old (pre-4.0) Unicode version for compatibility, like Regex: Regex.IsMatch("\u00b6", @"\p{So}") // pilcrow
Regex.IsMatch("\u00ad", @"\p{Pd}") // soft-hyphen |
We were avoiding updating the first 127 characters Unicode categories for compatibility reason because this can break things. And we always recommend using CharUnicodeInfo in general for that. CharUnicodeInfo is not enough for the scenarios want to get the correct category? |
@tarekgh How would our developer audience know that? Yes, it says so on MSDN, but how would they even know that they'd need to consult MSDN on this particular issue? Hell, not even I knew this, and I'm a partial owner of this feature area! If this API is only here for compat and all new code should be using the new API, this is the exact scenario that |
@GrabYourPitchforks the point is we have the same thing in the full framework and the compatibility bar is very high there which I believe we cannot change that there. changing the behavior in .Net Core can be potentially introducing different behavior when the same code runs on the full framework. At the same time in my experience, we didn't get complains about that. I am not sure we are benefiting much of changing that now. anyone runs into issues because of that, can easily be corrected. I am not seeing a strong reason we need to change that now. |
The Unicode categories evolve over time, like all other globalization data. I do not think we want to be freezing the globalization data in .NET Core. The globalization data in .NET Core should be always current. Yes, it will make the behavior different from .NET Framework and that is by design, I think. The compatibility constrains of .NET Core are different from compatibility constrains of .NET Framework. |
/me raises hand. I ran into issues. I am complaining. I am making noise. I am proposing two solutions, one of which perfectly preserves compatibility while making sure that other customers don't run into the same issues I did. |
Just to clarify, we are talking here about a few characters in the ASCII range here which will never evolve again. we are not talking about the whole Unicode characters. We already update the categories for the rest of the Unicode characters. in another word CharUnicodeInfo.GetUnicodeCategory and Char properties are already identical. This discrepancy in the ASCII range was introduced for a good reason too in the full framework.
good and you know how to fix it now :-) anyway, I'll not push back more than that. I am still not fully convinced we need to do that but if everyone else see we need to be aggressive changing that, I'll not resist more. |
I've run into this difference recently as well. It is quite confusing that the two methods return different results. We're working on a major release, .NET 5. It is side-by-side. We are not and do not want to be bug-for-bug compatible. Why not just fix the methods to return the same results? Who are we helping by continuing to ship the discrepancy? |
Moving it to 5.0 release as this will be the chance to have such breaking change. |
In a nutshell, there are certain characters where
CharUnicodeInfo.GetUnicodeCategory
returns the correct value, butChar.GetUnicodeCategory
returns the wrong value. One such character is U+00B6 PILCROW SIGN, whereCharUnicodeInfo
returns OtherPunctuation (which is correct) and whereChar
returns OtherSymbol (which is incorrect). This also affects the behavior of dependent methods likeChar.IsPunctuation
andChar.IsLower
.MSDN says this behavior is intentional to preserve back-compat, but it is extraordinarily confusing to have two methods with the same name have different behavior.
One solution would be to update
Char.GetUnicodeCategory
to stay in sync withCharUnicodeInfo.GetUnicodeCategory
. This is a breaking change, but it's the type of breaking change that is normally allowed in side-by-side major version updates.An alternative is to mark
Char.GetUnicodeCategory
,Char.IsPunctuation
,Char.IsLower
, etc. as obsolete and to direct users to call intoCharUnicodeInfo
instead. This preserves existing behavior and provides a migration story to get developers on to the APIs which provide correct results.The text was updated successfully, but these errors were encountered: