Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringComparison.OrdinalIgnoreCase compares "¡a" and "¡B" incorrectly #71018

Closed
bgrainger opened this issue Jun 20, 2022 · 6 comments · Fixed by #71022
Closed

StringComparison.OrdinalIgnoreCase compares "¡a" and "¡B" incorrectly #71018

bgrainger opened this issue Jun 20, 2022 · 6 comments · Fixed by #71022

Comments

@bgrainger
Copy link
Contributor

bgrainger commented Jun 20, 2022

Description

In net5.0 and net48 on Windows, string.Compare("¡a", "¡B", StringComparison.OrdinalIgnoreCase) returns a value < 0 (specifically -1).

But In net6.0 and net7.0, that expression returns a value > 0 (specifically, 31).

The net5.0 result fulfills the meaning of StringComparison.OrdinalIgnoreCase; the net6.0 result does not.

Setting $env:DOTNET_SYSTEM_GLOBALIZATION_USENLS='true' restores the net5.0 behaviour, which indicates that this may be related to #30960?

(Even if this is the actual result returned by ICU for the comparison--and I'm not sure if that's true or not--it doesn't match this programmer's expectations for what StringComparison.OrdinalIgnoreCase means.)

Reproduction Steps

Program.cs:

using System;

// prints -1 for prefix <= U+007F, 31 for prefix >= U+0080
string prefix = "\u00A1";
Console.WriteLine(string.Compare($"{prefix}a", $"{prefix}B", StringComparison.OrdinalIgnoreCase));

Compare.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net48;net5.0;net6.0;net7.0</TargetFrameworks>
    <LangVersion>10.0</LangVersion>
  </PropertyGroup>

</Project>

Expected behavior

Per https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings#ordinal-string-operations:

Case-insensitive ordinal comparisons are the next most conservative approach. These comparisons ignore most casing; for example, "windows" matches "Windows". When dealing with ASCII characters, this policy is equivalent to StringComparison.Ordinal, except that it ignores the usual ASCII casing. Therefore, any character in [A, Z] (\u0041-\u005A) matches the corresponding character in [a,z] (\u0061-\007A). Casing outside the ASCII range uses the invariant culture's tables.

Thus, it is expected that ¡ will be considered equal in both strings, then a will be compared to B and the first string will sort first by an case-insensitive ordinal comparison. That is, "¡a" sorts before "¡B" using a case-insensitive ordinal comparison.

Actual behavior

"¡a" sorts after "¡B" using a case-insensitive ordinal comparison.

Regression?

Yes. This worked correctly in net48 and net5.0 on Windows and Linux; I have not tested net5.0 and earlier on macOS.

Known Workarounds

Use StringComparison.InvariantCultureIgnoreCase.
Set the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable to true.

Configuration

SDKs: 6.0.301; 7.0.100-preview.5.22307.18
Windows 10 19044.1766 x64

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 20, 2022
@ghost
Copy link

ghost commented Jun 20, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

In net5.0 and net48 on Windows, string.Compare("¡a", "¡B", StringComparison.OrdinalIgnoreCase) returns a value < 0 (specifically -1).

But In net6.0 and net7.0, that expression returns a value > 0 (specifically, 31).

The net5.0 result fulfills the meaning of StringComparison.OrdinalIgnoreCase; the net6.0 result does not.

Setting $env:DOTNET_SYSTEM_GLOBALIZATION_USENLS='true' restores the net5.0 behaviour, which indicates that this may be related to #30960?

(Even if this is the actual result returned by ICU for the comparison--and I'm not sure if that's true or not--it doesn't match this programmer's expectations for what StringComparison.OrdinalIgnoreCase means.)

Reproduction Steps

Program.cs:

using System;

// prints -1 for prefix <= U+007F, 31 for prefix >= U+0080
string prefix = "\u00A1";
Console.WriteLine(string.Compare($"{prefix}a", $"{prefix}B", StringComparison.OrdinalIgnoreCase));

Compare.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net48;net5.0;net6.0;net7.0</TargetFrameworks>
    <LangVersion>10.0</LangVersion>
  </PropertyGroup>

</Project>

Expected behavior

Per https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings#ordinal-string-operations:

Case-insensitive ordinal comparisons are the next most conservative approach. These comparisons ignore most casing; for example, "windows" matches "Windows". When dealing with ASCII characters, this policy is equivalent to StringComparison.Ordinal, except that it ignores the usual ASCII casing. Therefore, any character in [A, Z] (\u0041-\u005A) matches the corresponding character in [a,z] (\u0061-\007A). Casing outside the ASCII range uses the invariant culture's tables.

Thus, it is expected that ¡ will be considered equal in both strings, then a will be compared to B and the first string will sort first by an case-insensitive ordinal comparison. That is, "¡a" sorts before "¡B" using a case-insensitive ordinal comparison.

Actual behavior

"¡a" sorts after "¡B" using a case-insensitive ordinal comparison.

Regression?

Yes. This worked correctly in net48 and net5.0 on Windows; I have not tested net5.0 and earlier on macOS or Linux.

Known Workarounds

Use StringComparison.InvariantCultureIgnoreCase.
Set the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable to true.

Configuration

SDKs: 6.0.301; 7.0.100-preview.5.22307.18
Windows 10 19044.1766 x64

Other information

No response

Author: bgrainger
Assignees: -
Labels:

area-System.Globalization

Milestone: -

@bgrainger
Copy link
Contributor Author

The prefix can be arbitrarily long (e.g., string prefix = "\u00A1aaaaaaaaaaaaaaaaaaaaaaaa";); the common factor seems to be that case differences between different ASCII characters are counted as significant after the first character with code point U+0080 or higher.

@tarekgh
Copy link
Member

tarekgh commented Jun 20, 2022

@bgrainger thanks for filing the issue. Yes, this is a bug.

@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Jun 20, 2022
@tarekgh tarekgh added this to the 7.0.0 milestone Jun 20, 2022
@tarekgh tarekgh self-assigned this Jun 20, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 20, 2022
@tarekgh
Copy link
Member

tarekgh commented Jun 21, 2022

@bgrainger just checking if this issue is a blocker issue for you for using .NET 6.0? I am asking to know if we need to consider porting it to .NET 6.0 or we can just fix it in .NET 7.0?

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 21, 2022
@bgrainger
Copy link
Contributor Author

No, this isn't strictly a blocker for me. It did break my code's ability to read (using a binary search) some sorted data that was compiled into the assembly as an embedded resource; however, I was able to restore functionality across all .NET versions by recompiling the assembly (and the embedded data) with InvariantCultureIgnoreCase as the sort order for that data. ( haven't timed this workaround to see if it causes a slight reduction in performance.)

IMO it is a serious regression to leave unfixed in .NET 6.0. Any code that has sorted data according to StringComparer.OrdinalIgnoreCase and persisted it could (a) fail to read it correctly when running under .NET 6, or (b) create invalid data if building it under .NET 6.0. Thanks for your consideration.

@tarekgh
Copy link
Member

tarekgh commented Jun 21, 2022

@bgrainger thanks again for the information. We have decided to wait a little to see if there is a demand or more scenarios broken because of that. We'll keep our eyes open on that. If you came across more data regarding this issue, please share it with us. This can help in deciding in porting the fix.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants