string.ToUpperInvariant("μ") gives different results on different platforms #54278

Closed
anthony-c-martin opened this issue Jun 16, 2021 · 11 comments
Labels
area-System.Globalization untriaged New issue has not been triaged by the area owner

Comments

@anthony-c-martin
Member

Description

We're using string.ToUpperInvariant() to normalize a JSON blob for case-insensitive hashing, and noticed that if the string contains a "μ", the hash is different across different platforms.

Results from the following GitHub Actions build agents:

  • windows-latest: string.ToUpperInvariant("μ") -> μ
  • ubuntu-latest: string.ToUpperInvariant("μ") -> M
  • macos-latest: string.ToUpperInvariant("μ") -> M

I can set up a repo with GitHub Actions to repro this if that's helpful here.
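Since "μ" can be either U+00B5 (micro sign) or U+03BC (Greek small letter mu), and the two render almost identically, it may help to compare code points rather than glyphs when reproducing this. Below is a minimal sketch of such a helper; the names `CodePoints` and `Describe` are hypothetical and not part of any library.

```csharp
using System;
using System.Linq;

public static class CodePoints
{
    // Formats each UTF-16 code unit of the string as U+XXXX,
    // so that look-alike characters can be told apart in logs.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));
}
```

Logging `CodePoints.Describe(input)` next to `CodePoints.Describe(input.ToUpperInvariant())` on each build agent would show which character went in and which one came out.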

Configuration

Regression?

Unsure - only noticed when exercising this functionality for the first time.

Other information

N/A

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ufcpp
Contributor

ufcpp commented Jun 16, 2021

https://docs.microsoft.com/en-us/dotnet/core/compatibility/globalization/5.0/icu-globalization-api
https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu

<ItemGroup>
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
</ItemGroup>

using System;

// Micro Sign
Console.WriteLine(char.ToUpperInvariant('\u00B5')); // U+039C in ICU, U+00B5 in NLS

// Greek Small Letter Mu
Console.WriteLine(char.ToUpperInvariant('\u03BC')); // U+039C in both ICU and NLS

@ghost

ghost commented Jun 16, 2021

Tagging subscribers to this area: @tarekgh, @safern
See info in area-owners.md if you want to be subscribed.

@GrabYourPitchforks
Member

@tarekgh Here's another example of a use case that could benefit from a first-party "case folding" API, as ToUpperInvariant isn't guaranteed to remain stable across .NET releases or across versions of ICU.

@tarekgh
Member

tarekgh commented Jun 16, 2021

@anthony-c-martin you are using Windows Server 2019, which by default does not include the ICU library we depend on. That is why you are seeing the difference.

You can try including ICU in your app when running on Windows Server 2019. You can do that by adding the following to your project file:

<ItemGroup>
  <PackageReference Include="Microsoft.ICU.ICU4C.Runtime" Version="68.2.0.6" />
  <RuntimeHostConfigurationOption Include="System.Globalization.AppLocalIcu" Value="68.2" />
</ItemGroup>

This should help.

@GrabYourPitchforks

Here's another example of a use case that could benefit from a first-party "case folding" API, as ToUpperInvariant isn't guaranteed to remain stable across .NET releases or across versions of ICU.

Going forward, users on ICU should see consistent results. This applies not only to casing operations but to collation operations too. I am not opposing the idea; I am just trying to keep casing and collation consistent, even for Invariant. That is why I would prefer that we expose case folding as its own API and not tie it to Invariant.

I am closing this issue as we already have issue #20674 tracking the request. @anthony-c-martin feel free to send any more questions if you have any. Thanks for your report.

tarekgh closed this as completed Jun 16, 2021

@majastrz

Does the RuntimeHostConfigurationOption workaround also work with the single-file self-contained apps?

@tarekgh
Member

tarekgh commented Jun 16, 2021

Does the RuntimeHostConfigurationOption workaround also work with the single-file self-contained apps?

Yes, it does.

@anthony-c-martin
Member Author

anthony-c-martin commented Jun 17, 2021

@tarekgh - do you have any guidance for authors of shared libraries targeting .NET Standard? In our scenario, the logic which relies upon .ToUpperInvariant() lives in a NuGet package which targets netstandard2.0, and is consumed by various .NET 5 applications.

We can go and fix the applications we own which depend on this package, but is there a way we can guarantee that any consumer of this package will see consistent behavior?

@tarekgh
Member

tarekgh commented Jun 17, 2021

@anthony-c-martin could you tell us more about your scenario? You mentioned you are hashing a JSON blob. Where does this blob come from, and how are you using its hash?

@anthony-c-martin
Member Author

anthony-c-martin commented Jun 17, 2021

@anthony-c-martin could you tell us more about your scenario? You mentioned you are hashing a JSON blob. Where does this blob come from, and how are you using its hash?

It's a user-defined document (example here) which is usually either user-authored, or programmatically generated. The API which handles this document is case-insensitive, so we want a means of generating a hash to compare two documents for (case-insensitive) equivalence - mostly for telemetry. We also own some cross-platform client tools which work with templates, so we'd like to ensure we can reliably generate the same hash client-side.

The hashing logic we run on the JSON document roughly looks like:

  1. Strip non-semantic whitespace
  2. Normalize object ordering
  3. Uppercase everything (using string.ToUpperInvariant())
  4. Calculate a hash with a standard hashing algorithm
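
For readers following along, the four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the package's actual code: the class name `DocumentHasher`, the use of System.Text.Json, the choice of SHA-256, and sorting properties by their uppercased names are all assumptions made for the example. Step 3 is the part this issue is about, so the resulting hash can differ between ICU and NLS when the document contains characters such as U+00B5.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

public static class DocumentHasher
{
    // Returns a hex-encoded SHA-256 hash of a case-insensitive canonical form of the JSON text.
    public static string Hash(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        using var buffer = new MemoryStream();

        // Keep non-ASCII BMP characters unescaped so that step 3 actually sees them.
        var options = new JsonWriterOptions { Encoder = JavaScriptEncoder.Create(UnicodeRanges.All) };
        using (var writer = new Utf8JsonWriter(buffer, options))
        {
            // Steps 1 and 2: re-emit without insignificant whitespace, with ordered properties.
            WriteCanonical(doc.RootElement, writer);
        }

        // Step 3: uppercase everything. This is the platform-sensitive call discussed in this issue.
        string upper = Encoding.UTF8.GetString(buffer.ToArray()).ToUpperInvariant();

        // Step 4: hash the normalized text (SHA-256 is an assumption).
        using SHA256 sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(upper));
        return BitConverter.ToString(digest).Replace("-", "");
    }

    private static void WriteCanonical(JsonElement element, Utf8JsonWriter writer)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                writer.WriteStartObject();
                // Order by the uppercased name so that "a" and "A" land in the same position.
                foreach (JsonProperty property in element.EnumerateObject()
                             .OrderBy(p => p.Name.ToUpperInvariant(), StringComparer.Ordinal))
                {
                    writer.WritePropertyName(property.Name);
                    WriteCanonical(property.Value, writer);
                }
                writer.WriteEndObject();
                break;

            case JsonValueKind.Array:
                // Array order is semantic, so it is preserved.
                writer.WriteStartArray();
                foreach (JsonElement item in element.EnumerateArray())
                {
                    WriteCanonical(item, writer);
                }
                writer.WriteEndArray();
                break;

            default:
                // Strings, numbers, true, false and null are copied through unchanged.
                element.WriteTo(writer);
                break;
        }
    }
}
```

With this sketch, `DocumentHasher.Hash("{\"b\":1,\"a\":\"x\"}")` and `DocumentHasher.Hash("{ \"A\": \"X\", \"B\": 1 }")` produce the same value on a given machine, since the inputs differ only in case, whitespace and property order. Whether two machines agree still depends on what `ToUpperInvariant` does on each of them.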

@tarekgh
Member

tarekgh commented Jun 17, 2021

Thanks @anthony-c-martin for sharing the details. If you want to always guarantee the exact same casing/hashing between the server and client, you'll need to do one of the following:

@ghost ghost locked as resolved and limited conversation to collaborators Jul 17, 2021