Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Uri.GetHashCode #32713

Merged
merged 6 commits into from
Apr 29, 2020
Merged

Conversation

MihaZupan
Copy link
Member

Use text.GetHashCode(StringComparison.OrdinalIgnoreCase) instead of text.ToLowerInvariant().GetHashCode().

For relative Uris also cache the computed hash like we do for absolute Uris. This makes dictionary lookups on relative Uris perform 2-5000 times faster depending on length (field access instead of string allocation and hash computation).

For absolute Uris, cache the RemoteUrl we just computed instead of discarding it.

@MihaZupan MihaZupan added this to the 5.0 milestone Feb 23, 2020
@MihaZupan MihaZupan requested review from stephentoub and a team February 23, 2020 14:23
@stephentoub
Copy link
Member

Use text.GetHashCode(StringComparison.OrdinalIgnoreCase) instead of text.ToLowerInvariant().GetHashCode().

Does any form of Uri equality check use invariant culture? If yes, this could cause a break, as instances that are considered equal may no longer map to the same hash code.

@stephentoub
Copy link
Member

cc: @tarekgh, @GrabYourPitchforks

@MihaZupan
Copy link
Member Author

Does any form of Uri equality check use invariant culture?

No, it uses Ordinal/OrdinalIgnoreCase

@stephentoub
Copy link
Member

For relative Uris also cache the computed hash like we do for absolute Uris.

It appears this was being special-cased very explicitly. Do we know why?

@MihaZupan
Copy link
Member Author

MihaZupan commented Feb 23, 2020

It appears this was being special-cased very explicitly. Do we know why?

For relative Uris, the equality comparison is an ordinal comparison of the entire string, so GetHashCode need not do any more processing and it is enough to only compute the case-insensitive hash of the string.

For AbsoluteUri, the equality check ultimately comes down to (sometimes case-insensitive) ordinal equality of a normalized representation of everything but the fragment and UserInfo. That is what the call to GetParts(UriComponents.HttpRequestUrl, UriFormat.SafeUnescaped) is fetching.

@davidsh
Copy link
Contributor

davidsh commented Feb 23, 2020

If yes, this could cause a break

Changing the hash codes of existing Uri's is a 'breaking change'. We know that we declined to do things like this in .NET Framework between versions. Many of our customers have databases with Uri's in them. So, we should be careful here.

@MihaZupan
Copy link
Member Author

MihaZupan commented Feb 23, 2020

In fact, a relative Uri can never be equal to an absolute one, and they are matched using an ordinal, case-sensitive comparison, so there is no need for us to compute a non-case-sensitive hash for GetHashCode (for relative Uris).

@MihaZupan
Copy link
Member Author

Changing the hash codes of existing Uri's is a 'breaking change'. We know that we declined to do things like this in .NET Framework between versions. Many of our customers have databases with Uri's in them. So, we should be careful here.

As the hash computation is handed over to string hashing, that has already been broken when the change to randomized string hashing was introduced.

Also, GetHashCode documentation has very clear warnings against such use.

@scalablecory
Copy link
Contributor

As the hash computation is handed over to string hashing, that has already been broken when the change to randomized string hashing was introduced.

Perhaps @GrabYourPitchforks can help us understand what we learned from doing that. Maybe we are happy with it, or maybe we would rethink doing it again. At the very least we could better understand what sort of response we would get from our users.

@GrabYourPitchforks
Copy link
Member

As a reminder, I explicitly did not change this code path as part of the related work in #31968 because someString.ToLowerInvariant().GetHashCode() and someString.GetHashCode(StringComparison.OrdinalIgnoreCase) are behaviorally different.

Now, it's possible that this code path should've been using GetHashCode(OrdinalIgnoreCase) all along. But that would be a bug fix (and I wasn't qualified to make a judgment call on whether the original URI behavior was incorrect), not a perf improvement.

@GrabYourPitchforks
Copy link
Member

To expand a little bit on what I said above, the only required relationship between object.Equals and object.GetHashCode is as follows.

Object.Equals(a, b) ⇒ a.GetHashCode() == a.GetHashCode()
StringComparer.FromComparison(comparison).Equals(a, b) ⇒ a.GetHashCode(comparison) == b.GetHashCode(comparison)

And the contrapositives:

a.GetHashCode() != b.GetHashCode() ⇒ !Object.Equals(a, b)
a.GetHashCode(comparison) != b.GetHashCode(comparison) ⇒ !StringComparer.FromComparison(comparison).Equals(a, b)

If you have a call to someString.ToLowerInvariant().GetHashCode(), we would expect to see it paired with the expression stringA.ToLowerInvariant() == stringB.ToLowerInvariant() at some point. Similarly, if you have a call to someString.GetHashCode(StringComparison.OrdinalIgnoreCase), we would expect to see it paired with the expression string.Equals(stringA, stringB, StringComparison.OrdinalIgnoreCase) at some point.

Importantly, these two expressions do not have the same behavior. Converting two strings to lowercase and checking them for equality is not the same as checking the two strings for equality using an OrdinalIgnoreCase comparison.

For example depending on operating system and environmental configuration:

"ς".ToLowerInvariant() == "σ".ToLowerInvariant(); // currently always FALSE across all supported OSes
string.Equals("ς", "σ", StringComparison.OrdinalIgnoreCase); // may return TRUE depending on OS

@MihaZupan
Copy link
Member Author

MihaZupan commented Feb 23, 2020

Importantly, these two expressions do not have the same behavior. Converting two strings to lowercase and checking them for equality is not the same as checking the two strings for equality using an OrdinalIgnoreCase comparison.

Now, it's possible that this code path should've been using GetHashCode(OrdinalIgnoreCase) all along. But that would be a bug fix (and I wasn't qualified to make a judgment call on whether the original URI behavior was incorrect), not a perf improvement.

The current GetHashCode/Equals does not honor that. It uses ToLowerInvariant for GetHashCode and Ordinal(IgnoreCase) for equals. So that makes this change a bug-fix as well.

@MihaZupan
Copy link
Member Author

The failing test assumes that GetHashCode is always ignoring casing, which is no longer true after this change. I will update the test to only expect that for absolute Uris that are Unc or Dos paths.

if ((object?)info.MoreInfo == null)
{
info.MoreInfo = new MoreInfo();
MoreInfo moreInfo = (_info ??= new UriInfo()).MoreInfo ??= new MoreInfo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this code path, just return OriginalString.GetHashCode() directly. There's no need to complicate the code path by caching the result. string.GetHashCode() is already fairly efficient, and if we need to optimize it even further by introducing caching we should make the improvements directly in string itself, not here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a dedicated field on the MoreInfo for caching this hash, it just wasn't being stored for relative Uris before.

Having a cached hash can also be used as another fast-path optimization in Equals.

There's no need to complicate the code path by caching the result. string.GetHashCode() is already fairly efficient

Should we also not cache it for AbsoluteUris? If so, we can remove that field to save a pointer size on the allocation by trading GetHashCode/Equals perf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a dedicated field on the MoreInfo for caching this hash

I'd guess that this caching mechanism wasn't intended to work around any perceived slowness in string.GetHashCode as much as it was to work around the possible allocation in the below code path.

internal static int CalculateCaseInsensitiveHashCode(string text)
{
return text.ToLowerInvariant().GetHashCode();
}

For absolute URIs there's still the possibility for allocation in the hash code computation routine, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For absolute URIs there's still the possibility for allocation in the hash code computation routine, right?

Yes, but only once. GetHashCode and Equals calls will cache the value in MoreInfo.RemoteUrl if/when they compute it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess that this caching mechanism wasn't intended to work around any perceived slowness in string.GetHashCode as much as it was to work around the possible allocation in the below code path.

I don't know why it was not cached for the relative uri path, I assume it's because Info and MoreInfo are not used for anything by relative Uris otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the caching of the hash.

chkString = GetParts(UriComponents.HttpRequestUrl, UriFormat.SafeUnescaped);
tempHash = CalculateCaseInsensitiveHashCode(chkString);
if (tempHash == 0)
MoreInfo moreInfo = EnsureUriInfo().MoreInfo ??= new MoreInfo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephentoub - how far down the rabbit hole do we want to go here with regard to null coalescing syntax? This is a clever application but I admit it might affect maintainability of code since it involves null coalescing plus a possible assignment to a field on an object (not this) that's returned as part of a function call plus a dereference of that field, all in the same line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to not take it too far :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With #33042, it becomes MoreInfo moreInfo = EnsureUriInfo().MoreInfo;

@GrabYourPitchforks
Copy link
Member

I don't have concerns about using randomized hash code calculations here. The updated call sites look fairly straightforward. The NCL team should double-check the comparers used to make sure they're appropriate for the specific scenarios. (Thanks @MihaZupan for finding the discrepancy!)

@MihaZupan
Copy link
Member Author

MihaZupan commented Feb 24, 2020

I will likely be opening a PR focused on the Equals tomorrow soon

Hash codes being equal doesn't mean that the values are equal
@MihaZupan MihaZupan merged commit b4b2c46 into dotnet:master Apr 29, 2020
@MihaZupan MihaZupan mentioned this pull request May 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants