-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Uri.GetHashCode #32713
Optimize Uri.GetHashCode #32713
Conversation
Does any form of Uri equality check use invariant culture? If yes, this could cause a break, as instances that are considered equal may no longer map to the same hash code. |
No, it uses Ordinal/OrdinalIgnoreCase |
It appears this was being special-cased very explicitly. Do we know why? |
For relative Uris, the equality comparison is an ordinal comparison of the entire string, so GetHashCode need not do any more processing and it is enough to only compute the case-insensitive hash of the string. For AbsoluteUri, the equality check ultimately comes down to (sometimes case-insensitive) ordinal equality of a normalized representation of everything but the fragment and UserInfo. That is what the call to |
Changing the hash codes of existing Uri's is a 'breaking change'. We know that we declined to do things like this in .NET Framework between versions. Many of our customers have databases with Uri's in them. So, we should be careful here. |
In fact, a relative Uri can never be equal to an absolute one, and they are matched using an ordinal, case-sensitive comparison, so there is no need for us to compute a non-case-sensitive hash for GetHashCode (for relative Uris). |
As the hash computation is handed over to string hashing, that has already been broken when the change to randomized string hashing was introduced. Also, GetHashCode documentation has very clear warnings against such use. |
Perhaps @GrabYourPitchforks can help us understand what we learned from doing that. Maybe we are happy with it, or maybe we would rethink doing it again. At the very least we could better understand what sort of response we would get from our users. |
As a reminder, I explicitly did not change this code path as part of the related work in #31968 because Now, it's possible that this code path should've been using |
To expand a little bit on what I said above, the only required relationship between Object.Equals(a, b) ⇒ a.GetHashCode() == a.GetHashCode()
StringComparer.FromComparison(comparison).Equals(a, b) ⇒ a.GetHashCode(comparison) == b.GetHashCode(comparison) And the contrapositives: a.GetHashCode() != b.GetHashCode() ⇒ !Object.Equals(a, b)
a.GetHashCode(comparison) != b.GetHashCode(comparison) ⇒ !StringComparer.FromComparison(comparison).Equals(a, b) If you have a call to Importantly, these two expressions do not have the same behavior. Converting two strings to lowercase and checking them for equality is not the same as checking the two strings for equality using an For example depending on operating system and environmental configuration: "ς".ToLowerInvariant() == "σ".ToLowerInvariant(); // currently always FALSE across all supported OSes
string.Equals("ς", "σ", StringComparison.OrdinalIgnoreCase); // may return TRUE depending on OS |
The current GetHashCode/Equals does not honor that. It uses ToLowerInvariant for GetHashCode and Ordinal(IgnoreCase) for equals. So that makes this change a bug-fix as well. |
The failing test assumes that GetHashCode is always ignoring casing, which is no longer true after this change. I will update the test to only expect that for absolute Uris that are Unc or Dos paths. |
if ((object?)info.MoreInfo == null) | ||
{ | ||
info.MoreInfo = new MoreInfo(); | ||
MoreInfo moreInfo = (_info ??= new UriInfo()).MoreInfo ??= new MoreInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this code path, just return OriginalString.GetHashCode()
directly. There's no need to complicate the code path by caching the result. string.GetHashCode()
is already fairly efficient, and if we need to optimize it even further by introducing caching we should make the improvements directly in string
itself, not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a dedicated field on the MoreInfo for caching this hash, it just wasn't being stored for relative Uris before.
Having a cached hash can also be used as another fast-path optimization in Equals.
There's no need to complicate the code path by caching the result. string.GetHashCode() is already fairly efficient
Should we also not cache it for AbsoluteUris? If so, we can remove that field to save a pointer size on the allocation by trading GetHashCode/Equals perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a dedicated field on the MoreInfo for caching this hash
I'd guess that this caching mechanism wasn't intended to work around any perceived slowness in string.GetHashCode
as much as it was to work around the possible allocation in the below code path.
runtime/src/libraries/System.Private.Uri/src/System/Uri.cs
Lines 4946 to 4949 in 106eebd
internal static int CalculateCaseInsensitiveHashCode(string text) | |
{ | |
return text.ToLowerInvariant().GetHashCode(); | |
} |
For absolute URIs there's still the possibility for allocation in the hash code computation routine, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For absolute URIs there's still the possibility for allocation in the hash code computation routine, right?
Yes, but only once. GetHashCode and Equals calls will cache the value in MoreInfo.RemoteUrl if/when they compute it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd guess that this caching mechanism wasn't intended to work around any perceived slowness in string.GetHashCode as much as it was to work around the possible allocation in the below code path.
I don't know why it was not cached for the relative uri path, I assume it's because Info and MoreInfo are not used for anything by relative Uris otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the caching of the hash.
chkString = GetParts(UriComponents.HttpRequestUrl, UriFormat.SafeUnescaped); | ||
tempHash = CalculateCaseInsensitiveHashCode(chkString); | ||
if (tempHash == 0) | ||
MoreInfo moreInfo = EnsureUriInfo().MoreInfo ??= new MoreInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub - how far down the rabbit hole do we want to go here with regard to null coalescing syntax? This is a clever application but I admit it might affect maintainability of code since it involves null coalescing plus a possible assignment to a field on an object (not this) that's returned as part of a function call plus a dereference of that field, all in the same line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to not take it too far :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With #33042, it becomes MoreInfo moreInfo = EnsureUriInfo().MoreInfo;
I don't have concerns about using randomized hash code calculations here. The updated call sites look fairly straightforward. The NCL team should double-check the comparers used to make sure they're appropriate for the specific scenarios. (Thanks @MihaZupan for finding the discrepancy!) |
I will likely be opening a PR focused on the Equals |
Hash codes being equal doesn't mean that the values are equal
Use
text.GetHashCode(StringComparison.OrdinalIgnoreCase)
instead oftext.ToLowerInvariant().GetHashCode()
.For relative Uris also cache the computed hash like we do for absolute Uris. This makes dictionary lookups on relative Uris perform 2-5000 times faster depending on length (field access instead of string allocation and hash computation).
For absolute Uris, cache the
RemoteUrl
we just computed instead of discarding it.