Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SocketsHttpHandler.ResponseHeaderEncodingSelector may be ignored for Location headers #95799

Closed
MihaZupan opened this issue Dec 8, 2023 · 1 comment · Fixed by #95810
Closed

Comments

@MihaZupan
Copy link
Member

SocketsHttpHandler will generally assume headers were encoded as Latin1.
The location header is an exception where we'll also check if the bytes look like UTF-8 and decode them using that if they do.

else if (knownHeader == KnownHeaders.Location)
{
// Normally Location should be in ISO-8859-1 but occasionally some servers respond with UTF-8.
if (TryDecodeUtf8(headerValue, out string? decoded))
{
return decoded;
}
}

The handler also has the option for users to specify the encoding to use for each header via ResponseHeaderEncodingSelector, but in the case of the Location header, it may still use UTF-8, overriding the user's wishes to use a different encoding.

This is mostly harmless for most uses of HttpClient but does impact scenarios where the user is expecting a different encoding to be used, even if the bytes do look like UTF-8.
For example with YARP, the user may decide to set Latin1 as the encoding to use for all headers in all directions. No matter what encoding the values are actually using, we'll use Latin1. While the string representations of those headers may look like garbage, Latin1 decoding and then Latin1 encoding some bytes will result in the same set of bytes, thus passing values through the proxy without data loss. That is, two sets of data corruption are canceling each other out.

But if SocketsHttpHandler decides to use a different encoding, this no longer holds. While the string representation is now likely going to be "correct", the value will now be modified when encoded using Latin1.

Given that ResponseHeaderEncodingSelector gives the user the option to specify the encoding to use for each header, I believe we should be honoring that first if set, before trying to guess what encoding was used.


Side note: Even if the bytes are valid UTF-8, that doesn't mean they actually are UTF-8. There are valid sequences of Latin1 encoded text that will produce byte sequences that also happen to be valid UTF-8. For such cases, SocketsHttpHandler is silently corrupting the data (the string header value will be nonsense).

@ghost
Copy link

ghost commented Dec 8, 2023

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

SocketsHttpHandler will generally assume headers were encoded as Latin1.
The location header is an exception where we'll also check if the bytes look like UTF-8 and decode them using that if they do.

else if (knownHeader == KnownHeaders.Location)
{
// Normally Location should be in ISO-8859-1 but occasionally some servers respond with UTF-8.
if (TryDecodeUtf8(headerValue, out string? decoded))
{
return decoded;
}
}

The handler also has the option for users to specify the encoding to use for each header via ResponseHeaderEncodingSelector, but in the case of the Location header, it may still use UTF-8, overriding the user's wishes to use a different encoding.

This is mostly harmless for most uses of HttpClient but does impact scenarios where the user is expecting a different encoding to be used, even if the bytes do look like UTF-8.
For example with YARP, the user may decide to set Latin1 as the encoding to use for all headers in all directions. No matter what encoding the values are actually using, we'll use Latin1. While the string representations of those headers may look like garbage, Latin1 decoding and then Latin1 encoding some bytes will result in the same set of bytes, thus passing values through the proxy without data loss. That is, two sets of data corruption are canceling each other out.

But if SocketsHttpHandler decides to use a different encoding, this no longer holds. While the string representation is now likely going to be "correct", the value will now be modified when encoded using Latin1.

Given that ResponseHeaderEncodingSelector gives the user the option to specify the encoding to use for each header, I believe we should be honoring that first if set, before trying to guess what encoding was used.


Side note: Even if the bytes are valid UTF-8, that doesn't mean they actually are UTF-8. There are valid sequences of Latin1 encoded text that will produce byte sequences that also happen to be valid UTF-8. For such cases, SocketsHttpHandler is silently corrupting the data (the string header value will be nonsense).

Author: MihaZupan
Assignees: -
Labels:

bug, area-System.Net.Http

Milestone: -

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 8, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 8, 2023
@MihaZupan MihaZupan added this to the 9.0.0 milestone Dec 8, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Dec 8, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 10, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant