Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xslt document function fails when path Contains Private Area Unicode Characters #59353

Open
Tracked by #64602
benvillalobos opened this issue Sep 20, 2021 · 12 comments
Open
Tracked by #64602

Comments

@benvillalobos
Copy link
Member

benvillalobos commented Sep 20, 2021

Description

Original issue: dotnet/msbuild#6847
MSBuild's PR: dotnet/msbuild#6863

Initializing an XmlReader using XmlReader.Create(<path-to-file>) where the path contains characters from the unicode private use area causes issues at file load. Part of the path that's the culprit: Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]

The workaround is to call XmlReader.Create using a StreamReader. However, when MSBuild is calling its XslTransform task on a file in this type of path, and in the xsl file it calls the Document function on a file that also exists in a path like this, it will break. See details here: dotnet/msbuild#6863 (comment)

This comment explains the issue. You hit an exception when calling the document function in an xslt file when that xslt file is under a path that contains private area unicode characters.

Xslt file example:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0"

xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">

<xsl:value-of select="document('b.xml')"/>
</xsl:template>

</xsl:stylesheet>

Place that next to a b.xml in a path that contains Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[].

Configuration

Full framework & net core 6.0.100-preview.7.21379.14

Regression?

No

Other information

See original issue / MSBuild PR for in-depth notes.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Xml untriaged New issue has not been triaged by the area owner labels Sep 20, 2021
@ghost
Copy link

ghost commented Sep 20, 2021

Tagging subscribers to this area: @buyaa-n, @krwq
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Original issue: dotnet/msbuild#6847
MSBuild's PR: dotnet/msbuild#6863

Initializing an XmlReader using XmlReader.Create(<path-to-file>) where the path contains characters from the unicode private use area causes issues at file load. Part of the path that's the culprit: Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]

The workaround is to call XmlReader.Create using a StreamReader. However, when MSBuild is calling its XslTransform task on a file in this type of path, and in the xsl file it calls the Document function on a file that also exists in a path like this, it will break. See details here: dotnet/msbuild#6863 (comment)

Configuration

Full framework & net core 6.0.100-preview.7.21379.14

Regression?

No

Other information

See original issue / MSBuild PR for in-depth notes.

Author: BenVillalobos
Assignees: -
Labels:

area-System.Xml, untriaged

Milestone: -

@krwq
Copy link
Member

krwq commented Oct 1, 2021

@benvillalobos can you provide a minimal repro? Ideally something which creates a file on the fly and uses \uXYZV notation for non-english characters

string path = @"Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵.xml";
string content = @"<root></root>";
File.WriteAllText(path, content);
using XmlReader reader = XmlReader.Create(path);

and getting no exception.

What's the exception you're getting?

@krwq krwq added needs more info and removed untriaged New issue has not been triaged by the area owner labels Oct 1, 2021
@benvillalobos
Copy link
Member Author

You'll need to include U1[]U2[]U3[] as part of the path. Adding that and using your code in an empty console proj caused the repro for me.

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs author feedback labels Oct 1, 2021
@jeffhandley jeffhandley added this to the Future milestone Oct 4, 2021
@ghost ghost removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Oct 4, 2021
@krwq
Copy link
Member

krwq commented Oct 5, 2021

@benvillalobos what is the exception you're seeing?

@krwq krwq modified the milestones: Future, 7.0.0 Oct 5, 2021
@krwq
Copy link
Member

krwq commented Oct 5, 2021

from the e-mail conversation this might be blocking msbuild team so changing milestone

@benvillalobos
Copy link
Member Author

OH, apologies. The title for this issue was incorrect. See this comment for details: dotnet/msbuild#6863 (comment)

Exception gets thrown on this line https://github.com/dotnet/msbuild/blob/main/src/Tasks/XslTransformation.cs#L176

In summary, when transforming an xslt doc who's path contains the unicode mess (Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]), you'll hit an XslTransformException if that xslt calls the document function to find another xml file next to it.

@benvillalobos benvillalobos changed the title XmlReader.Create Fails When Path Contains Private Area Unicode Characters Xslt document function fails when path Contains Private Area Unicode Characters Oct 5, 2021
@benvillalobos
Copy link
Member Author

Also this isn't a blocking issue. We worked around one aspect of the issue which was loading the xslt doc in the first place. But we also see the error when trying to load another document while transforming the first.

@krwq
Copy link
Member

krwq commented Oct 5, 2021

@benvillalobos any chance you could create a project with a tiny repro?

@benvillalobos
Copy link
Member Author

benvillalobos commented Oct 6, 2021

proj1.zip

Steps to reproduce:
Download zip, create a folder next to proj1.csproj called Ⅻㄨㄩ 啊阿鼾齄丂丄狚狛狜狝﨨﨩ˊˋ˙–⿻〇㐀㐁䶴䶵U1[]U2[]U3[]. Drag transform.xslt and b.xml into it.

Note: If you have the latest and greatest VS, it probably already has this change. So msbuild <pathtoproj1>.csproj on a dev command prompt should suffice. If not, do the steps below:

git clone https://github.com/dotnet/msbuild
cd msbuild
build.cmd /p:CreateBootstrap=true
cd artifacts\bin\bootstrap\net472\MSBuild\Current\Bin
msbuild.exe <pathtoproj1.csproj>

You should get the error error MSB3703: Unable to execute transformation. An error occurred while loading document 'b.xml'. See InnerException for a complete description of the error.. After you've confirmed the repro, debug into it to see the inner exception.

Open up MSBuild.Dev.slnf in VS.

On same command line:
set MSBUILDDEBUGONSTART=1
msbuild.exe <pathtoproj1.csproj>

Place a breakpoint in the line I mentioned the exception gets thrown (2 comments above)

@bgrainger
Copy link
Contributor

bgrainger commented Nov 26, 2021

IMO this seems to be caused by unexpected behaviour in System.Uri: specifically, a PUA character is not roundtripped through a file:/// URI for new Uri(x).LocalPath.

This test program tests the Uri constructor with various characters that have different UTF-8 encoded lengths; the last is a PUA character:

foreach (var name in new[] { "a", "\u00A1", "\u0800", "\U00010000", "\uE000" })
{
    var filePath = $@"C:\Temp\{name}.txt";
    var uri = new Uri(filePath);
    Console.WriteLine("{0}\t{1}\t{2}\t{3}", uri.LocalPath == filePath, filePath, uri.AbsoluteUri, uri.LocalPath);
}

The output shows that the file path does round-trip for everything except the PUA character, which is double-encoded in the URI.

True   C:\Temp\a.txt  file:///C:/Temp/a.txt  C:\Temp\a.txt
True   C:\Temp\¡.txt  file:///C:/Temp/%C2%A1.txt  C:\Temp\¡.txt
True   C:\Temp\ࠀ.txt  file:///C:/Temp/%E0%A0%80.txt  C:\Temp\ࠀ.txt
True   C:\Temp\𐀀.txt  file:///C:/Temp/%F0%90%80%80.txt  C:\Temp\𐀀.txt
False  C:\Temp\.txt  file:///C:/Temp/%25EE%2580%2580.txt  C:\Temp\%EE%80%80.txt

I haven't consulted the URI RFC to determine whether this would be considered correct. You don't get the same failure to round-trip for $@"https://example.com/{name}.txt" so I suspect not.

If System.Uri can't be changed in this instance (for back-compat concerns), then XmlUriResolver and/or XmlDownloadManager might need to work around the double-encoding?

@bgrainger
Copy link
Contributor

I wonder if System.Uri parsing for file paths goes through the IRI code path, which explicitly excludes PUA characters (as per https://datatracker.ietf.org/doc/html/rfc3987#section-2.2):

internal static bool CheckIriUnicodeRange(char unicode, bool isQuery)
{
return IsInInclusiveRange(unicode, '\u00A0', '\uD7FF')
|| IsInInclusiveRange(unicode, '\uF900', '\uFDCF')
|| IsInInclusiveRange(unicode, '\uFDF0', '\uFFEF')
|| (isQuery && IsInInclusiveRange(unicode, '\uE000', '\uF8FF'));
}

A fix would be to change this line of code as follows:

uri = new Uri(Path.GetFullPath(relativeUri!));

                        uri = new Uri(Path.GetFullPath(relativeUri!), new UriCreationOptions { DangerousDisablePathAndQueryCanonicalization = true });

However, I don't know the security implications of this, given the property name.

@krwq
Copy link
Member

krwq commented Dec 8, 2021

I guess, since you can pass your own XmlResolver possibly you can just derive and apply workaround in the new type. We should still figure out if this is XML or Uri issue. The Dangerous* setting doesn't sound like a good idea but I'm not sure what are the exact implications of that or why it was created.

@karelz do we have any Uri experts who could help with the questions above?

@jeffhandley jeffhandley modified the milestones: 7.0.0, Future Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants