-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stripNonValidXMLCharacters doesn't work with HTML where html.length() > 1 #34
Comments
@izian I can replicate the behavior that @nahsra - Are these special unicode characters supposed to get stripped by AntiSamy's default policy? |
The reason we were searching for these unusual (at least, they were unusual 10 years ago) Unicode ranges was the underlying XML parsing API blew up when encountering them. Therefore we had to add some defensive code to filter them out before processing. I agree that this code could be cleaned up a lot, and the XML dependencies could have become more sturdy since then, and that there is nominal risk from this behavior. This issue should stay open and be acted upon but isn’t a blocker for release. I appreciate the thoughtful write up. |
To be clear this issue is not a vulnerability (@nahsra's 'nominal risk' comment) . It is intended functionality that doesn't work, but the fact that it doesn't isn't a security issue. |
Thanks for checking it out. I just wanted to point out the matches / find. It’s a common trip up. Thanks for the time looking in to it |
@nahsra - Given the comment: "given that this functionality hasn’t worked in a while (unless you pass through a single character) it would likely be worse for people who have become accustomed to no character filtering if it were “fixed” rather than simply removed if no longer needed." do you think we should still try to fix this, or just leave it alone? |
@spassarop - Hey Sebastian - any. clue how to address this old issue? There are two tests cases for this issue already in the test class, but they are commented out. testGithubIssue34a() and testGithubIssue34b(). Same for issue #24 - and test case: testGithubIssue24(). If you could address either or both that would be fantastic. |
@davewichers, @izian - I looked into the function that checks de regex of "invalid characters". I've tested with I've tried with different inputs and changing the pattern but the result does not improve. I know it probably isn't a performant solution, but implementing the replacement by hand does in fact replace all characters. To remind, this is the pattern: "[\u0000-\u001F\uD800-\uDFFF\uFFFE-\uFFFF&&[^\u0009\u000A\u000D]]" I've built a string with all the supposedly invalid characters with the following code: StringBuilder s = new StringBuilder();
char c = (char)0x0;
do {
if ((c >= 0x0 && c <= 0x1f || c >= 0xd800 && c <= 0xdfff || c >= 0xfffe && c <= 0xffff) && c!=0x9 && c!=0xa && c!=0xd )
s.append(c);
c = (char)(c+1);
} while(c!=0xffff); Then I've run this assertion which leaves a single unicode character (can't remember which one but it's in the middle of a range from the pattern): assertEquals("<div>Hello</div>", as.scan("<div>Hello"+s.toString()+"</div>", policy, AntiSamy.DOM).getCleanHTML()); Then I've copied and pasted the current implementation on the .NET project for StripNonValidXmlCharacters and made the changes to compile in Java: if (in == null || ("".equals(in))) {
return ""; // vacancy test.
}
StringBuilder cleanText = new StringBuilder(); // Used to hold the output.
char current; // Used to reference the current character.
for (int i = 0; i < in.length(); i++)
{
current = in.charAt(i);
if ((current == 0x9) || (current == 0xA) || (current == 0xD)
|| ((current >= 0x20) && (current <= 0xD7FF))
|| ((current >= 0xE000) && (current <= 0xFFFD))
|| ((current >= 0x10000) && (current <= 0x10FFFF)))
{
cleanText.append(current);
}
}
return cleanText.toString(); And well, that works because it's "manual" replacing without depending on Java's builtin functions. The previous About SAX parser:
That's all I can say after some hours of analyzing this. |
Thanks Sebastian for researching! Do you think we should change/fix
AntiSamy in some way or are we already doing the best we can with the
parsers we are using?
…-Dave
On Sun, Jan 17, 2021 at 5:38 PM Sebastián Passaro ***@***.***> wrote:
@davewichers <https://github.com/davewichers>, @izian
<https://github.com/izian> - I looked into the function that checks de
regex of "invalid characters". I've tested with find() instead of
matches() and for the DOM parser it gets better results. However, when it
comes to replacing, the surrogate pair \uD83D\uDC95 which forms 💕 does
not want to be entirely replaced. Instead, one of the unicode characters
remains.
I've tried with different inputs and changing the pattern but the result
does not improve. I know it probably isn't a performant solution, but
implementing the replacement by hand does in fact replace all characters.
To remind, this is the pattern:
"[\u0000-\u001F\uD800-\uDFFF\uFFFE-\uFFFF&&[^\u0009\u000A\u000D]]"
I've built a string with all the supposedly invalid characters with the
following code:
StringBuilder s = new StringBuilder();
char c = (char)0x0;
do {
if ((c >= 0x0 && c <= 0x1f || c >= 0xd800 && c <= 0xdfff || c >= 0xfffe && c <= 0xffff) && c!=0x9 && c!=0xa && c!=0xd )
s.append(c);
c = (char)(c+1);
} while(c!=0xffff);
Then I've run this assertion which leaves a single unicode character
(can't remember which one but it's in the middle of a range from the
pattern):
assertEquals("<div>Hello</div>", as.scan("<div>Hello"+s.toString()+"</div>", policy, AntiSamy.DOM).getCleanHTML());
Then I've copied and pasted the current implementation on the .NET project
for StripNonValidXmlCharacters and made the changes to compile in Java:
if (in == null || ("".equals(in))) {
return ""; // vacancy test.
}
StringBuilder cleanText = new StringBuilder(); // Used to hold the output.
char current; // Used to reference the current character.
for (int i = 0; i < in.length(); i++)
{
current = in.charAt(i);
if ((current == 0x9) || (current == 0xA) || (current == 0xD)
|| ((current >= 0x20) && (current <= 0xD7FF))
|| ((current >= 0xE000) && (current <= 0xFFFD))
|| ((current >= 0x10000) && (current <= 0x10FFFF)))
{
cleanText.append(current);
}
}
return cleanText.toString();
And well, that works because it's "manual" replacing without depending on
Java's builtin functions. The previous find() call can prevent this
replacing from happening if it's too expensive to always do it.
About SAX parser:
- It does not have the character check, it's implemented only at DOM
parser.
- If it gets implemented for SAX, it has the same behavior as DOM. But
when the input is a Reader object, AntiSamy can't read the plain HTML
without some transformations that I hope they don't mess with any character
in it, if implemented.
That's all I can say after some hours of analyzing this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#34 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGFWBKW57DSVJONMMCIRM3S2NRGTANCNFSM4GXEVWGQ>
.
|
I can take another look to refresh my 2 year memory of this one in the morrow. I recognise that the use of this method by our developer is probably far removed from its intended use. The same developer also escaped JSON as if it were HTML. I only happened upon this library by chance chasing down my missing code point and thought it was here it was lost; it was actually another library pretending to do good for XML and it only removed one of the code points; leaving back part of a surrogate pair which destroyed the XML when it was being serialised again. maybe I could afford some time to take a look again here and see if I can follow what you’ve written and come to any understanding of a fix in line with your intention for the method; not our use :-) |
I've extended the test @izian mention before, the one to test performance, just to test locally how bad it was the solution I proposed above. Executions vary, sometimes the result in the "manual" replacement is faster that The only thing that worries me is the SAX parser with the Reader object, because of the default encoding for reading it, do the cleansing and make the object again without losing something. Maybe I'm worrying when I shouldn't but I don't know that. However, I've built the test string with all unicode characters that should be stripped also to make another test. The test is to don't strip at all. I've commented the line where cleansing occurrs and the library gave an output, so it didn't blow up. Maybe after the NekoHTML updates that is no longer a problem and we can remove the invalid XML characters validation. I just give that information, I prefer not to make that decision. |
It looks like somewhere around version 1.5 that the method
org.owasp.validator.html.scan.AntiSamyDOMScanner#stripNonValidXMLCharacters
was altered to check if thePattern
forinvalidXmlCharacters
java.util.regex.Matcher#matches()
.I presume that was in a bid for efficiency to cut down on a
replaceAll
method call if there was no need to affect theString
input.You use the same technique in a test to check if there are time improvements made.
I believe it should use
java.util.regex.Matcher#find()
instead.matches()
checks if the entire sequence matches the pattern. Since the pattern represents only a single character, in effect, that can be in one of the defined sets, then if the sequence (the HTML) is longer than 1 character it can never match. It's been this way since forever I believe, at least Java 5.find()
will find the next subsequence that matches the pattern, in effect checking quickly and succeeding fast if the HTML needs to be cleansed.You're getting a speed increase because
matches()
is getting to the secondchar
and declaring the sequence as a non-match regardless.Input HTML:
<div>Hello\uD83D\uDC95</div>
Expected on
org.owasp.validator.html.CleanResults#getCleanHTML
:<div>Hello</div>
Actual:
<div>Hello\uD83D\uDC95</div>
Where input HTML is single character
\uD888
only then the output is the empty string as expected.I looked through the test class here and can see no tests where you are expecting data to be cleansed. All the tests ensure that characters make it through ok or that something is faster (checking only 1 char is faster!)
Incidentally, I only noticed this since the Antisamy code looked to want to cleanse the characters needed for an emoji, where the character is actually valid in XML and HTML spec so far as I can tell, when their UTF-8 bytes are read by our system we get a Java representation in 16 bit
char
underpinning theString
and the character points fall within your filter and, although I don't believe you should be stripping those if they come together and according tojava.lang.Character#isSurrogatePair
the two\uD83D \uDC95
together returntrue
rather thanfalse
and thetoCodePoint
method tells us that it's💕
. So I think the checks in this method ought to be more complex.Ironically, if the code in this method worked as intended then the characters would have been cleansed away. But they weren't.
I believe you could get manipulative code points through now, because of this. But I can't be certain as I'm looking purely from a data cleansing point of view.
The text was updated successfully, but these errors were encountered: