Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent whitespace treatement #35

Open
k0001 opened this issue Nov 22, 2017 · 2 comments
Open

Inconsistent whitespace treatement #35

k0001 opened this issue Nov 22, 2017 · 2 comments

Comments

@k0001
Copy link

k0001 commented Nov 22, 2017

Leading whitespace treatment in xmlhtml is inconsistent.

Consider the following behavior of parseXML (the same happens with parseHTML) where leading whitespace is always dropped, and trailing whitespace is always kept:

> parseXML "x" ""
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = []})
> parseXML "x" " "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = []})
> parseXML "x" "x"
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x"]})
> parseXML "x" " x"
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x"]})
> parseXML "x" "x "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x "]})
> parseXML "x" " x "
Right (XmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [TextNode "x "]})

See what happens, however, when the “leading whitespace” comes after some element:

> parseXML "x" "<a/> b "
Right (HtmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [Element {elementTag = "a", elementAttrs = [], elementChildren = []},TextNode " b "]})

These two examples behave differently, and I think the correct behavior is the one from the latter example, since xmlhtml should not be discarding the contents of a text node.

So, my proposal is:

  • Keep the behavior of leading whitespace after an element as it is today.

  • Keep the behavior of trailing whitespace everywhere as it is today.

  • Fix top-level text node parsing so that it doesn't discard leading whitespace.

@mightybyte
Copy link
Member

@cdsmith Any thoughts on this? I went back a ways to commit 7779316 to see if this behavior might have been related to this issue snapframework/heist#8. But we still had the current behavior back then, so at the moment I can't think of any reason not to fix this.

@cdsmith
Copy link
Member

cdsmith commented Nov 23, 2017

Yep, I have thoughts.

We have always dropped text at the beginning of a document. Code for the XML case is at https://github.com/snapframework/xmlhtml/blob/master/src/Text/XmlHtml/XML/Parse.hs#L336, where whiteSpace is explicitly ignored because we assume we're still in the prologue of the document. Fixing it would be a matter of changing whiteSpace to be a Parser Text, and then producing the result as a TextNode inside of the rule for misc. Changes for HTML would be analogous.

The original decision was caught up in the complex mess of relationships between several goals:

  1. For simplicity, we chose not to represent certain syntax, like <?xml ... > and <!DOCTYPE ... >, directly in the node tree. Instead, it's been moved to a header.
  2. It's nice if we don't write ugly output, such as you'd get by jamming those <?xml ... ?> and <!DOCTYPE ... > together with themselves or with the document content on the same lines. This includes both re-rendering documents that were parsed from nicely formatted input, and also simple documents constructed programmatically with something like a single element.
  3. It's also nice if parse . render == id, because it given meaning to the Document value. (Note, however, that the inverse render . parse == id definitely cannot hold.)

It turns out it's hard to satisfy all of these constraints at once. Because the xml decl and doctype aren't in the node tree, a naive implementation would run them together at the beginning of the document like

<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE html><sometag>

To avoid this, we choose to always write a newline after the xml decl and DOCTYPE lines. But if that were parsed as text in the document, then parse . render adds a newline to the Document value that wasn't there before. Oops! To fix this, we currently ignore whitespace until the first element or non-whitespace text (which is conveniently also the moment when DOCTYPE is no longer allowed.)

I'm not sure this is the right balance. And in fact, we don't even preserve all those nice properties, because it's possible to manually construct a Document starting with a TextNode that's only whitespace. We used to work around that by rendering it as an entity, which got you out of the document prologue, and caused it to not be ignored, but #17 broke that solution. (Incidentally, I should have brought this up in the context of #17 itself. I mistakenly thought that if it broken the parse . render property, then it would break a unit test; but I failed to anticipate that someone would dodge the test failure by treating some whitespace characters differently from others. The rerender test should have failed for https://github.com/snapframework/xmlhtml/blob/master/test/resources/oasis/p27fail1.xml -- but the pull request only changed the behavior for newline characters, and the test case used a space instead. 😢)

Maybe there's a better way to reconcile all these nice goals with more special-case rules. Maybe since one of them has been broken for a year and a half in some corner cases anyway, it's time to just give up on the parse . render property entirely, and let parse/render cycles start adding more redundant newlines each time around the loop. I'm okay with whatever behavior people want.

By the way, changing this is not a backward-compatible change. It's among the more benign sorts of non-backward-compatible changes in that code that breaks was probably a little broken already; but still, it's possible someone is doing something like just grabbing the first node of a parsed Document, and it will now no longer be an Element.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants