Use the HTML5 parser from html5lib when it's available #12

kennyluck · 2012-09-02T02:26:02Z

I am not particularly happy about the namespace kludge myself, but I'll post this patch here just in case Simon wants to take it. It should also be useful for anyone else who wants to test WeasyPrint's standards-compliance without falling into some HTML parsing trap.

The kludge can be removed whenever one of the following applies:

lxml supports XPath 2.0's namespace wildcard *:html and cssselect outputs it.
cssselect outputs XPath of the form *[local-name() = "html"]

I am not prepared to go into hacking lxml/cssselect so I'll stop here.
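To illustrate the namespace trap described above, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without extra packages). It only shows why a selector on a bare element name misses namespaced elements and how a namespace wildcard avoids that; it is not what cssselect generates.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A tiny document whose elements all live in the XHTML namespace,
# similar to what the thread describes for html5parser output.
doc = ET.fromstring(
    f'<html xmlns="{XHTML_NS}"><body><p>Hello</p></body></html>'
)

# Tags are stored as '{namespaceURI}localname', so a bare name finds nothing.
plain = doc.findall(".//p")

# The '{*}' wildcard (Python 3.8+) matches the local name in any namespace.
wildcard = doc.findall(".//{*}p")

print(doc.tag)
print(len(plain), len(wildcard))
```

The `{*}` wildcard here is ElementTree's own syntax; it plays a role comparable to the `*:html` wildcard mentioned above, but the two are not the same feature.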

SimonSapin · 2012-09-04T02:24:03Z

The namespacing stuff is a separate issue in cssselect. For the HTML parser, I'd rather not switch silently, but only use html5parser when explicitly requested by the user (and then fail if it cannot be imported). I'm open to ideas for the specific API: maybe a subclass or an additional parameter.
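One possible shape for the "additional parameter" idea, as a rough sketch only: the parser is chosen by an explicit argument, and a missing dependency surfaces as an `ImportError` instead of a silent fallback. The names `PARSER_MODULES` and `load_parser_module` are hypothetical and are not part of WeasyPrint's API.

```python
import importlib

# Hypothetical registry: parser name -> module that provides it.
PARSER_MODULES = {
    "lxml": "lxml.html",
    "html5lib": "lxml.html.html5parser",
}


def load_parser_module(parser="lxml"):
    """Import and return the module for an explicitly requested parser."""
    try:
        module_name = PARSER_MODULES[parser]
    except KeyError:
        raise ValueError(f"unknown parser {parser!r}") from None
    # Deliberately no try/except here: if the module (or html5lib itself)
    # is not installed, the ImportError reaches the caller.
    return importlib.import_module(module_name)
```

A caller would opt in with something like `load_parser_module("html5lib")`; the default stays with the existing parser, so nothing changes for users who do not ask for it.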

SimonSapin · 2012-09-09T22:05:00Z

Could you provide an HTML test case where html5parser parses differently from lxml.html?

SimonSapin · 2012-09-11T11:05:34Z

Oh, I had only seen half of the issue. We can easily add support for a different HTML parser in WeasyPrint, but html5parser is useless because of the namespace issues in cssselect. This is fixable in cssselect, just low priority for me right now.

In the meantime, this is a better way to do the same, without patching WeasyPrint:

import lxml.html.html5parser
import weasyprint

tree = lxml.html.html5parser.parse('http://example.net')  # Or whatever
for element in tree.iter():
    # Comments and other non-element stuff have a non-string "tag"
    if hasattr(element.tag, 'split'):
        # lxml uses the '{namespaceURI}localname' syntax
        element.tag = element.tag.split('}')[-1]
html = weasyprint.HTML(tree=tree, base_url='http://example.net')
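The same tag-stripping step can be tried without network access or third-party packages. The sketch below wraps it in a helper, hypothetically named `strip_namespaces`, and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml; both store namespaced tags in the `{namespaceURI}localname` form.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite '{namespaceURI}localname' tags to 'localname', in place."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(element.tag, str):
            element.tag = element.tag.rpartition("}")[2]
    return root


root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Hello</p></body></html>"
)
root.append(ET.Comment("not an element"))

strip_namespaces(root)
print([el.tag for el in root.iter() if isinstance(el.tag, str)])
```

Attribute names are left alone in this sketch; only element tags are rewritten, as in the snippet above.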

SimonSapin · 2012-09-11T11:06:09Z

Leaving this open: I still want to support various parsers eventually.

Switch to html5lib to parse HTML. Fix #12.

* Do not use element.base_url which only exists in lxml.html.HtmlElement
* Use lxml.etree.HtmlParser instead of lxml.html

This is one step toward using the html5lib parser, but see Kozea#12

kennyluck added 2 commits August 31, 2012 20:08

Use the HTML5 parser from html5lib when it's available.

06b2f36

clean up

0366536

SimonSapin mentioned this pull request Nov 6, 2012

Support short meta charset tag #15

Closed

SimonSapin mentioned this pull request Mar 28, 2013

Image missing at start of document #65

Closed

SimonSapin added a commit that referenced this pull request Jul 22, 2013

Switch to html5lib to parse HTML. Fix #12.

f266847

SimonSapin mentioned this pull request Jul 22, 2013

Switch to html5lib to parse HTML. Fix #12. #112

Merged

SimonSapin closed this in 4069a1c Oct 17, 2013

liZe added a commit that referenced this pull request Oct 17, 2013

Merge pull request #112 from Kozea/html5lib

82a188e

Switch to html5lib to parse HTML. Fix #12.
