Selector.root is not an instance of lxml.html.HtmlElement even if parser is html #40

kmike · 2016-06-16T02:29:49Z

I'm trying to use lxml.html.clean.Cleaner without parsing the response multiple times:

import parsel
from lxml.html.clean import Cleaner

cleaner = Cleaner()
sel = parsel.Selector("<html><body><style>.p {width:10px}</style>hello</body></html>")
cleaner.clean_html(sel.root)

This doesn't work because Cleaner needs an lxml.html.HtmlElement instance, while Selector.root is always an lxml.etree._Element, so it doesn't have the required .rewrite_links method.

Why is lxml.etree.HTMLParser used for html and not lxml.html.HTMLParser?
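
To illustrate the difference described above, here is a minimal sketch that uses lxml directly rather than parsel. The variable names (markup, plain_root, html_root) are only for illustration. It parses the same markup with both parser classes and checks the type of the resulting root element:

```python
from lxml import etree, html

markup = "<html><body><style>.p {width:10px}</style>hello</body></html>"

# Parsed with the generic etree parser: the root is a plain etree._Element.
plain_root = etree.fromstring(markup, parser=etree.HTMLParser())

# Parsed with the lxml.html parser: the root is an lxml.html.HtmlElement,
# which carries the HTML-specific helpers such as rewrite_links().
html_root = etree.fromstring(markup, parser=html.HTMLParser())

print(type(plain_root), hasattr(plain_root, "rewrite_links"))
print(type(html_root), hasattr(html_root, "rewrite_links"))
```

Only the second root should be usable with code that expects an HtmlElement, which is the situation Cleaner runs into here.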


redapple · 2016-06-16T08:36:30Z

@kmike, there was some discussion in the past about moving to lxml.html.HTMLParser
scrapy/scrapy#559 (comment)
but it was left at that, pending a performance comparison

rmax · 2016-06-23T18:14:54Z

@kmike @redapple Related #41

kmike · 2016-08-09T10:40:56Z

Results of another unscientific benchmark (Python 3.5.1, OS X, latest lxml): https://gist.github.com/kmike/af647777cef39c3d01071905d176c006. It seems there is a 1-5% penalty for using lxml.html.HTMLParser, based on ~3700 random pages. My vote is to make HTMLParser the default.

kmike · 2016-08-10T12:39:47Z

Another way to look at it: the additional time required for using HTMLParser is about 0.0001s per web page on average (3700 pages, total parsing time increased by 0.4s).
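
For anyone who wants to repeat this kind of comparison, a rough sketch follows. It is not the benchmark from the gist: SAMPLE_PAGE is a synthetic page and parse_with is a hypothetical helper, so the timings will not match the numbers above. The last lines only redo the per-page arithmetic from the figures quoted in this comment.

```python
import timeit

from lxml import etree, html

# Synthetic stand-in for a real page (assumption: the original run used ~3700 crawled pages).
SAMPLE_PAGE = "<html><body>" + "<p>hello <b>world</b></p>" * 200 + "</body></html>"


def parse_with(parser_cls):
    """Parse SAMPLE_PAGE with a fresh instance of the given parser class."""
    return etree.fromstring(SAMPLE_PAGE, parser=parser_cls())


runs = 200
etree_seconds = timeit.timeit(lambda: parse_with(etree.HTMLParser), number=runs)
html_seconds = timeit.timeit(lambda: parse_with(html.HTMLParser), number=runs)

print("etree.HTMLParser: %.4fs for %d parses" % (etree_seconds, runs))
print("html.HTMLParser:  %.4fs for %d parses" % (html_seconds, runs))

# Per-page overhead implied by the figures in the comment: 0.4s extra over 3700 pages.
per_page_overhead = 0.4 / 3700
print("reported overhead per page: %.6fs" % per_page_overhead)
```

A single synthetic page says little about real-world variance, so results from a sketch like this should be read as a sanity check only.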

eliasdorneles · 2016-08-10T12:42:01Z

yeah, let's just do it! 👍

eliasdorneles · 2016-08-10T13:21:41Z

@kmike what do you think? #54

eliasdorneles · 2016-11-21T16:58:12Z

This is fixed since #63:

>>> import parsel
>>> sel = parsel.Selector(u'<html><body><h1>oi</h1></body></html>')
>>> type(sel.root)
lxml.html.HtmlElement

Thanks @kmike !

kmike mentioned this issue Jun 17, 2016

Added text_content() method to selectors. #34

Closed

eliasdorneles mentioned this issue Jun 23, 2016

[MRG+1] Expose lxml.html.HTMLParser as an optional parser. #41

Closed

eliasdorneles mentioned this issue Aug 9, 2016

Get lxml node (HtmlElement) #53

Closed

eliasdorneles mentioned this issue Nov 14, 2016

[MRG+1] Change default parser to html.HTMLParser #63

Merged

eliasdorneles closed this as completed Nov 21, 2016

barrio mentioned this issue Apr 30, 2024

Parsel import causes crash #294

Closed
