Add guess page layout #9
Conversation
The current implementation is ~20% faster than the previous one. I tested extraction on 5 different webpages with a lot of text (articles or forums); the average time for the old selector-based approach is 0.012s, while the new one takes 0.010s.
Hey @Kebniss! Discussion about punctuation handling is here: #2.
So for me it looks like all failures are real.
@@ -47,45 +49,71 @@ def parse_html(html):
_has_punct_before = re.compile(r'\($').search


def selector_to_text(sel, guess_punct_space=True):
This function was useful - the main use case is to extract text from a part of a web page, finding this part using Scrapy or parsel.
You can get a parsed tree for a selector using sel.root
Ok, so you want to create a selector in extract_text(html) and then apply traverse_text_fragments on sel.root, correct?
If selector_to_text is supported, cleaned_selector is also nice to have.
@Kebniss no, extract_text doesn't need to use Selector, it is an additional overhead. The idea is to be backwards compatible and provide the same feature for Selector; internally it can work the other way around - likely selector_to_text should pass sel.root to html_to_text.
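A minimal sketch of that layering, assuming parsel-style selectors. The FakeSelector stand-in and the simplified text join are illustrations only, not the library's real code; the point is that selector_to_text becomes a thin wrapper that hands sel.root to html_to_text:

```python
# Hypothetical sketch: selector_to_text stays a thin compatibility
# wrapper and delegates to html_to_text via the selector's lxml root.
import lxml.html


def html_to_text(tree, guess_punct_space=True):
    # Stand-in for the real tree traversal: join all non-empty text nodes.
    return ' '.join(t.strip() for t in tree.itertext() if t.strip())


def selector_to_text(sel, guess_punct_space=True):
    # parsel selectors expose the parsed lxml element as .root,
    # so no extra Selector machinery is needed inside html_to_text.
    return html_to_text(sel.root, guess_punct_space=guess_punct_space)


class FakeSelector:
    # Minimal stand-in for parsel.Selector, used only for this demo.
    def __init__(self, html):
        self.root = lxml.html.fromstring(html)


print(selector_to_text(FakeSelector(u'<p>Hello</p> <p>world!</p>')))
```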
Can't we have a tail_text argument in html_to_text? When it is False, tail extraction is skipped, but only at the top level (i.e. text is still extracted from children tails). It can be False by default - I don't see why anyone would want to extract text from the tail of the root element.
Alternatively, we can just not extract text from the element tail by default, at the top level (i.e. children should have tail text processed as usual).
In the common case (root <html> element) there shouldn't be any text in the tail. And when the user passes another element explicitly, extracting text from the element tail is likely undesirable - these are the same issues as with Selectors.
You are right - tail text is outside the selected nodes, so it should not be extracted. Not extracting it by default seems reasonable. I will add the root object as an argument to check when the recursive call is processing it.
Why do you need a root object - can't it just be a boolean flag process_tail in some internal function?
Yes, the root node is unnecessary. I added a depth argument so that we know when the recursion is back at the root and does not extract tail text there.
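A sketch of that depth-based check (the traversal is simplified and the names are assumptions; the real traverse_text_fragments does much more). The tail is skipped only at depth 0, i.e. for the element the user passed in, while children tails are still collected:

```python
# Hypothetical sketch: skip tail text only for the root of the traversal.
import lxml.html


def _traverse(node, parts, depth):
    if node.text:
        parts.append(node.text)
    for child in node:
        _traverse(child, parts, depth + 1)
    # Tail text lives *outside* the node, so it is skipped at depth 0.
    if depth > 0 and node.tail:
        parts.append(node.tail)


def element_to_text(node):
    parts = []
    _traverse(node, parts, depth=0)
    return ''.join(parts).strip()


tree = lxml.html.fromstring(u'<div><p>inside</p> outside</div>')
p = tree.find('p')
print(element_to_text(p))     # the tail ' outside' is not extracted
print(element_to_text(tree))  # children tails are still included
```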
html_text/html_text.py
Outdated
def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
    """ Convert a cleaned html tree to text.
        See html_text.extract_text docstring for description of the approach
        and options.
Formatting is off - the text in the docstring is indented too much (not PEP 8 style).
html_text/html_text.py
Outdated
and not _has_punct_before(prev) | ||
) | ||
) | ||
) |
I think it can be more readable as several ifs, e.g.

if prev is None:
    return False
if _has_trailing_whitespace(prev):
    return False
if _has_punct_after(text) or _has_punct_before(prev):
    return False
return True
or, when converted to a single expression with or and not:

return not (
    prev is None
    or _has_trailing_whitespace(prev)
    or _has_punct_after(text)
    or _has_punct_before(prev)
)
👍
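A self-contained demo of the heuristic discussed above; the regex patterns are approximations for illustration and may differ from the library's actual ones. A space is inserted between two adjacent text fragments unless existing whitespace or punctuation makes it unnecessary:

```python
# Demo of the guess_punct_space heuristic (patterns are assumptions).
import re

_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?")]').search
_has_punct_before = re.compile(r'\($').search


def should_add_space(prev, text):
    # Add a space between fragments unless whitespace/punctuation
    # already makes it unnecessary.
    return not (
        prev is None
        or _has_trailing_whitespace(prev)
        or _has_punct_after(text)
        or _has_punct_before(prev)
    )


print(should_add_space('Hello', 'world'))    # separate words: add a space
print(should_add_space('Hello', ', world'))  # punctuation follows: no space
```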
Codecov Report
@@ Coverage Diff @@
## master #9 +/- ##
==========================================
- Coverage 100% 98.05% -1.95%
==========================================
Files 2 2
Lines 42 103 +61
Branches 6 28 +22
==========================================
+ Hits 42 101 +59
- Misses 0 2 +2
Continue to review full report at Codecov.
html_text/html_text.py
Outdated
def html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
I think we should keep this function private - rename it to _html_to_text and remove it from the __init__ exports. The API becomes rather confusing otherwise: there is html_text.extract_text and html_text.html_to_text, which do almost the same thing - extract_text supports html as a string in addition to lxml trees, and also does cleaning on its own; html_to_text only works on lxml trees and does no cleaning.
If allowing to pass an already cleaned tree is important, we can add an argument to extract_text - though this can be done later.
I think the only reason for not wanting the html to be cleaned is performance. Let's make _html_to_text private, and later I can check performance with and without cleaning to see if there is a difference.
@Kebniss right, though let's not worry about adding a clean=True argument now. Main use cases are:
- the user has an already cleaned tree and wants to save some time by avoiding cleaning again;
- the user wants to change cleaning options.
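The public/private split being discussed can be sketched as follows. The names _clean and _html_to_text, and the simplified cleaning (just dropping script/style subtrees instead of lxml's full Cleaner), are assumptions for illustration:

```python
# Hypothetical sketch: public extract_text accepts a string or an lxml
# tree, cleans it, then delegates to the private _html_to_text.
import lxml.html


def _clean(tree):
    # Simplified stand-in for lxml's Cleaner: drop script/style subtrees.
    for el in tree.xpath('//script | //style'):
        el.drop_tree()
    return tree


def _html_to_text(tree, guess_punct_space=True, guess_page_layout=False):
    # Stand-in for the real tree traversal.
    return ' '.join(t.strip() for t in tree.itertext() if t.strip())


def extract_text(html, guess_punct_space=True, guess_page_layout=False):
    # Public entry point: parse strings, always clean, then extract.
    if isinstance(html, str):
        html = lxml.html.fromstring(html)
    return _html_to_text(_clean(html), guess_punct_space, guess_page_layout)


print(extract_text(u'<p>Hello</p><script>var x;</script><p>world!</p>'))
```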
html_text/html_text.py
Outdated
@@ -85,7 +147,7 @@ def cleaned_selector(html):
    return sel


-def extract_text(html, guess_punct_space=True):
+def extract_text(html, guess_punct_space=True, guess_page_layout=False, new=True):
The new argument is unused and undocumented.
@@ -98,5 +160,7 @@ def extract_text(html, guess_punct_space=True):

    html should be a unicode string or an already parsed lxml.html element.
The guess_page_layout argument should be documented.
html_text/html_text.py
Outdated
@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
DOUBLE_NEWLINE_TAGS = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
These lists are used for lookups; even though they're short, I think it is cleaner and faster to have them as sets.
In [1]: x = ['title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
In [2]: x_set = {'title', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
In [3]: %timeit 'foo' in x
162 ns ± 0.398 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [4]: %timeit 'foo' in x_set
39.8 ns ± 0.153 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
^^ lookup in a set is ~4x faster even when the list is that short; not a lot, but why not do that :)
It'd also be nice to allow overriding double_newline_tags and newline_tags in extract_text; these constants would then just be defaults (use frozenset instead of set if you do so). A use case: you want to extract text from a particular website and know that e.g. a div element should add a new line. You then write

extract_text(html, guess_page_layout=True, newline_tags=NEWLINE_TAGS | {'div'})
html_text/html_text.py
Outdated
@@ -7,6 +7,9 @@
import parsel


NEWLINE_TAGS = ['li', 'dd', 'dt', 'dl', 'ul', 'ol']
Shouldn't we add table tags like tr and th as well? Check e.g. http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html - currently product information is on the same line.
It'd also be nice to add a few realistic tests: a few examples of HTML pages and their text output (in separate files, for readability). Text should be extracted with guess_page_layout=True. I think this would allow us to detect regressions / changes in the output better, and also find cases which are not handled properly.
To clarify: which HTML tags are not in these two lists because they shouldn't be there, and which are not there because we're not handling them yet? Which of the tags from https://developer.mozilla.org/en-US/docs/Web/HTML/Element have you checked? Maybe it makes sense to handle more of them?
            guess_page_layout=guess_page_layout)
        if extracted:
            text.append(extracted)
    return ' '.join(text)
This is to keep it working like the previous implementation; however, I think it would make more sense to return a list of the text extracted by each selector. That way the user can decide whether and how to join it. Maybe they need the texts as separate entities, and that's why they selected specific elements.
I agree it is not clear which behavior is better; let's keep the current one, but have this problem in mind.
This might be unrelated, but I wonder if we add a space at place of
I haven't checked everything, but a question about the benchmarks: why is selector_to_text slower than extract_text? I expected it to be the other way around, as selector_to_text shouldn't parse html.
In the benchmark for the selector I included the time to create the selector. Without it, the average execution time for the new implementation is 0.00372s, while the old implementation takes 0.00821s. In this case the new implementation is ~55% faster than master.
I can add this in a following PR.
Thanks, that would be great if this does not blow up the scope :)
README.rst
Outdated
>>> text = html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
u'Hello
world!'
These examples look off (actually, previous example was not good as well). I think it should look like this if you try it in a Python console:
>>> html_text.extract_text(u'<h1>Hello</h1> world!', guess_page_layout=True)
'Hello\n\nworld!'
The examples below have the same issue: it shouldn't be text = html_text...., just html_text...., otherwise there is no output.
README.rst
Outdated
* DOUBLE_NEWLINE_TAGS = frozenset([
      'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
      'p', 'pre', 'title', 'ul'
  ])
I think it makes sense to just say the constants are html_text.NEWLINE_TAGS and html_text.DOUBLE_NEWLINE_TAGS (and maybe expose them at the top level) - copy-pasting these lists here requires maintenance; it is easy to forget to update the README when making a code change.
html_text/html_text.py
Outdated
@@ -96,7 +185,22 @@ def extract_text(html, guess_punct_space=True):
    for punctuation. This has a slight (around 10%) performance overhead
    and is just a heuristic.

    When guess_page_layout is True (default is False), a newline is added
    before and after NEWLINE_TAGS and two newlines are added before and after
I think it is more precise to say newline_tags instead of NEWLINE_TAGS - "a newline is added before and after newline_tags" - and to remove the note below ("NEWLINE_TAGS and DOUBLE_NEWLINE_TAGS can be customized."): users shouldn't be changing NEWLINE_TAGS, they should be passing newline_tags arguments, e.g.

html_text.extract_text(html, guess_page_layout=True, newline_tags=html_text.NEWLINE_TAGS | {'div'})

^^ maybe we should even provide this example somewhere; adding div may be a common thing.
I will add an example in the README, but div is already included in NEWLINE_TAGS. I will use a different tag for clarity :)
Or do you want me to add a test?
Hm, that's not a bad idea - let's do both.
I tested the new extraction algorithm on 1200 htmls and timed extraction performance:

extract_text(html, guess_page_layout=True, guess_punct_space=True):
html_text.selector_to_text(sel, guess_punct_space=True, guess_page_layout=True):

Did the same with the old extraction algorithm:

extract_text(html, guess_punct_space=True):
text = html_text.selector_to_text(sel, guess_punct_space=True):

HTML extraction speed has improved by ~20%, while selector extraction improved by ~6%.
TODO: