Add an option to guess page layout (try to preserve some of the formatting) #11

Merged
merged 54 commits into from
Sep 25, 2018
Commits
0ae6d24
add first working approach plus debug code
Kebniss Aug 24, 2018
566dc9b
add newline only at the end of selected tags
Kebniss Aug 24, 2018
587e9a7
fix multiple consecutive newlines
Kebniss Aug 27, 2018
6c9d27e
add guess_space = False option
Kebniss Aug 27, 2018
c22f3fa
move add space and newline checks to a function
Kebniss Aug 28, 2018
8a78fc5
add tests guess_page_layout
Kebniss Aug 28, 2018
a783e31
remove old test
Kebniss Aug 29, 2018
cb8dc1c
guess_punct_space = False behavior same as before this PR
Kebniss Aug 30, 2018
fb599bc
fix tests
Kebniss Aug 30, 2018
90e37b7
fixed tests
Kebniss Aug 30, 2018
ae26d29
fix indent and make add_space more readable
Kebniss Aug 30, 2018
bb33d4b
add double newline before and after title, p and h tags
Kebniss Aug 31, 2018
3069a73
by default tail of root node will not be extracted
Kebniss Sep 6, 2018
dd03201
add test
Kebniss Sep 6, 2018
0f2fb2b
fix indentation
Kebniss Sep 7, 2018
e8da507
newline tags as set and extendable, add new features comments, delete…
Kebniss Sep 7, 2018
0b9d139
make html_to_text private, fix its signature
Kebniss Sep 8, 2018
ba7cdc0
add new tags to handle
Kebniss Sep 8, 2018
952d895
handle more tags
Kebniss Sep 10, 2018
9dafbf0
remove cleaning of inline tags
Kebniss Sep 11, 2018
b3229d6
fix bug with multiple newlines
Kebniss Sep 11, 2018
695b458
remove newline
Kebniss Sep 11, 2018
03259b9
add test html without text
Kebniss Sep 11, 2018
cba531f
fix newline + space bug
Kebniss Sep 11, 2018
9811349
add bad punct test
Kebniss Sep 11, 2018
d47138c
add newline
Kebniss Sep 11, 2018
76f9028
add tests on real webpages
Kebniss Sep 11, 2018
05c7702
tests to hopefully make codecov happy
Kebniss Sep 11, 2018
4505e24
remove pathlib import
Kebniss Sep 11, 2018
a27e4c8
fix test
Kebniss Sep 11, 2018
b926c8c
remove space
Kebniss Sep 12, 2018
73f49ad
handle list of selectors
Kebniss Sep 19, 2018
15d22e0
a list of selectors returns a list of texts
Kebniss Sep 19, 2018
8f68b2c
selectors_to_text add to res only if something is extracted
Kebniss Sep 20, 2018
cf02b94
selectors_to_text merge results as in previous implementation
Kebniss Sep 20, 2018
7aec8d2
update readme
Kebniss Sep 20, 2018
7653bf9
update history
Kebniss Sep 20, 2018
4300fe6
update readme
Kebniss Sep 20, 2018
4772061
update readme and add newline personalization tests
Kebniss Sep 20, 2018
05b979a
change documentation
Kebniss Sep 20, 2018
ad95bff
DOC cleanup README
kmike Sep 21, 2018
59d2d54
DOC cleanup function docstring
kmike Sep 21, 2018
51947d4
revert formatting change
kmike Sep 21, 2018
1370647
minor cleanup
kmike Sep 21, 2018
ab3f776
add pytest files to gitignore
kmike Sep 24, 2018
22a7fa1
refactor _html_to_text function for readability:
kmike Sep 24, 2018
e161e92
bikeshedding: rename guess_page_layout option to guess_layout, for br…
kmike Sep 24, 2018
8b466f8
TST mark test as xfail, change desired output
kmike Sep 24, 2018
729e11a
cleanup: comments in unclear places
kmike Sep 25, 2018
2973ee0
cleanup: remove unnecessary escaping in regex
kmike Sep 25, 2018
607b04a
backwards incompatible: make guess_layout=True by default
kmike Sep 25, 2018
13394ba
typo fix in comment
kmike Sep 25, 2018
7a1b57b
make it clear "\n" and "\n\n" are constants which can't come from ele…
kmike Sep 25, 2018
732c87d
remove PY3-only assert
kmike Sep 25, 2018
1 change: 1 addition & 0 deletions .gitignore
@@ -44,6 +44,7 @@ nosetests.xml
coverage.xml
*,cover
.hypothesis/
.pytest_cache

# Translations
*.mo
14 changes: 14 additions & 0 deletions CHANGES.rst
@@ -2,6 +2,20 @@
History
=======

0.4.0 TBD
------------------

This is a backwards-incompatible release: by default html_text functions
now add newlines after elements, if appropriate, to make the extracted text
look more like how it is rendered in a browser.

To turn it off, pass ``guess_layout=False`` option to html_text functions.

* ``guess_layout`` option to make extracted text look more like how
  it is rendered in a browser.
Contributor

Shall we enable it by default? There is no speed hit and the text looks nicer (almost always).

Contributor Author

Yes, I also wonder about this. Let's enable it by default.

* Add tests of layout extraction for real webpages.


0.3.0 (2017-10-12)
------------------

66 changes: 50 additions & 16 deletions README.rst
@@ -25,10 +25,12 @@ How is html_text different from ``.xpath('//text()')`` from LXML
or ``.get_text()`` from Beautiful Soup?
Text extracted with ``html_text`` does not contain inline styles,
javascript, comments and other text that is not normally visible to the users.
It normalizes whitespace, but is also smarter than ``.xpath('normalize-space()')``,
adding spaces around inline elements too
It normalizes whitespace, but is also smarter than
``.xpath('normalize-space()')``, adding spaces around inline elements
(which are often used as block elements in html markup),
and tries to avoid adding extra spaces for punctuation.
tries to avoid adding extra spaces for punctuation and
can add newlines so that the output text looks like how it is rendered in
browsers.
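
A minimal sketch of the spacing problem this README paragraph describes. It is stdlib-only and uses `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the snippet runs without third-party packages); the sample markup is invented for illustration. Concatenating raw text nodes glues adjacent inline elements together, while stripping each fragment and joining with a space, as the pre-PR fallback branch did, keeps them apart.

```python
import re
import xml.etree.ElementTree as ET

# Invented markup: inline <span> elements used as if they were blocks.
markup = '<div><span>Price:</span><span>10</span>\n  <span>USD</span></div>'
root = ET.fromstring(markup)

# Naive approach: concatenate every text node as-is.
naive = ''.join(root.itertext())

# Strip each fragment, drop empty ones, join with one space,
# then collapse any remaining whitespace runs.
fragments = (t.strip() for t in root.itertext())
spaced = re.sub(r'\s+', ' ', ' '.join(t for t in fragments if t))

print(repr(naive))   # 'Price:10\n  USD'
print(repr(spaced))  # 'Price: 10 USD'
```

The punctuation and layout heuristics added in this PR go further than this; the sketch only shows why spaces between inline elements matter.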

Apart from just getting text from the page (e.g. for display or search),
one intended usage of this library is for machine learning (feature extraction).
@@ -56,26 +58,58 @@ Usage
Extract text from HTML::

    >>> import html_text
    >>> text = html_text.extract_text(u'<h1>Hey</h1>')
    u'Hey'
    >>> html_text.extract_text('<h1>Hello</h1> world!')
    'Hello\n\nworld!'

    >>> html_text.extract_text('<h1>Hello</h1> world!', guess_layout=False)
    'Hello world!'



You can also pass an already parsed ``lxml.html.HtmlElement``:

    >>> import html_text
    >>> tree = html_text.parse_html(u'<h1>Hey</h1>')
    >>> text = html_text.extract_text(tree)
    u'Hey'
    >>> tree = html_text.parse_html('<h1>Hello</h1> world!')
    >>> html_text.extract_text(tree)
    'Hello\n\nworld!'

Passed html will be first cleaned from invisible non-text content such
as styles, and then text would be extracted.
Two functions that do it are ``html_text.cleaned_selector`` and
``html_text.selector_to_text``:
Or define a selector to extract text only from specific elements:

* ``html_text.cleaned_selector`` accepts html as text or as ``lxml.html.HtmlElement``,
and returns cleaned ``parsel.Selector``.
* ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns extracted
text.
    >>> import html_text
    >>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!')
    >>> subsel = sel.xpath('//h1')
    >>> html_text.selector_to_text(subsel)
    'Hello'

The passed HTML is first cleaned of invisible non-text content such
as styles, and then the text is extracted.
NB: selectors are not cleaned automatically; you need to call
``html_text.cleaned_selector`` first.
Contributor

Wow, I didn't realize this, 👍


Main functions:

* ``html_text.extract_text`` accepts html and returns extracted text.
* ``html_text.cleaned_selector`` accepts html as text or as
``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``.
* ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns
extracted text.

If ``guess_layout`` is True (default), a newline is added before and after
``newline_tags``, and two newlines are added before and after
``double_newline_tags``. This heuristic makes the extracted text
more similar to how it is rendered in the browser. Default newline and double
newline tags can be found in ``html_text.NEWLINE_TAGS``
and ``html_text.DOUBLE_NEWLINE_TAGS``.

It is possible to customize how newlines are added, using ``newline_tags`` and
``double_newline_tags`` arguments (which are ``html_text.NEWLINE_TAGS`` and
``html_text.DOUBLE_NEWLINE_TAGS`` by default). For example, don't add a newline
after ``<div>`` tags:

    >>> newline_tags = html_text.NEWLINE_TAGS - {'div'}
    >>> html_text.extract_text('<div>Hello</div> world!',
    ...                        newline_tags=newline_tags)
    'Hello world!'
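
Because the defaults are frozensets, customizing them always produces a new set and leaves the module-level default untouched. The sketch below is a hedged illustration of that: it defines a local copy of ``NEWLINE_TAGS`` (taken from the diff of ``html_text/html_text.py`` in this PR) so it runs without html_text installed, and the choice of ``span`` as an extra tag is purely hypothetical.

```python
# Local copy of the default from this PR, so the sketch is self-contained.
NEWLINE_TAGS = frozenset([
    'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset',
    'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main',
    'nav', 'table', 'tr'
])

# Set difference: drop 'div'; the result is a new frozenset.
without_div = NEWLINE_TAGS - {'div'}

# Set union: treat an extra tag as a newline tag ('span' is illustrative).
with_span = NEWLINE_TAGS | {'span'}

print('div' in without_div)   # False
print('div' in NEWLINE_TAGS)  # True: the default is unchanged
print('span' in with_span)    # True
```

Either result could then be passed as the ``newline_tags`` argument shown in the README example above.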

Credits
-------
3 changes: 2 additions & 1 deletion html_text/__init__.py
@@ -1,3 +1,4 @@
# -*- coding: utf-8 -*-

from .html_text import extract_text, parse_html, cleaned_selector, selector_to_text
from .html_text import (extract_text, parse_html, cleaned_selector,
                        selector_to_text, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS)
156 changes: 131 additions & 25 deletions html_text/html_text.py
@@ -7,6 +7,16 @@
import parsel


NEWLINE_TAGS = frozenset([
    'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset',
    'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main',
    'nav', 'table', 'tr'
])
DOUBLE_NEWLINE_TAGS = frozenset([
    'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
    'p', 'pre', 'title', 'ul'
])

_clean_html = Cleaner(
    scripts=True,
    javascript=False,  # onclick attributes are fine
@@ -43,31 +53,105 @@ def parse_html(html):

_whitespace = re.compile(r'\s+')
_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?"\)]').search
_has_punct_before = re.compile(r'\($').search

_has_punct_after = re.compile(r'^[,:;.!?")]').search
_has_open_bracket_before = re.compile(r'\($').search
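
A small sketch of how these three patterns decide whether a space goes between two adjacent text fragments. The regexes are copied from the new lines of this hunk; the helper names `needs_space` and `join_fragments` are hypothetical and only mirror the logic of the removed `fragments()` generator, not the layout-aware code added later in the diff.

```python
import re

has_trailing_whitespace = re.compile(r'\s$').search
has_punct_after = re.compile(r'^[,:;.!?")]').search
has_open_bracket_before = re.compile(r'\($').search


def needs_space(prev, text):
    """Return True if a space should separate ``prev`` and ``text``."""
    if has_trailing_whitespace(prev):
        return True
    # No space before closing punctuation or right after an open bracket.
    return not (has_punct_after(text) or has_open_bracket_before(prev))


def join_fragments(fragments):
    """Join text fragments, guessing where spaces belong."""
    out = []
    prev = None
    for text in fragments:
        if prev is not None and needs_space(prev, text):
            out.append(' ')
        out.append(text)
        prev = text
    return re.sub(r'\s+', ' ', ''.join(out)).strip()


print(join_fragments(['Hello', ',', 'world', '!']))  # Hello, world!
print(join_fragments(['(', 'see', ')']))             # (see)
```

As the docstring of `extract_text` notes, this is just a heuristic: it only looks at the first character of the next fragment and the last character of the previous one.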

def selector_to_text(sel, guess_punct_space=True):
    """ Convert a cleaned selector to text.
    See html_text.extract_text docstring for description of the approach and options.
    """
    if guess_punct_space:

        def fragments():
            prev = None
            for text in sel.xpath('.//text()').extract():
                if prev is not None and (_has_trailing_whitespace(prev)
                                         or (not _has_punct_after(text) and
                                             not _has_punct_before(prev))):
                    yield ' '
                yield text
                prev = text
def _normalize_whitespace(text):
    return _whitespace.sub(' ', text.strip())

        return _whitespace.sub(' ', ''.join(fragments()).strip())

def _html_to_text(tree,
                  guess_punct_space=True,
                  guess_layout=True,
                  newline_tags=NEWLINE_TAGS,
                  double_newline_tags=DOUBLE_NEWLINE_TAGS):
    """
    Convert a cleaned html tree to text.
    See html_text.extract_text docstring for description of the approach
    and options.
    """
    chunks = []

    _NEWLINE = object()
    _DOUBLE_NEWLINE = object()

    class Context:
        """ workaround for missing `nonlocal` in Python 2 """
        # _NEWLINE, _DOUBLE_NEWLINE or content of the previous chunk (str)
        prev = _DOUBLE_NEWLINE

    def should_add_space(text, prev):
        """ Return True if extra whitespace should be added before text """
        if prev in {_NEWLINE, _DOUBLE_NEWLINE}:
            return False
        if not _has_trailing_whitespace(prev):
            if _has_punct_after(text) or _has_open_bracket_before(prev):
                return False
        return True

    def get_space_between(text, prev):
        if not text or not guess_punct_space:
            return ' '
        return ' ' if should_add_space(text, prev) else ''

    def add_newlines(tag, context):
        if not guess_layout:
            return
        prev = context.prev
        if prev is _DOUBLE_NEWLINE:  # don't output more than 1 blank line
            return
        if tag in double_newline_tags:
            context.prev = _DOUBLE_NEWLINE
            chunks.append('\n' if prev is _NEWLINE else '\n\n')
        elif tag in newline_tags:
            context.prev = _NEWLINE
            if prev is not _NEWLINE:
                chunks.append('\n')

    def add_text(text_content, context):
        text = _normalize_whitespace(text_content) if text_content else ''
        if not text:
            return
        space = get_space_between(text, context.prev)
        chunks.extend([space, text])
        context.prev = text_content
Contributor

What if text_content happened to be \n\n or \n? Probably I'm completely missing how this works, but I thought that either context.prev should be in 3 states: \n, \n\n or something else, or it must be equal to chunks[-1] (or even last chars of ''.join(chunks) to account for the case when two last chunks are \n).

Contributor Author

If text_content is \n\n or \n, then text is empty, so function returns earlier. Can't say I understood this before you asked, that's a great question! Probably it makes sense to have the logic more explicit, though I'm not sure how.

Contributor

Thanks for the explanation! Yes, I see that such a message won't reach this part of the code, thanks!

> Probably it makes sense to have the logic more explicit, though I'm not sure how.

I see two ways (not sure how much of an improvement they are):

  • have an integer prev_newlines variable, which can have values 0 (instead of prev = text_content), 1 (instead of \n), and 2 (instead of \n\n)
  • have an enum with 3 values

But putting a comment that explains that text_content here contains some text and cannot be equal to newlines is also fine by me.

Contributor Author

great suggestion @lopuhin, thanks! Did something along these lines here: 7a1b57b
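
For reference, the two alternatives raised in this thread (an integer counter or a three-value enum) can be combined in one hedged sketch using `IntEnum`. This is not what the merged code does: the diff above keeps sentinel objects `_NEWLINE` and `_DOUBLE_NEWLINE`. The names `Prev` and `add_newlines` below are hypothetical, and `enum` requires Python 3.4+, whereas the PR still supported Python 2.

```python
from enum import IntEnum


class Prev(IntEnum):
    """Number of newlines that currently end the output."""
    TEXT = 0            # last chunk was ordinary text
    NEWLINE = 1         # output ends with '\n'
    DOUBLE_NEWLINE = 2  # output ends with '\n\n'


def add_newlines(prev, wanted):
    """Return (string to append, new state) for a tag wanting ``wanted``."""
    # Only emit the newlines that are still missing, never more than two.
    missing = max(0, wanted - prev)
    return '\n' * missing, Prev(max(prev, wanted))


print(add_newlines(Prev.TEXT, Prev.DOUBLE_NEWLINE))     # two newlines added
print(add_newlines(Prev.NEWLINE, Prev.DOUBLE_NEWLINE))  # one more newline
print(add_newlines(Prev.DOUBLE_NEWLINE, Prev.NEWLINE))  # nothing added
```

With this encoding, the branches of `add_newlines` in the diff reduce to a subtraction, at the cost of no longer storing the previous text that `should_add_space` needs for its punctuation check.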


    def traverse_text_fragments(tree, context, handle_tail=True):
        """ Extract text from the ``tree``: fill ``chunks`` variable """
        add_newlines(tree.tag, context)
        add_text(tree.text, context)
        for child in tree:
            traverse_text_fragments(child, context)
        add_newlines(tree.tag, context)
        if handle_tail:
            add_text(tree.tail, context)

    traverse_text_fragments(tree, context=Context(), handle_tail=False)
    return ''.join(chunks).strip()
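
The traversal above can be tried out with a simplified, stdlib-only sketch. Several things here are assumptions made for illustration: `xml.etree.ElementTree` stands in for lxml (so the input must be well-formed XML), the tag sets are small subsets of the real defaults, the punctuation heuristic is left out (a single space is always added between text chunks), and a dict replaces the `Context` class. The name `tree_to_text` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative subsets of the defaults defined in this PR.
NEWLINE_TAGS = frozenset(['br', 'div', 'li', 'tr'])
DOUBLE_NEWLINE_TAGS = frozenset(['h1', 'p', 'ul'])

_NEWLINE = object()
_DOUBLE_NEWLINE = object()


def tree_to_text(root):
    chunks = []
    # Start as if a blank line was just written, so no leading newlines.
    state = {'prev': _DOUBLE_NEWLINE}

    def add_newlines(tag):
        prev = state['prev']
        if prev is _DOUBLE_NEWLINE:  # never more than one blank line
            return
        if tag in DOUBLE_NEWLINE_TAGS:
            state['prev'] = _DOUBLE_NEWLINE
            chunks.append('\n' if prev is _NEWLINE else '\n\n')
        elif tag in NEWLINE_TAGS:
            state['prev'] = _NEWLINE
            if prev is not _NEWLINE:
                chunks.append('\n')

    def add_text(content):
        text = re.sub(r'\s+', ' ', content.strip()) if content else ''
        if not text:
            return
        if state['prev'] not in (_NEWLINE, _DOUBLE_NEWLINE):
            chunks.append(' ')
        chunks.append(text)
        state['prev'] = text

    def walk(node, handle_tail=True):
        add_newlines(node.tag)   # before the element
        add_text(node.text)
        for child in node:
            walk(child)
        add_newlines(node.tag)   # after the element
        if handle_tail:
            add_text(node.tail)  # text following the closing tag

    walk(root, handle_tail=False)  # the root's tail is not extracted
    return ''.join(chunks).strip()


doc = ET.fromstring(
    '<body><h1>Hello</h1> world!<div>a</div><div>b</div></body>')
print(repr(tree_to_text(doc)))  # 'Hello\n\nworld!\na\nb'
```

Note how the two adjacent `<div>` elements produce a single newline between `a` and `b`: the closing tag of the first already set the state to `_NEWLINE`, so the opening tag of the second adds nothing.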


def selector_to_text(sel, guess_punct_space=True, guess_layout=True):
    """ Convert a cleaned selector to text.
    See html_text.extract_text docstring for description of the approach
    and options.
    """
    if isinstance(sel, parsel.SelectorList):
        # if selecting a specific xpath
        text = []
        for s in sel:
            extracted = _html_to_text(
                s.root,
                guess_punct_space=guess_punct_space,
                guess_layout=guess_layout)
            if extracted:
                text.append(extracted)
        return ' '.join(text)
    else:
        fragments = (x.strip() for x in sel.xpath('.//text()').extract())
        return _whitespace.sub(' ', ' '.join(x for x in fragments if x))
        return _html_to_text(
            sel.root,
            guess_punct_space=guess_punct_space,
            guess_layout=guess_layout)


def cleaned_selector(html):
@@ -85,18 +169,40 @@ def cleaned_selector(html):
    return sel


def extract_text(html, guess_punct_space=True):
def extract_text(html,
                 guess_punct_space=True,
                 guess_layout=True,
                 newline_tags=NEWLINE_TAGS,
                 double_newline_tags=DOUBLE_NEWLINE_TAGS):
    """
    Convert html to text, cleaning invisible content such as styles.

    Almost the same as normalize-space xpath, but this also
    adds spaces between inline elements (like <span>) which are
    often used as block elements in html markup.
    often used as block elements in html markup, and adds appropriate
    newlines to make output better formatted.

    html should be a unicode string or an already parsed lxml.html element.

    When guess_punct_space is True (default), no extra whitespace is added
    for punctuation. This has a slight (around 10%) performance overhead
    and is just a heuristic.

    html should be a unicode string or an already parsed lxml.html element.
    When guess_layout is True (default), a newline is added
    before and after ``newline_tags`` and two newlines are added before
    and after ``double_newline_tags``. This heuristic makes the extracted
    text more similar to how it is rendered in the browser.

    Default newline and double newline tags can be found in
    `html_text.NEWLINE_TAGS` and `html_text.DOUBLE_NEWLINE_TAGS`.
    """
    sel = cleaned_selector(html)
    return selector_to_text(sel, guess_punct_space=guess_punct_space)
    if html is None or len(html) == 0:
        return ''
    cleaned = _cleaned_html_tree(html)
    return _html_to_text(
        cleaned,
        guess_punct_space=guess_punct_space,
        guess_layout=guess_layout,
        newline_tags=newline_tags,
        double_newline_tags=double_newline_tags,
    )