-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix unwanted joins for inline tags #2
Conversation
Thanks @codinguncut for suggestion. Still needs testing. re.sub is replicating xpath's normalize-space behaviour. See GH-1
python 2 does not cache re.sub regexps, and it's faster even on python 3
Codecov Report
@@ Coverage Diff @@
## master #2 +/- ##
=====================================
Coverage 100% 100%
=====================================
Files 2 2
Lines 26 42 +16
Branches 1 6 +5
=====================================
+ Hits 26 42 +16
Continue to review full report at Codecov.
|
in my form github/fluquid/html-text I tried the following: But it doesnt yet seem to work propery with within-string multi-whitespace, or newlines/tabs for that matter. your code looks fine, but understandable that re.sub wld add overhead. Johannes |
@codinguncut it seems that the main overhead comes not from re.sub, but from iterating over
That's an interesting approach, I'll check it out, thanks! |
i'll try also, both solutions will add spaces before comma, period, etc. in the context of tags: |
To handle punctuation there is https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61, but it may add even more overhead. It may be fine to provide it as an option though. |
Ah, and it also removes all spaces before punctuation, no only caused by joining, so maybe it is not the way to go. |
d17ec6c
to
e833357
Compare
@codinguncut to be honest, I didn't understand this xpath - it just returns all elements with some text, right?
Nice suggestion, thanks @kmike ! I implemented this as an option in e833357 (edit: f020f4b), it works only on tag boundaries, so only spaces caused by joining would be affected. The overhead is not that huge, for 1k pages total time is 11.7 s vs 12.5 s (and 8.17 s vs 9.21 s when working on already parsed trees). Maybe it's ok to make it default? |
This is similar to webstruct.utils.smart_joins (https://github.com/scrapinghub/webstruct/blob/5a3f39e2ec78a04ca021a12dff58f66686d86251/webstruct/utils.py#L61), but is applied only on the tag boundaries. This mode is just a little bit slower than default.
e833357
to
f020f4b
Compare
It's fine to apply whitespace cleaning regexp at the end
Nice! +1 to enable punctuation handling by default. _trailing_whitespace = re.compile(r'\s$')
# ...
if _trailing_whitespace.search(...): One can write this, to save an attribute lookup in a tight loop: _has_trailing_whitespace = re.compile(r'\s$').search
# ...
if _has_trailing_whitespace(...): |
d71a8ee
to
1fb2ec4
Compare
Thanks @codinguncut and @kmike , merged with punctuation handling enabled by default. |
yes, i still don't 100% understand xpath syntax. |
Maybe @redapple can share his experience. I think this issue is very much related to scrapy/parsel#34. |
|
||
def fragments(): | ||
prev = None | ||
for text in sel.xpath('//text()').extract(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend using './/text()'
so that it can be used for any selector, and not only those coming from extract_text(html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea, thanks @redapple - I'd like to also make it possible to pass selectors via the public interface.
@codinguncut ,
...on top of
And indeed, @lopuhin , @kmike , this may not be the right issue for my comment here as it may be more about doing html2text-ish transformation than plaintext, but what I found handy in the past was:
You can find some (ugly, pre-knowing-of-normalize-space) code in parslepy. |
@codecov-io @lopuhin , you may also be interested in this answer I wrote some time ago on whitespace and XPath's |
Fixes #1 - see examples by @codinguncut there. Inline tags are commonly used as block tags, and current
normalize-space()
results in unwanted joining of words - this branch fixes it by always adding whitespace between tags.Checked old vs. new way on about 1000 html pages, on average the text is longer by 0.2% characters, with most pages having some difference. In all cases I checked (about 10 pages) the new way is better, unsplitting words that were joined without spaces, and I didn't find any unwanted splits.
The speed is almost 2x slower though: 7 s for 1000 html pages before, 11.5 s without regexp, 12.5 s with regexp (and caching). But I guess it's not that bad.
@codinguncut @kmike I would appreciate your review :) I have some vague memory that in some cases
//text()
is not what we want, but I can't recall in which and I didn't see anything bad in the tests I did.