-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an option to guess page layout (try to preserve some of the formatting) #11
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11 +/- ##
==========================================
- Coverage 100% 97.82% -2.18%
==========================================
Files 2 2
Lines 42 92 +50
Branches 6 17 +11
==========================================
+ Hits 42 90 +48
- Misses 0 2 +2
Continue to review full report at Codecov.
|
* prev is always a string now, never a 1-element list * unify newline and text handling between text and tail * another workaround for mutable variable in the outer scope (Context class) * append to a list instead of using a generator
hey @Kebniss! The refactoring is ready; could you please run your benchmark on it, to check that it is not much slower? |
guess_punct_space doesn't provide good output in this case => xfail
Performance now is better than before in 2/3 cases :)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great feature to add, and I like the docs update.
I think it's fine to enable guess_layout
by default (or at least not take backwards compatibility into account when deciding whether to leave it on or off).
To be honest I didn't quite get some details in the algorithm which handles newlines, I left a question, and will read into it again.
Two functions that do it are ``html_text.cleaned_selector`` and | ||
``html_text.selector_to_text``: | ||
NB Selectors are not cleaned automatically you need to call | ||
``html_text.cleaned_selector`` first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I didn't realize this, 👍
------------------ | ||
|
||
* ``guess_layout`` option to to make extracted text look more like how | ||
it is rendered in browser. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we enable it by default? There is no speed hit and the text looks nicer (almost always).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I also wonder about this. Let's enable it by default.
html_text/html_text.py
Outdated
|
||
class Context: | ||
""" workaround for missing `nonlocal` in Python 2 """ | ||
prev = '\n\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can chunks[-1]
be used instead? Is this added to avoid checking the case when chunks
is empty? Edit: actually I see it's not always equal, please see next comment and sorry for any misunderstanding on my side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I actually tried it, and was unable to make it work, at least quick enough :)
return | ||
space = get_space_between(text, context.prev) | ||
chunks.extend([space, text]) | ||
context.prev = text_content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if text_content
happened to be \n\n
or \n
? Probably I'm completely missing how this works, but I though that either context.prev
should be in 3 states: \n
, \n\n
or something else, or it must be equal to chunks[-1]
(or even last chars of ''.join(chunks)
to account for the case when two last chunks are \n
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If text_content
is \n\n
or \n
, then text
is empty, so function returns earlier. Can't say I understood this before you asked, that's a great question! Probably it makes sense to have the logic more explicit, though I'm not sure how.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Yes, I see that such message won't reach this code part, thanks!
Probably it makes sense to have the logic more explicit, though I'm not sure how.
I see two ways (not sure how much of an improvement they are):
- have an integer
prev_newlines
variable, which can have values 0 (instead ofprev = text_content
), 1 (instead of\n
), and 2 (instead of\n\n
) - have a enum with 3 values
But putting a comment that explains that text_content
here contains some text and can not be equal to newlines is also fine by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…ment text directly
not worths it to use six.string_types
This is a follow-up to #9, with minor tweaks.