-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-24665: double-width CJK chars support for textwrap #89
Conversation
fgallaire
commented
Feb 14, 2017
•
edited
Loading
edited
- Add ckj option flag, default to False
- Add cjk_wide(), cjk_len() and cjk_slices() utilities
* Add ckj option flag, default to False * Add cjkwide(), cjklen() and cjkslices() utilities
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:
Thanks again to your contribution and we look forward to looking at it! |
I now have listed my GitHub username at bpo and i had already signed the CLA. |
Lib/textwrap.py
Outdated
@@ -114,6 +117,7 @@ class TextWrapper: | |||
|
|||
def __init__(self, | |||
width=70, | |||
cjk=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These arguments can be passed as positional argument.
So new argument should be added last.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
Lib/textwrap.py
Outdated
i = 1 | ||
# <= and i-1 to catch the last double length char of odd line | ||
while cjklen(text[:i]) <= index: | ||
i = i + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this O(n^2) algorithm.
w = 0
for i, c in enumerate(text):
w += cjkwide(c) + 1
if w > index:
break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very relevant point.
e52db22
to
5c4b2af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to document the name option in Doc/library/textwrap.rst, don't forget ".. versionchanged:: 3.7" markup.
Lib/textwrap.py
Outdated
@@ -139,6 +145,7 @@ def __init__(self, | |||
self.max_lines = max_lines | |||
self.placeholder = placeholder | |||
|
|||
self.len = cjklen if self.cjk else len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to make it private and use a different name than len: self._text_width() for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exposing functions mean that you must document them, write unit tests, and that someone (maybe not you) have to maintain them. I don't think that it's worth it. Let's try with something simple, make these functions private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
@@ -365,7 +380,7 @@ def fill(self, text): | |||
|
|||
# -- Convenience interface --------------------------------------------- | |||
|
|||
def wrap(text, width=70, **kwargs): | |||
def wrap(text, width=70, cjk=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not need to repeat cjk here, there is already a generic **kwargs. Same for other functions below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same need as width: to be visible so easily usable.
Lib/textwrap.py
Outdated
"""Return True if char is Fullwidth or Wide, False otherwise. | ||
Fullwidth and Wide CJK chars are double-width. | ||
""" | ||
return unicodedata.east_asian_width(char) in ('F', 'W') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write {'F', 'W'}, it's optimized as a constant frozenset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really faster than a tuple ?
Lib/textwrap.py
Outdated
return w.fill(' '.join(text.strip().split())) | ||
|
||
|
||
# -- CJK support ------------------------------------------------------ | ||
|
||
def cjkwide(char): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add _ in the name: cjk_wide().
Note sure about the name: is_cjk_wide_char()?
Do we really need to make the function?
Lib/textwrap.py
Outdated
return unicodedata.east_asian_width(char) in ('F', 'W') | ||
|
||
|
||
def cjklen(text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to cfk_width()?
Lib/textwrap.py
Outdated
"""Return the real width of text (its len if not a string). | ||
""" | ||
if not isinstance(text, str): | ||
return len(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this case. Why do you pass a non-string to the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for people who want to do:
from textwrap import cjklen as len
and use cjklen transparently
Lib/textwrap.py
Outdated
return sum(2 if cjkwide(char) else 1 for char in text) | ||
|
||
|
||
def cjkslices(text, index): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it's better to make all these new functions private: add _ prefix. Rename to _cjk_slices().
"index": is it a number of character or a width? Maybe rename to width?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I really want this functions to be exported, they are useful
By email I proposed a different design. Make existing TextWrapper extensible: add text_width() and truncate() methods. Then add a new TextWrapperCJK which overrides these methods. It would allow to reuse the code for other cases than just CJK. |
Lib/textwrap.py
Outdated
# Written by Greg Ward <[email protected]> | ||
|
||
import re | ||
import re, unicodedata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One import per line please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
Lib/textwrap.py
Outdated
@@ -139,6 +145,7 @@ def __init__(self, | |||
self.max_lines = max_lines | |||
self.placeholder = placeholder | |||
|
|||
self.len = cjklen if self.cjk else len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exposing functions mean that you must document them, write unit tests, and that someone (maybe not you) have to maintain them. I don't think that it's worth it. Let's try with something simple, make these functions private.
Your change breaks Python build: Python requires optparse to compile modules like unicodedata, optparse imports textwrap which now always requires unicodedata. Using two different classes, it would be simpler to only import unicodedata when the TextWrapperCJK class is instanciated, "on demand", and so fix the bootstrap issue. |
What about a depreciation warning to inform that cjk default will switch to |
optparse is deprecated since Python 3.2 so it should not drive this work, but a port to argparse will not solve this problem because of the same cycling dependency. |
Codecov Report
@@ Coverage Diff @@
## master #89 +/- ##
=========================================
Coverage ? 82.38%
=========================================
Files ? 1428
Lines ? 351193
Branches ? 0
=========================================
Hits ? 289333
Misses ? 61860
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree supporting east_asian_wdith is preferable, I don't like current API.
It's OK for third party library. But for standard library, I prefer more generic API and implementation.
While "practical beats purity", I want glibc's wcwidth for UTF-8 at least. (bonus: option like ambiwidth=double in vim)
|
||
.. function:: cjk_len(text) | ||
|
||
Return the real width of *text* (its len if not a string). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like wording real.
How about zero-width space? How about combining character sequence? variation selector?
EMOJI MODIFIER?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're probably right. The real purpose is to help to understand that width (cjk_len) and len are different. What do you think about visual instead ?
Disagreed. I guess text editors and terminals are top two applications in textwrap. I known little about text editors. In terminals, things are quite interesting. Different terminals use different versions of
Different Unicode versions can lead to different results for In CPython, handling multiple Unicode versions sounds impractical. IMO chasing the latest Unicode 9.0 is a good idea. If a terminal emulator is not compatible with Unicode 9.0, it should be fixed. [1] https://sourceware.org/bugzilla/show_bug.cgi?id=20313 |
Also, I would recommend rename cjk_foobar stuffs to unicode_foobar. CJK characters occupy a large portion on Unicode tables but not the whole. |
@yan12125 sorry, my wording was wrong. (I'm not good at English). BTW, pull request is not place for discussion like this. |
Fullwidth and Wide CJK chars are double-width. | ||
""" | ||
import unicodedata | ||
return unicodedata.east_asian_width(char) in ('F', 'W') |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def cjk_len(text): | ||
"""Return the real width of text (its len if not a string). | ||
""" | ||
if not isinstance(text, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange case handling, maybe we should expect only string type text argument in this function...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again: it's for an handy replacement of the built-in len()
:
from textwrap import cjk_len as len
@@ -495,6 +495,7 @@ Lele Gaifax | |||
Santiago Gala | |||
Yitzchak Gale | |||
Matthew Gallagher | |||
Florent Gallaire |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if that makes lots of sense for such a small change (even though such changes are very welcome of course!)
There are commits from loads of people in this repo, but way less entries in this file (adding every contributor would quickly make the file pretty chaotic).
IMO it'd be better to only add yourself to this file when you contribute e.g. a major fix, new feature, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly disagree, if persons are missing they should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Florent definitely belongs in ACKS. We have been fairly liberal in adding people, some for as little as a well-crafted sentence or line of code.
Lib/textwrap.py
Outdated
@@ -428,6 +427,7 @@ def cjk_wide(char): | |||
"""Return True if char is Fullwidth or Wide, False otherwise. | |||
Fullwidth and Wide CJK chars are double-width. | |||
""" | |||
import unicodedata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think importing unicodedata, when wanted, for each char, is wrong. If not imported at the top, it could be imported as a global in TextWrapper.init when cjk is true. (I am assuming that the convenience functions all instantiate TextWrapper.
…er shutdown code Fix the implementation to work as documented if a thread dies. Now Stackless kills only tasklets with a nesting level > 0. During interpreter shutdown stackless additionally kills daemon threads, if they execute Python code or switch tasklets. This prevents access violations after clearing the thread states. Add a 1ms sleep for each daemon thread. This way the thread gets a better chance to run. Thanks to Kristján Valur Jónsson for suggesting this improvement. Add a test case for a C assertion violation during thread shutdown. Add some explanatory comments to the shutdown test case. Thanks to Christian Tismer for the sugestion. https://bitbucket.org/stackless-dev/stackless/issues/81 https://bitbucket.org/stackless-dev/stackless/issues/89 (grafted from 2cc41427a347a1ebec4eedc3db06a3664e67f798, 1388a2003957, 6140f5aaca2c and edc9b92ec457)
To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request. If/when the requested changes have been made, please leave a comment that says, |
I close this because I don't like APIs this PR has and the author seems satisfied by |