Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-24665: double-width CJK chars support for textwrap #89

Closed
wants to merge 8 commits into from

Conversation

fgallaire
Copy link

@fgallaire fgallaire commented Feb 14, 2017

  • Add ckj option flag, default to False
  • Add cjk_wide(), cjk_len() and cjk_slices() utilities

* Add ckj option flag, default to False
* Add cjkwide(), cjklen() and cjkslices() utilities
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:

  1. If you don't have an account on b.p.o, please create one
  2. Make sure your GitHub username is listed in "Your Details" at b.p.o
  3. If you have not already done so, please sign the PSF contributor agreement
  4. If you just signed the CLA, please wait at least one US business day and then check "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
  5. Reply here saying you have completed the above steps

Thanks again to your contribution and we look forward to looking at it!

@fgallaire
Copy link
Author

fgallaire commented Feb 14, 2017

I now have listed my GitHub username at bpo and i had already signed the CLA.

Lib/textwrap.py Outdated
@@ -114,6 +117,7 @@ class TextWrapper:

def __init__(self,
width=70,
cjk=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These arguments can be passed as positional argument.
So new argument should be added last.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

Lib/textwrap.py Outdated
i = 1
# <= and i-1 to catch the last double length char of odd line
while cjklen(text[:i]) <= index:
i = i + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this O(n^2) algorithm.

w = 0
for i, c in enumerate(text):
    w += cjkwide(c) + 1
    if w > index:
        break

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very relevant point.

@fgallaire fgallaire force-pushed the master branch 2 times, most recently from e52db22 to 5c4b2af Compare February 14, 2017 09:26
Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to document the name option in Doc/library/textwrap.rst, don't forget ".. versionchanged:: 3.7" markup.

Lib/textwrap.py Outdated
@@ -139,6 +145,7 @@ def __init__(self,
self.max_lines = max_lines
self.placeholder = placeholder

self.len = cjklen if self.cjk else len
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to make it private and use a different name than len: self._text_width() for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing functions mean that you must document them, write unit tests, and that someone (maybe not you) have to maintain them. I don't think that it's worth it. Let's try with something simple, make these functions private.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

@@ -365,7 +380,7 @@ def fill(self, text):

# -- Convenience interface ---------------------------------------------

def wrap(text, width=70, **kwargs):
def wrap(text, width=70, cjk=False, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not need to repeat cjk here, there is already a generic **kwargs. Same for other functions below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same need as width: to be visible so easily usable.

Lib/textwrap.py Outdated
"""Return True if char is Fullwidth or Wide, False otherwise.
Fullwidth and Wide CJK chars are double-width.
"""
return unicodedata.east_asian_width(char) in ('F', 'W')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write {'F', 'W'}, it's optimized as a constant frozenset.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really faster than a tuple ?

Lib/textwrap.py Outdated
return w.fill(' '.join(text.strip().split()))


# -- CJK support ------------------------------------------------------

def cjkwide(char):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add _ in the name: cjk_wide().

Note sure about the name: is_cjk_wide_char()?

Do we really need to make the function?

Lib/textwrap.py Outdated
return unicodedata.east_asian_width(char) in ('F', 'W')


def cjklen(text):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to cfk_width()?

Lib/textwrap.py Outdated
"""Return the real width of text (its len if not a string).
"""
if not isinstance(text, str):
return len(text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this case. Why do you pass a non-string to the function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for people who want to do:
from textwrap import cjklen as len
and use cjklen transparently

Lib/textwrap.py Outdated
return sum(2 if cjkwide(char) else 1 for char in text)


def cjkslices(text, index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it's better to make all these new functions private: add _ prefix. Rename to _cjk_slices().

"index": is it a number of character or a width? Maybe rename to width?

Copy link
Author

@fgallaire fgallaire Feb 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I really want this functions to be exported, they are useful

@vstinner
Copy link
Member

By email I proposed a different design. Make existing TextWrapper extensible: add text_width() and truncate() methods. Then add a new TextWrapperCJK which overrides these methods. It would allow to reuse the code for other cases than just CJK.

Lib/textwrap.py Outdated
# Written by Greg Ward <[email protected]>

import re
import re, unicodedata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One import per line please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

Lib/textwrap.py Outdated
@@ -139,6 +145,7 @@ def __init__(self,
self.max_lines = max_lines
self.placeholder = placeholder

self.len = cjklen if self.cjk else len
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing functions mean that you must document them, write unit tests, and that someone (maybe not you) have to maintain them. I don't think that it's worth it. Let's try with something simple, make these functions private.

@vstinner
Copy link
Member

Your change breaks Python build: Python requires optparse to compile modules like unicodedata, optparse imports textwrap which now always requires unicodedata.

Using two different classes, it would be simpler to only import unicodedata when the TextWrapperCJK class is instanciated, "on demand", and so fix the bootstrap issue.

@fgallaire
Copy link
Author

fgallaire commented Feb 15, 2017

What about a depreciation warning to inform that cjk default will switch to True in Python 3.8 ?

@fgallaire
Copy link
Author

fgallaire commented Feb 15, 2017

optparse is deprecated since Python 3.2 so it should not drive this work, but a port to argparse will not solve this problem because of the same cycling dependency.

@codecov
Copy link

codecov bot commented Feb 15, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@984eef7). Click here to learn what that means.
The diff coverage is 72.5%.

@@            Coverage Diff            @@
##             master      #89   +/-   ##
=========================================
  Coverage          ?   82.38%           
=========================================
  Files             ?     1428           
  Lines             ?   351193           
  Branches          ?        0           
=========================================
  Hits              ?   289333           
  Misses            ?    61860           
  Partials          ?        0

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 984eef7...54de7aa. Read the comment docs.

Copy link
Member

@methane methane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree supporting east_asian_wdith is preferable, I don't like current API.
It's OK for third party library. But for standard library, I prefer more generic API and implementation.

While "practical beats purity", I want glibc's wcwidth for UTF-8 at least. (bonus: option like ambiwidth=double in vim)


.. function:: cjk_len(text)

Return the real width of *text* (its len if not a string).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like wording real.
How about zero-width space? How about combining character sequence? variation selector?
EMOJI MODIFIER?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're probably right. The real purpose is to help to understand that width (cjk_len) and len are different. What do you think about visual instead ?

@yan12125
Copy link
Contributor

I want glibc's wcwidth for UTF-8 at least

Disagreed. I guess text editors and terminals are top two applications in textwrap. I known little about text editors. In terminals, things are quite interesting. Different terminals use different versions of EastAsianWidth.txt. Here are some examples:

  • tmux, QTerminal: use wcwidth() by default. In the latest glibc, it's still Unicode 8.0. [1] On Mac, it's reported that wcwidth() is broken [2].
  • VTE (gnome-terminal, xfce4-terminal, etc.): Unicode 9.0 [3]
  • Konsole: Unicode 5.0 [4]

Different Unicode versions can lead to different results for textwrap. Take U+231A (clock emoji) for example. In Unicode 8.0 it's NEUTRAL. Most implementations see it as width=1. In Unicode 9.0 it's changed to WIDE.

In CPython, handling multiple Unicode versions sounds impractical. IMO chasing the latest Unicode 9.0 is a good idea. If a terminal emulator is not compatible with Unicode 9.0, it should be fixed.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=20313
[2] tmux/tmux#515
[3] https://bugzilla.gnome.org/show_bug.cgi?id=771591
[4] https://github.com/KDE/konsole/blob/master/src/konsole_wcwidth.cpp#L55

@yan12125
Copy link
Contributor

Also, I would recommend rename cjk_foobar stuffs to unicode_foobar. CJK characters occupy a large portion on Unicode tables but not the whole.

@methane
Copy link
Member

methane commented Feb 16, 2017

@yan12125 sorry, my wording was wrong. (I'm not good at English).
When I said "I want glibc's wcwidth for UTF-8 at least", I meant
"if textwrap supports display width, I think it should be good as wcwidth at least.".
I didn't meant neither "textwrap should use wcwidth" nor "textwrap should implement algorithm
exactly same to wcwidth."

BTW, pull request is not place for discussion like this.
Please go to http://bugs.python.org/issue24665

Fullwidth and Wide CJK chars are double-width.
"""
import unicodedata
return unicodedata.east_asian_width(char) in ('F', 'W')

This comment was marked as resolved.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, @duboviy or @Haypo, could you explain to me why/how a tuple could be slower than a frozenset.

def cjk_len(text):
"""Return the real width of text (its len if not a string).
"""
if not isinstance(text, str):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange case handling, maybe we should expect only string type text argument in this function...

Copy link
Author

@fgallaire fgallaire Mar 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: it's for an handy replacement of the built-in len():
from textwrap import cjk_len as len

@@ -495,6 +495,7 @@ Lele Gaifax
Santiago Gala
Yitzchak Gale
Matthew Gallagher
Florent Gallaire
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that makes lots of sense for such a small change (even though such changes are very welcome of course!)

There are commits from loads of people in this repo, but way less entries in this file (adding every contributor would quickly make the file pretty chaotic).

IMO it'd be better to only add yourself to this file when you contribute e.g. a major fix, new feature, etc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly disagree, if persons are missing they should be added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Florent definitely belongs in ACKS. We have been fairly liberal in adding people, some for as little as a well-crafted sentence or line of code.

Lib/textwrap.py Outdated
@@ -428,6 +427,7 @@ def cjk_wide(char):
"""Return True if char is Fullwidth or Wide, False otherwise.
Fullwidth and Wide CJK chars are double-width.
"""
import unicodedata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think importing unicodedata, when wanted, for each char, is wrong. If not imported at the top, it could be imported as a global in TextWrapper.init when cjk is true. (I am assuming that the convenience functions all instantiate TextWrapper.

akruis pushed a commit to akruis/cpython that referenced this pull request Sep 9, 2017
…er shutdown code

Fix the implementation to work as documented if a thread dies. Now
Stackless kills only tasklets with a nesting level > 0. During interpreter
shutdown stackless additionally kills daemon threads, if they execute Python code or
switch tasklets. This prevents access violations after clearing the thread states.

Add a 1ms sleep for each daemon thread. This way the thread gets a better
chance to run. Thanks to Kristján Valur Jónsson for suggesting this improvement.

Add a test case for a C assertion violation during thread shutdown.
Add some explanatory comments to the shutdown test case.
Thanks to Christian Tismer for the sugestion.

https://bitbucket.org/stackless-dev/stackless/issues/81
https://bitbucket.org/stackless-dev/stackless/issues/89
(grafted from 2cc41427a347a1ebec4eedc3db06a3664e67f798, 1388a2003957,
6140f5aaca2c and edc9b92ec457)
akruis pushed a commit to akruis/cpython that referenced this pull request Sep 9, 2017
@brettcannon
Copy link
Member

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

@methane
Copy link
Member

methane commented Jul 8, 2018

I close this because I don't like APIs this PR has and the author seems satisfied by
putting his library on PyPI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants