bpo-24665: double-width CJK chars support for textwrap #89

fgallaire · 2017-02-14T05:45:54Z

Add cjk option flag, default to False
Add cjk_wide(), cjk_len() and cjk_slices() utilities

the-knights-who-say-ni · 2017-02-14T05:45:55Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:

If you don't have an account on b.p.o, please create one
Make sure your GitHub username is listed in "Your Details" at b.p.o
If you have not already done so, please sign the PSF contributor agreement
If you just signed the CLA, please wait at least one US business day and then check "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
Reply here saying you have completed the above steps

Thanks again for your contribution and we look forward to looking at it!

fgallaire · 2017-02-14T05:57:17Z

I have now listed my GitHub username at b.p.o, and I had already signed the CLA.

methane · 2017-02-14T08:11:28Z

Lib/textwrap.py

@@ -114,6 +117,7 @@ class TextWrapper:

    def __init__(self,
                 width=70,
+                 cjk=False,


These arguments can be passed positionally,
so a new argument should be added last.
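
A minimal sketch of the concern, using hypothetical stand-in classes rather than the real TextWrapper (only the first two existing parameters, width and initial_indent, are reproduced). A caller that already passes initial_indent positionally silently changes meaning if the new flag is inserted second:

```python
class WrapperFlagSecond:
    """Hypothetical: the new flag is inserted right after width."""

    def __init__(self, width=70, cjk=False, initial_indent=""):
        self.width = width
        self.cjk = cjk
        self.initial_indent = initial_indent


class WrapperFlagLast:
    """Hypothetical: the new flag is appended after the existing parameters."""

    def __init__(self, width=70, initial_indent="", cjk=False):
        self.width = width
        self.initial_indent = initial_indent
        self.cjk = cjk


# An existing call site that passes the indent as the second positional argument.
broken = WrapperFlagSecond(40, "> ")   # "> " lands in cjk, not initial_indent
safe = WrapperFlagLast(40, "> ")       # behaves as before

print(repr(broken.cjk), repr(broken.initial_indent))
print(repr(safe.cjk), repr(safe.initial_indent))
```

Making the new parameter keyword-only would be another way to avoid the problem, though nothing in this review proposes it.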

methane · 2017-02-14T08:17:35Z

Lib/textwrap.py

+    i = 1
+    # <= and i-1 to catch the last double length char of odd line
+    while cjklen(text[:i]) <= index:
+        i = i + 1


I don't like this O(n^2) algorithm.

    w = 0
    for i, c in enumerate(text):
        w += cjkwide(c) + 1
        if w > index:
            break

Very relevant point.
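
For reference, a self-contained sketch of the single-pass idea. The names char_width and slice_at_width are illustrative and are not the PR's functions; the reviewer's snippet relies on cjkwide() returning a bool that is added as 0 or 1, whereas this version spells the widths out:

```python
import unicodedata


def char_width(ch):
    """Return 2 for Fullwidth/Wide characters, 1 otherwise."""
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1


def slice_at_width(text, width):
    """Split text into (head, tail) in one pass.

    head is the longest prefix whose display width does not exceed width.
    """
    used = 0
    for i, ch in enumerate(text):
        used += char_width(ch)
        if used > width:
            return text[:i], text[i:]
    return text, ""


print(slice_at_width("日本語", 4))
print(slice_at_width("日本語", 3))
print(slice_at_width("abc", 5))
```

Each character is measured once, so the cost is linear in the length of the text instead of re-measuring a growing prefix on every iteration.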

vstinner

You need to document the new option in Doc/library/textwrap.rst; don't forget the ".. versionchanged:: 3.7" markup.

vstinner · 2017-02-14T09:36:00Z

Lib/textwrap.py

@@ -139,6 +145,7 @@ def __init__(self,
        self.max_lines = max_lines
        self.placeholder = placeholder

+        self.len = cjklen if self.cjk else len


I suggest making it private and using a name other than len: self._text_width() for example.

Exposing functions means that you must document them, write unit tests, and that someone (maybe not you) has to maintain them. I don't think that it's worth it. Let's try with something simple: make these functions private.

vstinner · 2017-02-14T09:36:53Z

Lib/textwrap.py

@@ -365,7 +380,7 @@ def fill(self, text):

 # -- Convenience interface ---------------------------------------------

-def wrap(text, width=70, **kwargs):
+def wrap(text, width=70, cjk=False, **kwargs):


There is no need to repeat cjk here; there is already a generic **kwargs. Same for the other functions below.

It serves the same need as width: being visible makes it easy to use.
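
To illustrate the forwarding the reviewer refers to, here is a small check with options that already exist in the standard library (max_lines and placeholder stand in for the proposed flag, which is not part of textwrap):

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog."

# Options not named in wrap()'s signature travel through **kwargs
# to the TextWrapper constructor.
via_function = textwrap.wrap(text, width=20, max_lines=2, placeholder=" ...")
via_class = textwrap.TextWrapper(
    width=20, max_lines=2, placeholder=" ..."
).wrap(text)

print(via_function == via_class)
```

The trade-off discussed above is discoverability: an explicit parameter shows up in the signature, while a **kwargs option only shows up in the documentation.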

vstinner · 2017-02-14T09:38:48Z

Lib/textwrap.py

+    """Return True if char is Fullwidth or Wide, False otherwise.
+    Fullwidth and Wide CJK chars are double-width.
+    """
+    return unicodedata.east_asian_width(char) in ('F', 'W')


You can write {'F', 'W'}; it's optimized as a constant frozenset.

Is it really faster than a tuple?
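
One way to see what the reviewer describes, assuming CPython (this is a compiler implementation detail, not a language guarantee), is to inspect the constants of the compiled functions. The function names are illustrative:

```python
def in_tuple(code):
    return code in ("F", "W")


def in_set(code):
    return code in {"F", "W"}


# On CPython the set literal used only for a membership test is stored
# as a frozenset constant, so it is not rebuilt on every call.
set_consts = [c for c in in_set.__code__.co_consts if isinstance(c, frozenset)]
print(set_consts)
print(in_tuple("W"), in_set("W"), in_set("N"))
```

This only shows that neither form is rebuilt per call. Whether the hash lookup is measurably faster than scanning a two-element tuple would need a timeit run on the target interpreter.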

vstinner · 2017-02-14T09:39:25Z

Lib/textwrap.py

    return w.fill(' '.join(text.strip().split()))


+# -- CJK support ------------------------------------------------------
+
+def cjkwide(char):


Please add _ in the name: cjk_wide().

Not sure about the name: is_cjk_wide_char()?

Do we really need to make the function?

vstinner · 2017-02-14T09:39:43Z

Lib/textwrap.py

+    return unicodedata.east_asian_width(char) in ('F', 'W')
+
+
+def cjklen(text):


Rename to cjk_width()?

vstinner · 2017-02-14T09:40:12Z

Lib/textwrap.py

+    """Return the real width of text (its len if not a string).
+    """
+    if not isinstance(text, str):
+        return len(text)


I don't understand this case. Why do you pass a non-string to the function?

This is for people who want to write:
    from textwrap import cjklen as len
and use cjklen transparently.
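
A standalone sketch of the behaviour being discussed, reconstructed from the diff fragments quoted in this review rather than copied from the PR (the name cjk_len follows the later revision of the patch):

```python
import unicodedata


def cjk_len(text):
    """Return the display width of a string, or len() for anything else."""
    if not isinstance(text, str):
        # Fallback that lets the function stand in for the built-in len().
        return len(text)
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in text
    )


print(cjk_len("abc"))
print(cjk_len("日本語"))
print(cjk_len([1, 2, 3]))
```

The non-string branch is what makes shadowing len workable, and it is also the part the reviewers question: a width function that silently accepts lists hides type errors.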

vstinner · 2017-02-14T09:41:11Z

Lib/textwrap.py

+    return sum(2 if cjkwide(char) else 1 for char in text)
+
+
+def cjkslices(text, index):


IMHO it's better to make all these new functions private: add _ prefix. Rename to _cjk_slices().

"index": is it a number of characters or a width? Maybe rename to width?

No, I really want these functions to be exported; they are useful.

vstinner · 2017-02-14T09:46:36Z

By email I proposed a different design: make the existing TextWrapper extensible by adding text_width() and truncate() methods, then add a new TextWrapperCJK that overrides these methods. That would allow the code to be reused for cases other than just CJK.
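
A rough sketch of that shape, using a toy greedy wrapper instead of the real TextWrapper (which has no such hooks). The class names and the character-level wrapping are assumptions made for illustration; only text_width() and truncate() come from the proposal:

```python
import unicodedata


class SimpleWrapper:
    """Toy wrapper: breaks text every `width` columns, ignoring word boundaries."""

    def __init__(self, width=70):
        if width < 1:
            raise ValueError("width must be at least 1")
        self.width = width

    def text_width(self, text):
        """Hook: number of columns text occupies."""
        return len(text)

    def truncate(self, text, width):
        """Hook: split text into (head, tail) with head fitting in width columns."""
        return text[:width], text[width:]

    def wrap(self, text):
        lines = []
        while text:
            head, text = self.truncate(text, self.width)
            lines.append(head)
        return lines


class SimpleWrapperCJK(SimpleWrapper):
    """Overrides only the two hooks; wrap() is inherited unchanged."""

    def text_width(self, text):
        return sum(
            2 if unicodedata.east_asian_width(ch) in {"F", "W"} else 1
            for ch in text
        )

    def truncate(self, text, width):
        used = 0
        for i, ch in enumerate(text):
            used += self.text_width(ch)
            if used > width:
                cut = max(i, 1)  # always consume at least one character
                return text[:cut], text[cut:]
        return text, ""


print(SimpleWrapper(4).wrap("abcdefgh"))
print(SimpleWrapperCJK(4).wrap("日本語テキスト"))
```

The wrapping loop never mentions CJK, so another subclass could plug in a different width rule without touching it.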

vstinner · 2017-02-14T09:47:22Z

Lib/textwrap.py

 # Written by Greg Ward <[email protected]>

-import re
+import re, unicodedata


One import per line please.

vstinner · 2017-02-14T09:49:12Z

Lib/textwrap.py

@@ -139,6 +145,7 @@ def __init__(self,
        self.max_lines = max_lines
        self.placeholder = placeholder

+        self.len = cjklen if self.cjk else len


Exposing functions means that you must document them, write unit tests, and that someone (maybe not you) has to maintain them. I don't think that it's worth it. Let's try with something simple: make these functions private.

vstinner · 2017-02-14T09:51:25Z

Your change breaks the Python build: Python requires optparse to compile modules like unicodedata, and optparse imports textwrap, which now always requires unicodedata.

With two different classes, it would be simpler to import unicodedata only when the TextWrapperCJK class is instantiated, "on demand", and so fix the bootstrap issue.
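
A minimal sketch of the on-demand import, assuming a hypothetical helper class (WidthCounter is not part of the PR). Nothing at module level touches unicodedata, so importing the module would not pull in the extension during bootstrap:

```python
class WidthCounter:
    """Hypothetical helper: measures text, importing unicodedata only if needed."""

    def __init__(self, cjk=False):
        self.cjk = cjk
        if cjk:
            # Deferred import: executed once per instance, not once per character.
            import unicodedata
            self._east_asian_width = unicodedata.east_asian_width

    def width(self, text):
        if not self.cjk:
            return len(text)
        return sum(
            2 if self._east_asian_width(ch) in ("F", "W") else 1
            for ch in text
        )


print(WidthCounter().width("日本"))
print(WidthCounter(cjk=True).width("日本"))
```

Storing the bound function on the instance also avoids repeating the import statement inside the per-character helper.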

fgallaire · 2017-02-15T01:16:58Z

What about a deprecation warning to inform users that the cjk default will switch to True in Python 3.8?

fgallaire · 2017-02-15T02:05:55Z

optparse has been deprecated since Python 3.2, so it should not drive this work, but a port to argparse will not solve this problem because of the same circular dependency.

codecov · 2017-02-15T03:09:40Z

Codecov Report

❗ No coverage uploaded for pull request base (master@984eef7).
The diff coverage is 72.5%.

@@            Coverage Diff            @@
##             master      #89   +/-   ##
=========================================
  Coverage          ?   82.38%           
=========================================
  Files             ?     1428           
  Lines             ?   351193           
  Branches          ?        0           
=========================================
  Hits              ?   289333           
  Misses            ?    61860           
  Partials          ?        0

methane

While I agree that supporting east_asian_width is preferable, I don't like the current API.
It's OK for a third-party library, but for the standard library I prefer a more generic API and implementation.

While "practical beats purity", I want glibc's wcwidth for UTF-8 at least. (bonus: option like ambiwidth=double in vim)

methane · 2017-02-15T05:43:55Z

Doc/library/textwrap.rst

+
+.. function:: cjk_len(text)
+
+   Return the real width of *text* (its len if not a string).


I don't like the word "real".
What about zero-width spaces? Combining character sequences? Variation selectors?
EMOJI MODIFIER?

You're probably right. The real purpose is to help readers understand that width (cjk_len) and len are different. What do you think about "visual" instead?
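
The objection can be made concrete with a short comparison. Both function names are illustrative; the second one is a rough approximation that skips combining marks and format characters, not an implementation of wcwidth, and it still does not handle emoji modifiers:

```python
import unicodedata

WIDE = ("F", "W")


def east_asian_only_width(text):
    """Width as computed from East Asian Width alone (the PR's approach)."""
    return sum(2 if unicodedata.east_asian_width(ch) in WIDE else 1 for ch in text)


def approx_display_width(text):
    """Approximation: also treat Mn/Me/Cf characters as zero-width."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks, variation selectors, zero-width space
        width += 2 if unicodedata.east_asian_width(ch) in WIDE else 1
    return width


accented = "e\u0301"      # "e" followed by COMBINING ACUTE ACCENT
with_zwsp = "a\u200bb"    # ZERO WIDTH SPACE between two letters

print(east_asian_only_width(accented), approx_display_width(accented))
print(east_asian_only_width(with_zwsp), approx_display_width(with_zwsp))
```

In both cases the East-Asian-Width-only count is one column larger than what a terminal would typically display, which is why a word such as "real" overstates what the function measures.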

yan12125 · 2017-02-16T17:24:03Z

"I want glibc's wcwidth for UTF-8 at least"

Disagreed. I guess text editors and terminals are the top two applications of textwrap. I know little about text editors. In terminals, things are quite interesting. Different terminals use different versions of EastAsianWidth.txt. Here are some examples:

tmux, QTerminal: use wcwidth() by default. In the latest glibc, it's still Unicode 8.0. [1] On Mac, it's reported that wcwidth() is broken [2].
VTE (gnome-terminal, xfce4-terminal, etc.): Unicode 9.0 [3]
Konsole: Unicode 5.0 [4]

Different Unicode versions can lead to different results for textwrap. Take U+231A (the watch emoji) for example. In Unicode 8.0 it's NEUTRAL, and most implementations treat it as width=1. In Unicode 9.0 it was changed to WIDE.

In CPython, handling multiple Unicode versions sounds impractical. IMO chasing the latest Unicode 9.0 is a good idea. If a terminal emulator is not compatible with Unicode 9.0, it should be fixed.

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=20313
[2] tmux/tmux#515
[3] https://bugzilla.gnome.org/show_bug.cgi?id=771591
[4] https://github.com/KDE/konsole/blob/master/src/konsole_wcwidth.cpp#L55
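
The version dependence can be checked from Python itself. This sketch assumes an interpreter whose unicodedata is based on Unicode 9.0 or later (Python 3.6 and newer); the ucd_3_2_0 object is the frozen Unicode 3.2.0 database that the module also exposes:

```python
import unicodedata

# Unicode version of the database this interpreter was built with.
print(unicodedata.unidata_version)

# East Asian Width class of U+231A WATCH in the current database.
watch_class = unicodedata.east_asian_width("\u231a")
print(watch_class)

# The same lookup against the frozen Unicode 3.2.0 database.
old_class = unicodedata.ucd_3_2_0.east_asian_width("\u231a")
print(old_class)
```

A wrapper built on unicodedata therefore follows whatever Unicode version the running interpreter ships, which may differ from the tables used by the terminal displaying the text.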

yan12125 · 2017-02-16T17:26:55Z

Also, I would recommend renaming the cjk_foobar names to unicode_foobar. CJK characters occupy a large portion of the Unicode tables, but not the whole.

methane · 2017-02-16T18:36:08Z

@yan12125 sorry, my wording was wrong. (I'm not good at English.)
When I said "I want glibc's wcwidth for UTF-8 at least", I meant
"if textwrap supports display width, I think it should be at least as good as wcwidth."
I meant neither "textwrap should use wcwidth" nor "textwrap should implement exactly the same
algorithm as wcwidth."

BTW, a pull request is not the place for a discussion like this.
Please go to http://bugs.python.org/issue24665

fgallaire · 2017-03-12T11:52:08Z

Lib/textwrap.py

+    Fullwidth and Wide CJK chars are double-width.
+    """
+    import unicodedata
+    return unicodedata.east_asian_width(char) in ('F', 'W')


Please, @duboviy or @Haypo, could you explain to me why/how a tuple could be slower than a frozenset?

duboviy · 2017-03-11T21:20:13Z

Lib/textwrap.py

+def cjk_len(text):
+    """Return the real width of text (its len if not a string).
+    """
+    if not isinstance(text, str):


Strange case handling, maybe we should expect only string type text argument in this function...

Again: it's for a handy replacement of the built-in len():
from textwrap import cjk_len as len

duboviy · 2017-03-11T21:25:32Z

Misc/ACKS

@@ -495,6 +495,7 @@ Lele Gaifax
 Santiago Gala
 Yitzchak Gale
 Matthew Gallagher
+Florent Gallaire


Not sure if that makes lots of sense for such a small change (even though such changes are very welcome of course!)

There are commits from loads of people in this repo, but way less entries in this file (adding every contributor would quickly make the file pretty chaotic).

IMO it'd be better to only add yourself to this file when you contribute e.g. a major fix, new feature, etc

I strongly disagree; if people are missing, they should be added.

Florent definitely belongs in ACKS. We have been fairly liberal in adding people, some for as little as a well-crafted sentence or line of code.

terryjreedy · 2017-04-27T19:32:23Z

Lib/textwrap.py

@@ -428,6 +427,7 @@ def cjk_wide(char):
    """Return True if char is Fullwidth or Wide, False otherwise.
    Fullwidth and Wide CJK chars are double-width.
    """
+    import unicodedata


I think importing unicodedata, when wanted, for each char, is wrong. If not imported at the top, it could be imported as a global in TextWrapper.__init__ when cjk is true. (I am assuming that the convenience functions all instantiate TextWrapper.)

brettcannon · 2018-02-21T01:53:27Z

To try and help move older pull requests forward, we are going through and backfilling 'awaiting' labels on pull requests that are lacking the label. Based on the current reviews, the best we can tell in an automated fashion is that a core developer requested changes to be made to this pull request.

If/when the requested changes have been made, please leave a comment that says, I have made the requested changes; please review again. That will trigger a bot to flag this pull request as ready for a follow-up review.

methane · 2018-07-08T17:38:37Z

I am closing this because I don't like the APIs this PR adds, and the author seems satisfied with
putting his library on PyPI.

bpo-24665: double-width CJK chars support for textwrap

6ea78e3

* Add cjk option flag, default to False
* Add cjkwide(), cjklen() and cjkslices() utilities

the-knights-who-say-ni added the CLA not signed label Feb 14, 2017

zware removed the CLA not signed label Feb 14, 2017

the-knights-who-say-ni added the CLA signed label Feb 14, 2017

methane requested changes Feb 14, 2017

fgallaire force-pushed the master branch 2 times, most recently from e52db22 to 5c4b2af Compare February 14, 2017 09:26

vstinner requested changes Feb 14, 2017

fgallaire force-pushed the master branch from 4745076 to 5c4b2af Compare February 14, 2017 10:19

fgallaire added 7 commits February 15, 2017 03:32

Fix TextWrapper positionnal arguments

aa94f26

Fix one import per line

0264d9d

Rename self.len() in self._width()

d630821

Rename CJK functions with _

bfdfb22

Improve cjk_slices() complexity from O(n^2) to O(n) (Thanks to INADA Naoki)

8337ce5

Add Doc for new CJK option and functions

cb9812b

Fix Python build problems

54de7aa

fgallaire force-pushed the master branch from 8575fb4 to 54de7aa Compare February 15, 2017 02:44

methane reviewed Feb 15, 2017

duboviy suggested changes Mar 11, 2017

terryjreedy reviewed Apr 27, 2017

akruis pushed a commit to akruis/cpython that referenced this pull request Sep 9, 2017

merge 3.4-slp (Stackless python#81, python#89)

3c981be

JulienPalard mentioned this pull request Feb 13, 2018

bpo-24665: Add CJK support in textwrap by default. #5649

Closed

brettcannon added the awaiting changes label Feb 21, 2018

methane closed this Jul 8, 2018

alvinhochun mentioned this pull request Oct 22, 2021

Wrapping of PO file should take into account the East Asian Width (wide/fullwidth) translate/translate#4452

Closed

jaraco pushed a commit that referenced this pull request Dec 2, 2022

Update cherry_picker from 1.0.0 to 1.1.1 (GH-89)

c0393de

jaraco added a commit to jaraco/cpython that referenced this pull request Feb 17, 2023

Build docs without docutils. Closes python#89.

c95a0d8

gvanrossum mentioned this pull request Aug 22, 2023

heap-use-after-free in _PyFunction_LookupByVersion #108253

Closed

ngoldbaum mentioned this pull request Sep 23, 2024

Crash running PyO3 tests with --test-threads=1000 #124375

Closed
