Title: CJK support for textwrap
Type: enhancement Stage: patch review
Components: Library (Lib), Unicode Versions: Python 3.7
Status: open Resolution:
Dependencies: 12499 Superseder:
Assigned To: Nosy List: ezio.melotti, fgallaire, georg.brandl, inada.naoki, mdk, r.david.murray, serhiy.storchaka, vstinner, yan12125
Priority: normal Keywords: patch

Created on 2015-07-19 02:44 by fgallaire, last changed 2018-02-13 01:07 by mdk.

File name Uploaded Description Edit
CJK.patch fgallaire, 2015-07-19 02:44 review
CJK+fix.patch fgallaire, 2015-09-13 13:57 review
Pull Requests
URL Status Linked Edit
PR 89 open fgallaire, 2017-02-14 05:52
PR 5649 open mdk, 2018-02-13 01:07
Messages (19)
msg246930 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-07-19 03:28
This is a new feature and can be added only in 3.6.

Issue12499 looks related. See also issue12568.
msg246932 - (view) Author: Florent Gallaire (fgallaire) * Date: 2015-07-19 07:13
Bad wrapping of CJK chars is a bug.
I don't understand why Python2 should be broken forever!
msg247001 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-07-20 19:20
Because to get proper unicode support, we wrote python3, and because handling anything other than single-character-width characters in textwrap is a new feature.
msg247008 - (view) Author: Florent Gallaire (fgallaire) * Date: 2015-07-21 00:03
FUD about Python here is something I wasn't expecting.

Python 2 supports Unicode and is still used a lot by a lot of people.

CJK people are not subhumans, so not supporting CJK is something called, wait... a bug! And it's a shame that it was not fixed earlier.

Python 3 has this bug too, so it's not really what I would call a "proper unicode support".
msg247009 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-07-21 01:13
The problem (if I'm understanding this correctly, which I may not be, I'm not a unicode expert) is that how you compute and manipulate CJK characters in python2 differs depending on whether you are dealing with a wide build or a narrow build.  And the fact that python3 doesn't handle it either is why this would be a new feature (see the referenced issues).

But I could be wrong.  I leave it to the unicode experts.
msg247016 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-07-21 06:18
That textwrap does not take the width of characters into account is not the only problem. It also does not take combining characters and control codes into account. Implementing all this will significantly complicate the code and possibly should live outside the standard library. Such a large change is obviously a new feature. I believe that we should first provide a common interface to determine the width of a line (issue12499) and allow the appropriate algorithm to be chosen at the application level. We should also provide helper functions like those in issue12568.
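To make the three concerns above concrete (wide characters, combining characters, control codes), here is a minimal sketch of what a line-width helper could look like. The name `display_width` is hypothetical and is not the interface proposed in issue12499 or issue12568. It assumes that Wide and Fullwidth characters occupy two columns and that combining marks and control/format codes occupy none, which is a simplification (a tab, for instance, is counted as zero here).

```python
import unicodedata


def display_width(text):
    """Return an approximate column count for text (hypothetical helper)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.category(ch) in ("Cc", "Cf"):
            # Assumption: control and format codes take no columns.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and Fullwidth characters take two terminal columns.
            width += 2
        else:
            width += 1
    return width
```

With this sketch, `display_width("日本語")` counts six columns where `len()` reports three, and `"e\u0301"` (e followed by a combining acute accent) counts one column where `len()` reports two.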
msg247022 - (view) Author: Florent Gallaire (fgallaire) * Date: 2015-07-21 09:17
If your unicode experts still haven't fixed this BUG by now, it will never be done (by these experts).

We can say they are not true unicode experts, as they have forgotten billions of CJK people for such a long time!
msg247025 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-07-21 12:42
Or perhaps we haven't had a CJK user interested enough in using textwrap to provide the needed enhancements.  It seems like there is interest in solving the related problems recently, so perhaps some progress will be made now.

The fact that you view it as a bug does not mean that it is a bug from the point of view of the python project.  Every enhancement could be considered a bug, but we have a strict backward compatibility policy for maintenance releases (which python 2.7.x is) which follows the semantic versioning principle that the API does not change in a micro release.  According to Serhiy, fixing this correctly will require API changes.
msg276275 - (view) Author: Florent Gallaire (fgallaire) * Date: 2016-09-13 14:00
CJKwrap, a little lib to fix this bug:
msg284068 - (view) Author: Florent Gallaire (fgallaire) * Date: 2016-12-27 07:03
Hello everybody,

This is a Python3 version of my lib CJKwrap:

It could be integrated as a new lib in Python 3.7.

People who are using textwrap will have no surprises, and people who want CJK width support will be happy to use it.

Any remarks welcome.

msg287736 - (view) Author: Florent Gallaire (fgallaire) * Date: 2017-02-14 05:52
After discussion with Haypo, CJK support is now implemented as an option, disabled by default for backward compatibility reasons.

PR on GitHub:
msg287753 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-14 10:01
Florent Gallaire: Since Python 3.0 was released in 2008, new features are no longer accepted in Python 2 (Python 2.7 is the last release of Python 2). It's a deliberate choice, mostly motivated by the lack of CPython developers. See also PEP 404.

I'm only talking about Python core and its builtin stdlib: it has become very easy to extend Python with third-party modules. We even added ensurepip to Python 2.7.9, even though we didn't want to add new features to Python 2. So just create a module on PyPI as you did, make it work on Python 2.7+ and you are done ;-)

In 2016, INADA Naoki and Xiang Zhang were promoted to Python core developers: Naoki is Japanese and Xiang is Chinese. Maybe they can help with CJK issues.

Please remember that Python core developers are volunteers contributing to Python in their free time.

Have a nice day :-)
msg287756 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-14 10:06
See also the now old issue #12568: "Add functions to get the width in columns of a character".
msg287820 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-15 06:23
Sorry, I'm not a Unicode expert.

An important use of textwrap is printing to a terminal.
So I think we should learn from terminal-related software.

tmux uses utf8proc.  utf8proc calculates display width with a script, here.
msg287821 - (view) Author: Chih-Hsuan Yen (yan12125) * Date: 2017-02-15 06:30
Some CJK characters are marked as "ambiguous width". It seems that in this patch ambiguous characters are assumed to be narrow. Maybe it's better to document that?
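As an illustration of the ambiguous-width question, the sketch below maps the East Asian Width property to a column count and makes the treatment of the "A" (Ambiguous) class an explicit choice. The function name `column_width` and the `ambiguous_is_wide` flag are hypothetical and are not taken from the patch. The default mirrors the behaviour described above, where ambiguous characters count as narrow.

```python
import unicodedata


def column_width(ch, ambiguous_is_wide=False):
    """Return 1 or 2 columns for a single character (hypothetical helper)."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        # Wide and Fullwidth are always two columns.
        return 2
    if eaw == "A" and ambiguous_is_wide:
        # Ambiguous characters are wide only in an East Asian context.
        return 2
    # Narrow, Halfwidth, Neutral, and (by default) Ambiguous.
    return 1
```

GREEK SMALL LETTER ALPHA (U+03B1) is one such ambiguous character: it comes out as one column by default and two with `ambiguous_is_wide=True`.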
msg287822 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-02-15 06:55
FYI, I implemented a textwrap that respects EAW (East Asian Width) in the Bazaar project.
See here.
msg288785 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-03-02 08:25
See also
msg297563 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-07-03 11:16
I removed the dependency on bpo-12568 since the current PR uses unicodedata.east_asian_width(), not the C function wcswidth().
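For readers who have not looked at the PR, the following is a rough sketch of wrapping by columns using only unicodedata.east_asian_width(). It is not the code from the PR: `char_columns` and `wrap_columns` are hypothetical names, and the sketch breaks between any two characters without regard to spaces or line-breaking rules, which is only plausible for unspaced CJK text.

```python
import unicodedata


def char_columns(ch):
    """Two columns for Wide/Fullwidth characters, one otherwise."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def wrap_columns(text, width=70):
    """Greedily split text into lines of at most `width` columns."""
    lines, current, used = [], [], 0
    for ch in text:
        cols = char_columns(ch)
        # Start a new line when this character would overflow the width.
        if current and used + cols > width:
            lines.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += cols
    if current:
        lines.append("".join(current))
    return lines
```

For example, `wrap_columns("日本語テキスト", width=6)` yields three lines of at most six columns each, whereas a wrapper that counts code points would fit six wide characters (twelve columns) on a line of nominal width 6.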
msg312076 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2018-02-12 20:48
I reread issue6755, issue12485, issue12499, and issue12568 about the textwrap/char width topic, and came to these conclusions:

- It's a hard topic [1][2], so we may not succeed in a single shot.
- The work is already done by wcwidth in the libc, which does not exist on Windows and may not exist on macOS.
- The work is already partially done for CJK in the Unicode standard.

So I'm OK with adding CJK support to textwrap *as a first step*, which means I'm *not* ok with the CJK parameter to the wrap function, as maybe in the future we'll do more.

I'm also not OK with this not being the default, as much code using textwrap does not know in advance whether it needs CJK. It should not have to care; it should be done right by default.

But having CJK support by default also means we'll have to fall back silently to a non-unicode textwrap if unicodedata is not available, as Victor said in PR-89 "Python requires optparse to compile modules like unicodedata, optparse imports textwrap which now always requires unicodedata", which may or may not lead to surprising behavior.

[1] The Unicode standard is not clear about character width (that is not its role, and width is font dependent), only about where it's OK to put a line break.
[2] CJK is not enough: there are plenty of other characters whose width is not equal to one, like combining characters, tabulations, non-printables, Prepended_Concatenation_Mark characters having a typical width of zero but sometimes 1, U+00AD being tricky, Hangul Jamo medial vowels and final consonants being "conjoining", and so on... This needs a huge effort / attention: if we're going down this hole, it means a lot of maintenance and new issues reported about this or that character having the "wrong" size.
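The silent fallback mentioned above could look roughly like the sketch below. It is only an assumption about the shape of such a fallback, not the code of either PR; `char_columns` and `text_columns` are hypothetical names. If unicodedata cannot be imported (for example while the extension modules are still being built), every character is counted as one column, which is what textwrap effectively does today.

```python
try:
    import unicodedata
except ImportError:
    # Assumption: unicodedata may be missing while bootstrapping the build.
    unicodedata = None


def char_columns(ch):
    """Return the column count of one character, degrading gracefully."""
    if unicodedata is None:
        # Fallback: behave like a width-unaware wrapper.
        return 1
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def text_columns(text):
    """Return the total column count of a string."""
    return sum(char_columns(ch) for ch in text)
```

The downside is the surprising behavior already noted: the same call can return different widths depending on whether unicodedata happened to be importable.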
Date User Action Args
2018-02-13 01:07:48 mdk set stage: patch review
pull_requests: + pull_request5451
2018-02-13 01:07:41 mdk set stage: (no value)
pull_requests: + pull_request5450
2018-02-12 20:48:33 mdk set nosy: + mdk
messages: + msg312076
2017-07-03 11:17:10 pitrou set nosy: - pitrou
2017-07-03 11:16:09 vstinner set dependencies: - Add functions to get the width in columns of a character
messages: + msg297563
2017-03-02 08:25:02 inada.naoki set messages: + msg288785
2017-02-15 07:44:17 serhiy.storchaka set dependencies: + textwrap.wrap: add control for fonts with different character widths, Add functions to get the width in columns of a character
2017-02-15 06:55:27 inada.naoki set messages: + msg287822
2017-02-15 06:30:38 yan12125 set nosy: + yan12125
messages: + msg287821
2017-02-15 06:23:34 inada.naoki set nosy: + inada.naoki
messages: + msg287820
2017-02-14 10:06:59 vstinner set messages: + msg287756
2017-02-14 10:01:02 vstinner set messages: + msg287753
2017-02-14 05:52:56 fgallaire set messages: + msg287736
pull_requests: + pull_request56
2016-12-27 07:03:44 fgallaire set messages: + msg284068
versions: + Python 3.7, - Python 3.6
2016-09-13 14:00:47 fgallaire set messages: + msg276275
2015-09-13 13:57:35 fgallaire set files: + CJK+fix.patch
2015-07-21 12:42:58 r.david.murray set messages: + msg247025
2015-07-21 09:17:07 fgallaire set messages: + msg247022
2015-07-21 06:18:28 serhiy.storchaka set messages: + msg247016
2015-07-21 01:13:28 r.david.murray set nosy: + ezio.melotti, vstinner
messages: + msg247009
components: + Unicode
2015-07-21 00:03:21 fgallaire set messages: + msg247008
2015-07-20 19:20:01 r.david.murray set nosy: + r.david.murray
messages: + msg247001
versions: + Python 3.6, - Python 2.7
2015-07-19 07:13:12 fgallaire set messages: + msg246932
versions: + Python 2.7, - Python 3.6
2015-07-19 03:28:30 serhiy.storchaka set nosy: + georg.brandl, serhiy.storchaka, pitrou
messages: + msg246930
versions: + Python 3.6, - Python 2.7
2015-07-19 02:44:39 fgallaire create