Issue 24665: CJK support for textwrap


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68853

classification

Title:	CJK support for textwrap
Type:	enhancement	Stage:	resolved
Components:	Library (Lib), Unicode	Versions:	Python 3.8

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, fgallaire, georg.brandl, mdk, methane, r.david.murray, serhiy.storchaka, terry.reedy, vstinner, xi2, yan12125
Priority:	normal	Keywords:	patch

Created on 2015-07-19 02:44 by fgallaire, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
CJK.patch	fgallaire, 2015-07-19 02:44		review
CJK+fix.patch	fgallaire, 2015-09-13 13:57		review

Pull Requests
URL	Status	Linked	Edit
PR 89	closed	fgallaire, 2017-02-14 05:52
PR 5649	closed	mdk, 2018-02-13 01:07
PR 28136	open	xi2, 2021-09-03 08:13

Messages (24)
msg246930 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-07-19 03:28
This is a new feature and can be added only in 3.6. Issue12499 looks related. See also issue12568.
msg246932 - (view)	Author: Florent Gallaire (fgallaire) *	Date: 2015-07-19 07:13
Bad wrapping of CJK chars is a bug. I don't understand why Python2 should be broken forever!
msg247001 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-07-20 19:20
Because to get proper unicode support, we wrote python3, and because handling anything other than single-character-width characters in textwrap is a new feature.
msg247008 - (view)	Author: Florent Gallaire (fgallaire) *	Date: 2015-07-21 00:03
FUD about Python here is something I wasn't expecting. Python 2 supports Unicode and is still used a lot by a lot of people. CJK people are not subhumans, so not supporting CJK is something called, wait... a bug! And it's a shame that it was not fixed earlier. Python 3 has this bug too, so it's not really what I would call "proper unicode support".
msg247009 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-07-21 01:13
The problem (if I'm understanding this correctly, which I may not be, I'm not a unicode expert) is that how you compute and manipulate CJK characters in python2 differs depending on whether you are dealing with a wide build or a narrow build. And the fact that python3 doesn't handle it either is why this would be a new feature (see the referenced issues). But I could be wrong. I leave it to the unicode experts.
msg247016 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-07-21 06:18
The fact that textwrap does not take into account the width of characters is not the only problem. It also does not take into account combining characters and control codes. Implementing all this will significantly complicate the code and possibly should live outside the standard library. Such a large change is obviously a new feature. I believe that we should first provide a common interface to determine the width of a line (issue12499) and allow the appropriate algorithm to be chosen at the application level. We could also provide helper functions like those in issue12568.
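To make the point about character counts concrete, here is a minimal sketch using only the standard `textwrap` and `unicodedata` modules. The sample strings are chosen for illustration and do not come from the issue or its patches.

```python
import textwrap
import unicodedata

wide = "\u65e5\u672c\u8a9e"   # three CJK ideographs (illustrative sample)
combined = "e\u0301"          # 'e' followed by COMBINING ACUTE ACCENT

# len() counts code points, not display columns.
wide_classes = [unicodedata.east_asian_width(c) for c in wide]
combining_classes = [unicodedata.combining(c) for c in combined]
print(len(wide), wide_classes)
print(len(combined), combining_classes)

# textwrap measures lines with len(), so each output line holds six
# code points, however many columns they occupy on screen.
lines = textwrap.wrap(wide * 4, width=6)
print([len(line) for line in lines])
```

On a terminal that draws these ideographs two columns wide, each wrapped line would occupy roughly twelve columns rather than six.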
msg247022 - (view)	Author: Florent Gallaire (fgallaire) *	Date: 2015-07-21 09:17
If your unicode experts still haven't fixed this BUG by now, it will never be done (by these experts). We can say they are not true unicode experts, as they have forgotten billions of CJK people for such a long time!
msg247025 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-07-21 12:42
Or perhaps we haven't had a CJK user interested enough in using textwrap to provide the needed enhancements. It seems like there is interest in solving the related problems recently, so perhaps some progress will be made now. The fact that you view it as a bug does not mean that it is a bug from the point of view of the python project. Every enhancement could be considered a bug, but we have a strict backward compatibility policy for maintenance releases (which python 2.7.x is) which follows the semantic versioning principle that the API does not change in a micro release. According to Serhiy, fixing this correctly will require API changes.
msg276275 - (view)	Author: Florent Gallaire (fgallaire) *	Date: 2016-09-13 14:00
CJKwrap, a little lib to fix this bug: https://github.com/fgallaire/cjkwrap
msg284068 - (view)	Author: Florent Gallaire (fgallaire) *	Date: 2016-12-27 07:03
Hello everybody, This is a Python3 version of my lib CJKwrap: https://github.com/fgallaire/cjkwrap3 It could be integrated as a new lib in Python 3.7. People who are using textwrap will have no surprises, and people who want CJK width support will be happy to use it. Any remarks welcome. Cheers
msg287736 - (view)	Author: Florent Gallaire (fgallaire) *	Date: 2017-02-14 05:52
After discussion with Haypo, CJK support is now implemented as an option, disabled by default for backward compatibility reasons. PR on GitHub: https://github.com/python/cpython/pull/89
msg287753 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-14 10:01
Florent Gallaire: Since Python 3.0 was released in 2008, new features are no longer accepted in Python 2 (Python 2.7, the last release of Python 2). It's a deliberate choice, mostly motivated by the lack of CPython developers. See also PEP 404. I'm only talking about Python core and its builtin stdlib: it became very easy to extend Python with third party modules. We even added ensurepip to Python 2.7.9, even if we didn't want to add new features to Python 2. So just create a module on PyPI as you did, make it work on Python 2.7+ and you are done ;-) In 2016, INADA Naoki and Xiang Zhang got promoted to Python core developers: Naoki is Japanese and Xiang is Chinese. Maybe they would help on CJK issues. Please remember that Python core developers are volunteers contributing to Python in their free time. Have a nice day :-)
msg287756 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-02-14 10:06
See also the now old issue #12568: "Add functions to get the width in columns of a character".
msg287820 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-15 06:23
Sorry, I'm not a unicode expert. An important use of textwrap is printing in a terminal, so I think we should learn from terminal-related software. tmux uses utf8proc. utf8proc calculates display width with the scripts here: https://github.com/JuliaLang/utf8proc/tree/master/data
msg287821 - (view)	Author: (yan12125) *	Date: 2017-02-15 06:30
Some CJK characters are marked as "ambiguous width". It seems that in this patch ambiguous characters are assumed to be narrow. Maybe it's better to document it?
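A minimal sketch of what treating ambiguous characters as narrow means in practice. The helper `column_width` and its `ambiguous_width` parameter are hypothetical names invented for this illustration; they are not part of `textwrap` or of the patch under review.

```python
import unicodedata


def column_width(text, ambiguous_width=1):
    """Estimate display columns from East Asian Width classes.

    ambiguous_width is a hypothetical knob: 1 treats class 'A'
    (ambiguous) characters as narrow, 2 treats them as wide.
    """
    total = 0
    for char in text:
        eaw = unicodedata.east_asian_width(char)
        if eaw in ("W", "F"):      # wide or fullwidth
            total += 2
        elif eaw == "A":           # ambiguous: depends on context and font
            total += ambiguous_width
        else:                      # narrow, halfwidth, neutral
            total += 1
    return total


alpha_class = unicodedata.east_asian_width("\u03b1")  # GREEK SMALL LETTER ALPHA
narrow_estimate = column_width("\u03b1\u03b2")
wide_estimate = column_width("\u03b1\u03b2", ambiguous_width=2)
print(alpha_class, narrow_estimate, wide_estimate)
```

Which value is appropriate depends on the terminal and font, which is presumably why documenting the choice was suggested.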
msg287822 - (view)	Author: Inada Naoki (methane) *	Date: 2017-02-15 06:55
FYI, I implemented a textwrap that respects EAW in the Bazaar project. See here: http://bazaar.launchpad.net/~bzr-pqm/bzr/bzr.dev/revision/5874
msg288785 - (view)	Author: Inada Naoki (methane) *	Date: 2017-03-02 08:25
See also http://www.unicode.org/reports/tr29/ http://www.unicode.org/reports/tr14/
msg297563 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-07-03 11:16
I removed the dependency on bpo-12568 since the current PR uses unicodedata.east_asian_width(), not the C function wcswidth().
msg312076 - (view)	Author: Julien Palard (mdk) *	Date: 2018-02-12 20:48
I reread issue6755, issue12485, issue12499, and issue12568 about the textwrap/char width topic, and came to these conclusions:
- It's a hard topic [1][2], so we may not succeed in a single shot.
- The work is already done by wcwidth in the libc, which does not exist on Windows and may not exist on macOS.
- The work is already partially done for CJK in the Unicode standard.
So I'm OK with adding CJK support to textwrap as a first step, which means I'm not OK with the CJK parameter to the wrap function, as maybe in the future we'll do more. I'm also not OK with this not being the default: a lot of code using textwrap does not know in advance whether it needs CJK. It should not have to care; it should be done right by default.
But having CJK support by default also means we'll have to fall back silently to a non-unicode textwrap if unicodedata is not available, as Victor said in PR-89 "Python requires optparse to compile modules like unicodedata, optparse imports textwrap which now always requires unicodedata", which may or may not lead to surprising behavior.
[1] The unicode standard is not clear about character width (not its role, font dependent), only where it's OK to put the line break.
[2] CJK is not enough, there are plenty of other characters of width not equal to one, like combining characters, tabulations, non-printables, Prepended_Concatenation_Mark having a typical width of zero but sometimes 1, U+00AD being tricky, Hangul Jamo medial vowels and final consonants being "conjoining", and so on... This needs a huge effort / attention; if we're going down this hole, it means a lot of maintenance / new issues about this or that character being reported with the "wrong" size.
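The silent fallback described above could look roughly like the following sketch. It is an assumption about the general shape of such a fallback, not the code from PR 89 or PR 5649; `text_width` is a hypothetical helper name.

```python
try:
    from unicodedata import east_asian_width
except ImportError:
    # Assumed fallback: unicodedata is an extension module and may be
    # missing early in a build, so treat every character as narrow.
    def east_asian_width(char):
        return "N"


def text_width(text):
    """Hypothetical helper: count wide and fullwidth characters as 2."""
    return sum(2 if east_asian_width(c) in ("W", "F") else 1 for c in text)


print(text_width("abc"), text_width("\u65e5\u672c"))
```

With the fallback active, `text_width` degrades to `len`, which is the "surprising behavior" risk: the same input wraps differently depending on whether `unicodedata` could be imported.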
msg321283 - (view)	Author: Inada Naoki (methane) *	Date: 2018-07-08 18:25
I'm not an expert in this area, but Korean support is still totally broken:
>>> import unicodedata
>>> s = "\u1100\u1161\u11a8"
>>> unicodedata.east_asian_width(s[0])
'W'
>>> unicodedata.east_asian_width(s[1])
'N'
>>> unicodedata.east_asian_width(s[2])
'N'
>>> s
'각'
>>> s[:2]
'가'
>>> print(s[:2] + '\n' + s[2:])
가
ᆨ
>>> print(s[:1] + '\n' + s[1:])
ᄀ
ᅡᆨ
I think "CJK support" is not a good name for just using east_asian_width. PR-5649 doesn't use the word "CJK" in APIs, but uses it in the commit log (PR title) and NEWS entry. Maybe it should be "Textwrap now uses ``unicodedata.east_asian_width()`` to calculate text width when unicodedata is available."
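The same sequence also shows why summing per-code-point widths goes wrong for conjoining Jamo. A minimal sketch, assuming the common convention of 2 columns for 'W'/'F' and 1 column for everything else:

```python
import unicodedata

s = "\u1100\u1161\u11a8"  # leading consonant + medial vowel + final consonant

# Each code point is measured on its own, although the three together
# are normally rendered as one syllable block.
per_code_point_width = sum(
    2 if unicodedata.east_asian_width(c) in ("W", "F") else 1 for c in s
)
print(per_code_point_width)
```

The sum counts the vowel and final consonant as one column each, even though a renderer that joins the sequence draws a single syllable block, so both the width estimate and any break placed inside the sequence are off.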
msg321284 - (view)	Author: Julien Palard (mdk) *	Date: 2018-07-08 18:32
Hi Inada, you're right, and that's more or less why I did not use CJK in the implementation: mainly I don't want to close the door to future enhancements on this topic (char width) for non-CJK languages (like those "invisible hyphens" that exist only to tell text wrappers where they can wrap inside words, and so on). I'll remove the CJK mentions (sadly maybe not today).
msg321291 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-07-08 22:52
I think that this issue should be closed, as it is based on some confusions and errors.
Textwrap works in terms of characters. The wrap method "wraps the single paragraph in text (a string) so every line is at most width characters long." When the module was written, 'character' meant "printable ascii (or 'extended ascii') character". It now means 'unicode codepoint'. Both are mentally real abstractions but have no particular correspondence to physical length. Calling textwrap buggy because it works in characters is wrong.
Translating 'character' to 'fixed-width character space', so that one can measure physical length in terms of 'spaces' as a physical unit, is exact if and only if all characters are displayed in the same width space. This is true for fixed pitch output devices that simulate typewriters and text that only uses fixed-width characters from a fixed pitch font. For long lines, the translation for variable-pitch fonts may or may not be good enough for a particular use.
As David noted, textwrap already fails for Ascii control characters. And it does cause problems when they are used in wrapped text. They are coded with 2 or 4 characters on input and may display as 0, 1, (possibly 2), 4, or 5 characters on output, depending on the display code and the display 'device'. As to the latter, "print('x\ax')" displays as 'xx' in a Windows console and as something like xx in a tkinter Text widget, except that the numbers in the box here on Firefox are not present, so that the tk box is (sort-of) the same width as 'x'.
The particular premise of this issue is that CJK characters are somehow special and that 2.x releases, and now 3.x releases, are particularly broken for CJK text. Not so. If one has text that only uses same-width characters in a fixed-pitch CJK font (including wide spaces so columns line up), then textwrap works as well as it does for any other fixed-pitch text (ie, Ascii or Latin1).
If one wants lines of a particular physical width, one passes a character width argument that corresponds to the desired physical width.
The following is based on what I see in IDLE's Settings dialog Font page font sample for Windows 10 'Source Code Pro'. It includes samples from 12 'alphabets'. To view it, run 'python -m idlelib', and on the top menu click Options => Configure IDLE. When the selected font is not a full BMP unicode font, Tk and Windows use other fonts, scaled to the same height, to synthesize a fairly complete BMP 'font'. The [Help] text says a bit more but has a mistake.
What I see:
Font size corresponds to physical height. Hence, the lines are very close to the same height. Some fonts look smaller or larger because they specify more or less blank space between lines. One factor is the use of descenders, as in Arabic.
Character width for a fixed height varies. 20 characters in Greek, IPA, Hebrew, and Arabic take progressively less physical space than 20 Ascii or Latin1 characters. 20 characters in Devanagari, Cyrillic, and Tamil take progressively more. (The Tamil line only has 14 chars.) None of these are obviously fixed pitch.
The Chinese, Korean, and Japanese samples have a fixed pitch. The characters are not actually 'double-wide', at least not relative to most other languages. The 10 CJK characters are as wide as 16 Source Code Pro characters. To match the physical width of 72 Ascii spaces, one should pass 'width=45'. But note that the exact ratio for Ascii depends on the font. It is a little higher for Courier and Lucida Console. It ranges from about 1 (for Arabic) to 2 (for Tamil) for other languages. The first 10 Tamil characters are slightly wider than the 10 CJK characters, so counting each CJK character as two average Tamil characters is completely wrong.
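The width=45 figure follows from the measured ratio of 10 CJK characters to 16 Source Code Pro characters. A minimal sketch of that arithmetic, assuming the ratio holds for the font in use (it is font-specific); the sample character and variable names are arbitrary:

```python
import textwrap

ascii_columns = 72
# 10 CJK characters were measured as wide as 16 Source Code Pro
# characters, so one CJK character is about 1.6 Ascii columns here.
cjk_to_ascii_ratio = 16 / 10
cjk_width = round(ascii_columns / cjk_to_ascii_ratio)
print(cjk_width)

text = "\u6f22" * 100  # 100 copies of one ideograph, no spaces
lines = textwrap.wrap(text, width=cjk_width)
print([len(line) for line in lines])
```

With a different font the ratio changes, and so does the width to pass; nothing in `textwrap` itself needs to change for pure same-width CJK text.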
My conclusion: the proposal is unnecessary for pure CJK text; it is wrong in hard-coding a fix only for CJK; the CJK fix is wrong in hard-coding a particular ratio, in particular, one that is at the extreme end of the range of possibilities. Therefore, I think the open PR should be closed.
I also think this issue should be closed in favor of #12499, which proposed to allow users to pass a transform function suitable for their particular use case. If that is implemented, and we then decide to add some sample functions, or rather, function factories, and to include one specifically for CJK, then a new PR will be needed, and a new issue would be appropriate. A more generic function factory for text with characters of two width classes might have as inputs a condition to identify '2nd language characters' and their fixed or average width relative to the 'first' language.
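Such a function factory might be sketched as follows. All names here (`make_width_function`, `is_cjk_ideograph`) are hypothetical, `textwrap` has no parameter that accepts such a function, and the predicate is deliberately simplified to a single Unicode block.

```python
def make_width_function(is_second_class, relative_width):
    """Return a function estimating width in 'first language' units.

    is_second_class: predicate identifying '2nd language characters'.
    relative_width: their fixed or average width relative to the rest.
    """
    def width(text):
        return sum(relative_width if is_second_class(c) else 1 for c in text)
    return width


def is_cjk_ideograph(char):
    # Simplifying assumption: only the CJK Unified Ideographs block.
    return "\u4e00" <= char <= "\u9fff"


# 1.6 is the font-specific ratio reported earlier in this message.
cjk_aware_width = make_width_function(is_cjk_ideograph, 1.6)
print(cjk_aware_width("abc"), cjk_aware_width("\u6f22\u5b57"))
```

Passing the ratio as an argument avoids hard-coding 2, which is the objection raised against the PR.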
msg321304 - (view)	Author: Inada Naoki (methane) *	Date: 2018-07-09 08:58
Thanks, Terry. I doubt whether "east asian width" support should be merged. While I agree it is "better than nothing", it may make textwrap much slower. But I didn't have the courage to reject the PR. Textwrap focuses on ASCII and English-like (space separated) languages. "Supporting unicode" is a very hard problem. We should consider grapheme clusters (UAX29), east asian width (UAX11) (but utf8proc is better than UAX11), and the line breaking algorithm (UAX14). For wrapping text on a terminal, some terminal emulators and multiplexers (e.g. tmux) implement a much nicer algorithm. I think a 3rd party C extension based on the algorithm used by tmux is the best solution. If someone really wants this feature, please try it on PyPI. I understand "want it in stdlib!". But text wrapping is a very hard, complicated problem. Since the stdlib grows slowly, and backward compatibility restricts us, I think it should be implemented in a 3rd party library first.
msg323733 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2018-08-18 20:59
My msg323731 for #12568 refers to my msg321291 above. I did some new experiments with column spacing for European characters in Windows console (as opposed to tk Text) and discovered that some, including some Latin1 characters used in English text, may also become double-width in certain fonts. The problem of calculating physical line lengths is even harder than I thought before, and does not require non-English text.

History
Date	User	Action	Args
2022-04-11 14:58:19	admin	set	github: 68853
2021-09-03 08:13:33	xi2	set	nosy: + xi2 pull_requests: + pull_request26575
2018-08-18 20:59:18	terry.reedy	set	messages: + msg323733
2018-07-09 08:58:24	methane	set	status: open -> closed messages: + msg321304 dependencies: - textwrap.wrap: add control for fonts with different character widths resolution: rejected stage: patch review -> resolved
2018-07-08 22:52:05	terry.reedy	set	nosy: + terry.reedy messages: + msg321291 title: Use unicodedata.east_asian_width in textwrap -> CJK support for textwrap
2018-07-08 18:32:07	mdk	set	messages: + msg321284
2018-07-08 18:31:02	methane	set	title: CJK support for textwrap -> Use unicodedata.east_asian_width in textwrap versions: + Python 3.8, - Python 3.7
2018-07-08 18:25:55	methane	set	messages: + msg321283
2018-02-13 01:07:48	mdk	set	stage: patch review pull_requests: + pull_request5451
2018-02-13 01:07:41	mdk	set	stage: (no value) pull_requests: + pull_request5450
2018-02-12 20:48:33	mdk	set	nosy: + mdk messages: + msg312076
2017-07-03 11:17:10	pitrou	set	nosy: - pitrou
2017-07-03 11:16:09	vstinner	set	dependencies: - Add functions to get the width in columns of a character messages: + msg297563
2017-03-02 08:25:02	methane	set	messages: + msg288785
2017-02-15 07:44:17	serhiy.storchaka	set	dependencies: + textwrap.wrap: add control for fonts with different character widths, Add functions to get the width in columns of a character
2017-02-15 06:55:27	methane	set	messages: + msg287822
2017-02-15 06:30:38	yan12125	set	nosy: + yan12125 messages: + msg287821
2017-02-15 06:23:34	methane	set	nosy: + methane messages: + msg287820
2017-02-14 10:06:59	vstinner	set	messages: + msg287756
2017-02-14 10:01:02	vstinner	set	messages: + msg287753
2017-02-14 05:52:56	fgallaire	set	messages: + msg287736 pull_requests: + pull_request56
2016-12-27 07:03:44	fgallaire	set	messages: + msg284068 versions: + Python 3.7, - Python 3.6
2016-09-13 14:00:47	fgallaire	set	messages: + msg276275
2015-09-13 13:57:35	fgallaire	set	files: + CJK+fix.patch
2015-07-21 12:42:58	r.david.murray	set	messages: + msg247025
2015-07-21 09:17:07	fgallaire	set	messages: + msg247022
2015-07-21 06:18:28	serhiy.storchaka	set	messages: + msg247016
2015-07-21 01:13:28	r.david.murray	set	nosy: + ezio.melotti, vstinner messages: + msg247009 components: + Unicode
2015-07-21 00:03:21	fgallaire	set	messages: + msg247008
2015-07-20 19:20:01	r.david.murray	set	nosy: + r.david.murray messages: + msg247001 versions: + Python 3.6, - Python 2.7
2015-07-19 07:13:12	fgallaire	set	messages: + msg246932 versions: + Python 2.7, - Python 3.6
2015-07-19 03:28:30	serhiy.storchaka	set	nosy: + georg.brandl, serhiy.storchaka, pitrou messages: + msg246930 versions: + Python 3.6, - Python 2.7
2015-07-19 02:44:39	fgallaire	create