Message 321291 - Python tracker

Author	terry.reedy
Recipients	ezio.melotti, fgallaire, georg.brandl, mdk, methane, r.david.murray, serhiy.storchaka, terry.reedy, vstinner, yan12125
Date	2018-07-08.22:52:05
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1531090325.6.0.56676864532.issue24665@psf.upfronthosting.co.za>
In-reply-to

Content

I think that this issue should be closed, as it is based on some confusions and errors.

Textwrap works in terms of characters. The wrap method "wraps the single paragraph in text (a string) so every line is at most width characters long." When the module was written, 'character' meant "printable ASCII (or 'extended ASCII') character". It now means 'Unicode code point'. Both are mentally real abstractions but have no particular correspondence to physical length. Calling textwrap buggy because it works in characters is wrong.
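A minimal sketch of that point, using made-up sample strings: textwrap.wrap counts code points, so the 'at most width characters' guarantee holds for CJK text exactly as it does for ASCII text, whatever the physical width on screen turns out to be.

```python
import textwrap

# Sample strings are illustrative only.
ascii_text = "word " * 8      # eight short ASCII words
cjk_text = "漢字" * 10         # 20 code points, no spaces

ascii_lines = textwrap.wrap(ascii_text, width=12)
cjk_lines = textwrap.wrap(cjk_text, width=12)

# len() counts code points, the same unit textwrap uses for `width`.
for line in ascii_lines + cjk_lines:
    print(len(line), repr(line))
```

Every printed length is at most 12. How wide each line looks depends on the font and the display, which textwrap knows nothing about.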

Translating 'character' to 'fixed-width character space', so that one can measure physical length in terms of 'spaces' as a physical unit, is exact if and only if all characters are displayed in the same width space. This is true for fixed-pitch output devices that simulate typewriters and for text that only uses fixed-width characters from a fixed-pitch font. For long lines, the translation for variable-pitch fonts may or may not be good enough for a particular use.

As David noted, textwrap already fails for ASCII control characters. And it does cause problems when they are used in wrapped text. They are coded with 2 or 4 characters on input and may display as 0, 1, (possibly 2), 4, or 5 characters on output, depending on the display code and the display 'device'. As to the latter, "print('x\ax')" displays as 'xx' in a Windows console and as something like 'x□x' in a tkinter Text widget, where the box has none of the numbers that Firefox draws inside such a box, so that the tk box is (sort-of) the same width as 'x'.
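To illustrate the input side of this, here is a small sketch (the sample text is made up): the BEL character can be written with 2 source characters ('\a') or 4 ('\x07'), is stored as one code point, and is counted by textwrap as one character no matter how a given device displays it.

```python
import textwrap

two_char_escape = "x\ax"       # BEL written as the 2-character escape \a
four_char_escape = "x\x07x"    # BEL written as the 4-character escape \x07

# Both spellings produce the same 3-code-point string.
print(two_char_escape == four_char_escape, len(two_char_escape))

# textwrap counts BEL as one character, so "ab\acd" just fits in width=5,
# even though a display may show it as 4, 5, or more columns.
lines = textwrap.wrap("ab\acd ef", width=5)
print([len(line) for line in lines])
```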

The particular premise of this issue is that CJK characters are somehow special and that 2.x releases, and now 3.x releases, are particularly broken for CJK text. Not so. If one has text that only uses same-width characters in a fixed-pitch CJK font (including wide spaces so columns line up), then textwrap works as well as it does for any other fixed-pitch text (i.e., ASCII or Latin1). If one wants lines of a particular physical width, one passes a character width argument that corresponds to the desired physical width.

The following is based on what I see in the font sample on the Font page of IDLE's Settings dialog, for Windows 10 and 'Source Code Pro'. It includes samples from 12 'alphabets'. To view it, run 'python -m idlelib', and on the top menu click Options => Configure IDLE. When the selected font is not a full BMP Unicode font, Tk and Windows use other fonts, scaled to the same height, to synthesize a fairly complete BMP 'font'. The [Help] text says a bit more but has a mistake. What I see:

Font size corresponds to physical height. Hence, the lines are very close to the same height. Some fonts look smaller or larger because they specify more or less blank space between lines. One factor is the use of descenders, as in Arabic.

Character width for a fixed height varies. 20 characters in Greek, IPA, Hebrew, and Arabic take progressively less physical space than 20 ASCII or Latin1 characters. 20 characters in Devanagari, Cyrillic, and Tamil take progressively more. (The Tamil line only has 14 chars.) None of these are obviously fixed pitch.

The Chinese, Korean, and Japanese samples have a fixed pitch. The characters are *not* actually 'double-wide', at least not relative to most other languages. The 10 CJK characters are as wide as 16 Source Code Pro characters. To match the physical width of 72 ASCII spaces, one should pass 'width=45'.

But note that the exact ratio for ASCII depends on the font. It is a little higher for Courier and Lucida Console. It ranges from about 1 (for Arabic) to 2 (for Tamil) for other languages. The first 10 Tamil characters are slightly wider than the 10 CJK characters, so counting each CJK character as two average Tamil characters is completely wrong.
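The arithmetic behind 'width=45' can be sketched as follows. The helper name equivalent_width and the sample text are hypothetical, and the 10:16 ratio is only the one observed above for Source Code Pro; another font would give a different number.

```python
import textwrap

def equivalent_width(ascii_columns, n_chars, n_ascii):
    """Character count that spans about `ascii_columns` ASCII columns, given
    that `n_chars` characters are as wide as `n_ascii` ASCII characters.
    (Hypothetical helper, rounds down.)"""
    return ascii_columns * n_chars // n_ascii

# 10 CJK characters are as wide as 16 Source Code Pro characters.
cjk_width = equivalent_width(72, n_chars=10, n_ascii=16)   # 72 * 10 // 16
print(cjk_width)

sample = "漢字かな" * 30   # illustrative pure-CJK text, 120 code points
lines = textwrap.wrap(sample, width=cjk_width)
print([len(line) for line in lines])
```

For pure fixed-pitch CJK text, this is all the adjustment needed: the caller picks the ratio for the font in use and passes the resulting character width.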

My conclusion: the proposal is unnecessary for pure CJK text; it is wrong in hard-coding a fix only for CJK; the CJK fix is wrong in hard-coding a particular ratio, in particular, one that is at the extreme end of the range of possibilities. Therefore, I think the open PR should be closed. I also think this issue should be closed in favor of #12499, which proposed to allow users to pass a transform function suitable for their particular use case. If that is implemented, and we decide to then add some sample functions, or rather, function factories, and to include one specifically for CJK*, then a new PR will be needed, and a new issue would be appropriate.

* A more generic function factory for text with characters of two width classes might have as inputs a condition to identify '2nd language characters' and their fixed or average width relative to the 'first' language.

History
Date	User	Action	Args
2018-07-08 22:52:05	terry.reedy	set	recipients: + terry.reedy, georg.brandl, vstinner, ezio.melotti, r.david.murray, methane, fgallaire, serhiy.storchaka, yan12125, mdk
2018-07-08 22:52:05	terry.reedy	set	messageid: <1531090325.6.0.56676864532.issue24665@psf.upfronthosting.co.za>
2018-07-08 22:52:05	terry.reedy	link	issue24665 messages
2018-07-08 22:52:05	terry.reedy	create