classification
Title: argparse calculates string widths incorrectly
Type:
Stage: resolved
Components: Library (Lib)
Versions: Python 3.7, Python 3.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Vanessa McHale, steven.daprano, terry.reedy
Priority: normal
Keywords:

Created on 2017-06-09 05:56 by Vanessa McHale, last changed 2017-06-09 21:14 by terry.reedy. This issue is now closed.

Messages (4)
msg295490 - Author: Vanessa McHale (Vanessa McHale) Date: 2017-06-09 05:56
Currently, argparse computes string widths from the number of characters (code points). This does not work in general, because some scripts use combining marks such as vowel markers: 'བོད', for instance, is three characters, but its display width should be two.

I have an example repo here: https://github.com/vmchale/argparse-min-example
msg295513 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-06-09 11:03
I don't really understand your example code. What result did you expect? The output shown on GitHub seems correct to me:

optional arguments:
  -h, --help            show this help message and exit
  --language1 XXXXXXXXXX
                        Lanugage for output
  --language2 LANGUAGE  Lanugage for output


I've substituted "X" for the "missing characters" that show up, as I don't have a Tibetan font installed.

This is a more complicated "bug" (feature?) than it might seem, and I don't think it is really an argparse issue so much as a string issue. The length of a string is the number of code points in it, without trying to distinguish zero-width code points and combining characters from the rest.
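
As a small illustration of that point (a sketch only, using the three-code-point string 'བོད' from the original report, written with escapes so it does not depend on fonts), len() counts the vowel sign as a full code point even though it is a combining mark:

```python
import unicodedata

# The string 'བོད' from the report, spelled out as escapes.
tibetan = "\u0f56\u0f7c\u0f51"

# len() reports the number of code points, combining marks included.
code_points = len(tibetan)

# Show each code point with its general category and Unicode name.
for ch in tibetan:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Count only code points with canonical combining class 0 (non-combining).
base_count = sum(1 for ch in tibetan if unicodedata.combining(ch) == 0)
print(code_points, base_count)
```

The middle code point is a nonspacing mark, so the count of base characters is smaller than the length Python reports.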

I don't believe that argparse has any way of knowing how the string will be displayed. It could be displayed as:

- a series of 10 "missing character" square glyphs;
- the correct glyphs, but still 10 columns wide (if the font has glyphs for Tibetan but does not render the vowel markers correctly);
- or the text rendered properly, according to the rules for Tibetan, requiring fewer than 10 (I guess) columns.

I believe that, unfortunately, the only way to distinguish those three scenarios would be to render the text in a GUI framework with a rich text widget capable of measuring the *width* of text in pixels.

Because argparse works in a console, it is limited to the typefaces the console supports and cannot get the pixel width. I think the only safe way to proceed is to count code points (i.e. the length as reported by Python strings) and assume each code point requires one column. That way you can be reasonably confident that the string won't be any more than that number of columns wide.

(Even that might be wrong if the string includes East Asian full-width code points, which may take two columns each.)
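
For comparison, here is a rough sketch of what a column estimate might look like if one did try to account for both cases. This is not what argparse does; approx_columns is a hypothetical helper, and the rule it applies (zero columns for nonspacing, enclosing and format characters, two for East Asian wide or full-width ones, one otherwise) is only a heuristic that a given terminal and font may not follow:

```python
import unicodedata


def approx_columns(text):
    """Estimate terminal columns for text (heuristic, hypothetical helper)."""
    width = 0
    for ch in text:
        # Nonspacing marks, enclosing marks and format characters are
        # assumed to take no column of their own.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide and full-width East Asian characters are assumed to take two.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


tibetan = "\u0f56\u0f7c\u0f51"   # 3 code points, one of them a vowel sign
cjk = "\u65e5\u672c\u8a9e"       # 3 code points, all East Asian wide
print(len(tibetan), approx_columns(tibetan))
print(len(cjk), approx_columns(cjk))
```

The estimate is lower than len() for the Tibetan string and higher for the CJK one, which is why counting code points is neither a safe upper bound nor a safe lower bound in general.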

I don't think there is any good solution here, but I think the status quo might be the least worst. If argparse assumes that the vowel markers are zero-width, it will format the output correctly:

optional arguments:
  -h, --help            show this help message and exit
  --language1 XXXXXXX   Lanugage for output
  --language2 LANGUAGE  Lanugage for output


but only for those who have the correct Tibetan typeface installed. Everyone else will see:

optional arguments:
  -h, --help            show this help message and exit
  --language1 XXXXXXXXXX   Lanugage for output
  --language2 LANGUAGE  Lanugage for output


(By the way, I'm guessing what the output might be -- I don't know Tibetan and don't know how many columns the correctly displayed string will take.)

Vanessa, if my analysis is wrong in any way, or if you can think of a patch to argparse that will solve this issue, please tell us.

Otherwise, I think this has to be treated as "won't fix".
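
For anyone who wants to observe the current behaviour without cloning the repo, a minimal sketch follows. The option names, help strings and the short metavar are made up for illustration (the metavar in the original repo is longer); the sketch only checks that the help text starts at the same string index on both lines, which is what one would expect if the padding is computed from len():

```python
import argparse

# Hypothetical metavar: 'བོད' as escapes (3 code points, 1 of them combining).
TIBETAN = "\u0f56\u0f7c\u0f51"

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--language1", metavar=TIBETAN, help="first option")
parser.add_argument("--language2", metavar="ABC", help="second option")

help_lines = parser.format_help().splitlines()

# Pick the option lines, skipping the usage line which also names the options.
line1 = next(l for l in help_lines if l.lstrip().startswith("--language1"))
line2 = next(l for l in help_lines if l.lstrip().startswith("--language2"))

# Index (in code points) at which the help text begins on each line.
col1 = line1.index("first option")
col2 = line2.index("second option")
print(col1, col2)
```

If the two indices match, the lines are aligned in code points; on a terminal that renders the vowel sign as a zero-width mark, the first line would then appear one column short.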
msg295514 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-06-09 11:07
By the way, perhaps a simpler demonstration which is more likely to render correctly on most people's systems would be to use Latin-1 combining characters:

py> import unicodedata
py> s1 = 'àéîõü'
py> s2 = unicodedata.normalize('NFD', s1) # decompose into combining chars
py> s1, s2
('àéîõü', 'àéîõü')
py> assert len(s1) == 5 and len(s2) == 10
py>
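
A related sketch, under the assumption that one might be tempted to normalize before measuring: NFC recomposes the Latin example back to five code points, but it does not shorten the Tibetan string from the report, since that base letter and vowel sign have no precomposed form. Normalization therefore does not look like a general fix:

```python
import unicodedata

s1 = "\u00e0\u00e9\u00ee\u00f5\u00fc"        # 'àéîõü', precomposed
s2 = unicodedata.normalize("NFD", s1)        # decomposed: base + combining mark
recomposed = unicodedata.normalize("NFC", s2)

tibetan = "\u0f56\u0f7c\u0f51"               # 'བོད' from the report
tibetan_nfc = unicodedata.normalize("NFC", tibetan)

print(len(s1), len(s2), len(recomposed))
print(len(tibetan), len(tibetan_nfc))
```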
msg295581 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-06-09 21:14
For future reference, small code examples should be included in the message or uploaded as a .py file.

A unicode string is a sequence of codepoints.  The length is defined as the number of codepoints.  I cannot see that your example demonstrates a bug in argparse.
History
Date                 User            Action  Args
2017-06-09 21:14:07  terry.reedy     set     status: open -> closed; nosy: + terry.reedy; messages: + msg295581; resolution: not a bug; stage: resolved
2017-06-09 11:07:23  steven.daprano  set     messages: + msg295514
2017-06-09 11:03:01  steven.daprano  set     nosy: + steven.daprano; messages: + msg295513
2017-06-09 05:56:43  Vanessa McHale  create