classification
Title: Add functions to get the width in columns of a character
Type: enhancement Stage: patch review
Components: Unicode Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Nicholas.Cole, benjamin.peterson, eric.araujo, ezio.melotti, haypo, inigoserna, lemburg, loewis, poq, serhiy.storchaka, tchrist, terry.reedy, zeha
Priority: normal Keywords: patch

Created on 2011-07-14 22:43 by haypo, last changed 2013-02-02 09:05 by terry.reedy.

Files
File name Uploaded Description Edit
locale_width.patch haypo, 2011-10-14 01:28 review
width.py loewis, 2012-03-10 14:33
Messages (29)
msg140376 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-07-14 22:43
Some characters take more than one column in a terminal, especially CJK (Chinese, Japanese, Korean) characters. If you use such characters in a terminal without taking into account the width in columns of each character, the text alignment can break. Issue #2382 is an example of this problem.

#2382 and #6755 have patches implementing such a function:
- unicode_width.patch of #2382 adds unicode.width() method
- ucs2w.c of #6755 creates a new ucs2w module with two functions: unichr2w() (width of a character) and ucs2w() (width of a string)

Use test_ucs2w.py of #6755 to test these new functions/methods.
msg140488 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-07-16 10:17
In the #2382 code, how is the Windows case supposed to work? Also, what about systems that don't have wcswidth? IOW, the patch appears to be incorrect.

I like the #6755 approach better, except that it shouldn't be using hard-coded tables, but instead integrate with Python's version of the UCD. In addition, it should use an accepted, published strategy for determining the width, preferably coming from the Unicode consortium.
msg141936 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-12 02:28
I can attest that being able to get the columns of a grapheme cluster is very important for printing, because you need this to do correct linebreaking.  There might be something you can steal from 

   http://search.cpan.org/perldoc?Unicode::GCString
   http://search.cpan.org/perldoc?Unicode::LineBreak

which implements UAX#14 on linebreaking and UAX#11 on East Asian widths.  

I use this in my own code to help format Unicode strings by columns or lines.  The right way would be to build this sort of knowledge into string.format(), but that is much harder, so an intermediary library module seems good enough for now.
msg145497 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-14 01:28
> There might be something you can steal from  ...

I don't think that Python should reinvent the wheel. We should just reuse wcswidth().

Here is a simple patch exposing wcswidth() function as locale.width().

Example:

>>> import locale
>>> text = '\u3042\u3044\u3046\u3048\u304a'
>>> len(text)
5
>>> locale.width(text)
10
>>> locale.width(' ')
1
>>> locale.width('\U0010abcd')
1
>>> locale.width('\uDC80')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
locale.Error: the string is not printable
>>> locale.width('\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
locale.Error: the string is not printable

I don't think that we need locale.width() on Windows because its console has already bigger issues with Unicode: see issue #1602. If you want to display correctly non-ASCII characters on Windows, just avoid the Windows console and use a graphical widget.
msg145498 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-14 01:32
Oh, unicode_width.patch of issue #2382 implements the width on Windows using:

WideCharToMultiByte(CP_ACP, 0, buf, len, NULL, 0, NULL, NULL);

It computes the length of byte string encoded to the ANSI code page. I don't know if it can be seen as the "width" of a character string in the console...
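
To make concrete what that call measures, here is a rough Python analogue. It assumes the ANSI code page is cp932 (Japanese); the helper name ansi_byte_length is made up for illustration and is not part of the patch.

```python
def ansi_byte_length(text, codepage='cp932'):
    """Number of bytes `text` occupies once encoded to a legacy code page.

    Unencodable characters become '?', roughly like the default character
    that WideCharToMultiByte() substitutes when called with flags 0.
    """
    return len(text.encode(codepage, errors='replace'))


print(ansi_byte_length('abc'))                             # 3
print(ansi_byte_length('\u3042\u3044\u3046\u3048\u304a'))  # 10
```

The byte count matches the column count here only because cp932 stores these wide characters in two bytes each; with another code page the number means something else.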
msg145523 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-14 14:56
I think the WideCharToMultibyte approach is just incorrect.

I'm -1 on using wcswidth, though. We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/ 
The outcomes of this function are these:
- F: full-width, width 2, compatibility character for a narrow char
- H: half-width, width 1, compatibility character for a narrow char
- W: wide, width 2
- Na: narrow, width 1
- A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
- N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1
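
As an illustration only, the mapping above could be sketched like this. The names eaw_width and eaw_str_width and the asian_context flag are hypothetical, and the sketch deliberately knows nothing about combining marks or control characters.

```python
import unicodedata

# Column widths for the unambiguous East Asian Width classes listed above.
_EAW_WIDTH = {'F': 2, 'H': 1, 'W': 2, 'Na': 1, 'N': 1}


def eaw_width(ch, asian_context=False):
    """Width of one character based only on its East Asian Width class."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw == 'A':
        # Ambiguous: the caller decides which context applies.
        return 2 if asian_context else 1
    return _EAW_WIDTH[eaw]


def eaw_str_width(text, asian_context=False):
    """Sum of the per-character widths of `text`."""
    return sum(eaw_width(ch, asian_context) for ch in text)
```

For example, eaw_str_width('\u3042\u3044\u3046\u3048\u304a') gives 10, while a Greek alpha ('\u03b1', class A) counts as 1 or 2 depending on asian_context.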
msg145535 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-14 15:33
> Martin v. Löwis <martin@v.loewis.de> added the comment:

> I think the WideCharToMultibyte approach is just incorrect.

> I'm -1 on using wcswidth, though. 

Like you, I too seriously question using wcswidth() for this at all:

    The wcswidth() function either shall return 0 (if pwcs points to a
    null wide-character code), or return the number of column positions
    to be occupied by the wide-character string pointed to by pwcs, or
    return -1 (if any of the first n wide-character codes in the wide-
    character string pointed to by pwcs is not a printable wide-
    character code).

I would be willing to bet (a small amount of) money it does not correctly
implement Unicode print widths, even though one would certainly *think* it
does according to this:

     The wcswidth() function determines the number of column positions
     required for the first n characters of pwcs, or until a null wide
     character (L'\0') is encountered.

There are a bunch of "interesting" cases I would want it tested against.

> We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/ 

> The outcomes of this function are these:
> - F: full-width, width 2, compatibility character for a narrow char
> - H: half-width, width 1, compatibility character for a narrow char
> - W: wide, width 2
> - Na: narrow, width 1
> - A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
> - N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1

Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
And EA=N cannot be considered 1, either.

For example, some of the Marks are EA=A and some are EA=N, yet how many
print columns they take varies.  It is usually 0, but can be 1 at the start
of the file/string or immediately after a linebreak sequence.  Then there
are things like the variation selectors which are never anything.

Now consider the many \pC code points, like 

    U+0009  CHARACTER TABULATION
    U+00AD  SOFT HYPHEN 
    U+200C  ZERO WIDTH NON-JOINER
    U+FEFF  ZERO WIDTH NO-BREAK SPACE
    U+2062  INVISIBLE TIMES

A TAB is its own problem but SHY we know is only width=1 immediately
before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
width=0.  So are the INVISIBLE * code points.
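
A quick way to look at these code points from Python is to print their general category next to their East Asian Width class. This is only an inspection aid; the helper name describe is made up, and U+0301 (a combining mark) is added to the list as an extra sample.

```python
import unicodedata

# Code points from the list above, plus U+0301 COMBINING ACUTE ACCENT.
SAMPLES = ['\t', '\u00ad', '\u200c', '\ufeff', '\u2062', '\u0301']


def describe(ch):
    """Return (code point, general category, East Asian Width class)."""
    return ('U+%04X' % ord(ch),
            unicodedata.category(ch),
            unicodedata.east_asian_width(ch))


for ch in SAMPLES:
    print(describe(ch))
```

The general category (Cc, Cf, Mn) is what distinguishes these characters; the East Asian Width class alone does not say that they occupy no column.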

Context:

Imagine you're trying to format a string so that it takes up exactly 20
columns: you need to know how many spaces to pad it with based on the
print width.  That is what #12568 needs
to do, and you have to do much more than East Asian Width properties.
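
A minimal sketch of that padding problem, assuming a pluggable width function. Both simple_width and pad_to_columns are hypothetical names, and simple_width is a crude estimate (it ignores tabs, soft hyphens and line-start effects), not a full implementation of print widths.

```python
import unicodedata


def simple_width(text):
    """Crude column estimate: 0 for marks/format chars, 2 for wide, else 1."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ('Mn', 'Me', 'Cf'):
            continue  # assumed to occupy no column of their own
        total += 2 if unicodedata.east_asian_width(ch) in ('F', 'W') else 1
    return total


def pad_to_columns(text, columns, width_func=simple_width):
    """Right-pad `text` with spaces until it fills `columns` print columns."""
    missing = columns - width_func(text)
    return text + ' ' * max(missing, 0)
```

With this, pad_to_columns('\u3042\u3044', 6) appends two spaces, whereas '\u3042\u3044'.ljust(6) appends four because it counts code points rather than columns.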

I really do think that what #12568 is asking for is to have the equivalent
of the Perl Unicode::GCString's columns() method, and that you aren't going
to be able to handle text alignment of Unicode with anything that is much
less of that.  After all, #12568's title is "Add functions to get the width
in columns of a character".  I would very much like to compare what
columns() thinks compared with what wcswidth() thinks.  I bet wcswidth() is
very simple-minded at best.

I may of course be wrong.

--tom
msg145748 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-10-17 18:37
> I'm -1 on using wcswidth, though.

When you write text into a console on Linux (e.g. displayed by gnome-terminal or konsole), I suppose that wcswidth() can be used to compute the width of a line. It would help to fix #2382.

Or do you think that wcswidth() gives the wrong result for this use case?
msg145778 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-18 07:49
>> I'm -1 on using wcswidth, though.
> 
> When you write text into a console on Linux (e.g. displayed by
> gnome-terminal or konsole), I suppose that wcswidth() can be used to
> compute the width of a line. It would help to fix #2382.
> 
> Or do you think that wcswidth() gives the wrong result for this use
> case?

No, I think that using it is not necessary. If you want to compute the
width of a line, use unicodedata.east_asian_width. And yes, wcswidth
may sometimes produce "incorrect" results (although it's probably
correct most of the time).
msg155223 - (view) Author: Nicholas Cole (Nicholas.Cole) Date: 2012-03-09 11:14
Could we have an update on the status of this? I ask because if 3.3 is going to (finally) fix unicode for curses, it would be really nice if it were possible to calculate the width of what's being displayed!  It looks as if there was never quite agreement on the proper API....
msg155236 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-09 15:05
Nicholas: I consider this issue fixed. There already *is* an API to compute the width of a character. Closing this as "works for me".
msg155307 - (view) Author: Nicholas Cole (Nicholas.Cole) Date: 2012-03-10 12:56
Martin: sorry to be completely dense, but I can't get this to work properly with the python3.3a1 build.  Could you post some example code?
msg155313 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-10 14:33
Please see the attached width.py for an example
msg155323 - (view) Author: (poq) Date: 2012-03-10 16:52
Martin, I think you meant to write "if w == 'A':".
Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

http://unicode.org/reports/tr11/ says:
"Ambiguous characters occur in East Asian legacy character sets as wide characters, but as narrow (i.e., normal-width) characters in non-East Asian usage."

So in practice applications can treat ambiguous characters as narrow by default, with a user setting to use legacy (wide) width.

As Tom pointed out there are also a bunch of zero width characters, and characters with special formatting like tab, soft hyphen, ...
msg155324 - (view) Author: Tom Christiansen (tchrist) Date: 2012-03-10 16:58
I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
which fully implements tr11.  It includes Unicode::GCString, a class
that has a columns() method to determine the print columns.  This is very
fancy in the case of Asian widths, but of course there are many other cases too.

If you'd like, I can show you a program that uses these, a rewrite of the
standard Unix fmt(1) filter that works properly on Unicode column widths.

--tom
msg155337 - (view) Author: Nicholas Cole (Nicholas.Cole) Date: 2012-03-10 18:24
Martin and poq: I think the sample code shows up a real problem. "Ambiguous" characters according to Unicode may be rendered by curses in different ways.

Don't we need a function that actually reports how curses is going to print a given string, rather than just reporting what the unicode standard says?
msg155342 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-10 18:53
> Martin, I think you meant to write "if w == 'A':".
> Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

That's precisely why I don't think this should be in the library, but
in the application. Application developers who need that also need
to concern themselves with the border cases, and decide on how
they need to resolve them.
msg155343 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-10 18:56
> I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
> which fully implements tr11.

Thanks for the pointer!

> If you'd like, I can show you a program that uses these, a rewrite of the
> standard Unix fmt(1) filter that works properly on Unicode column widths.

I believe there can't be any truly "proper" implementation, as you
can't be certain how the terminal will handle these itself. In any
case, anybody who is interested in contributing a patch should also
be capable of understanding the source of Unicode::LineBreak.
msg155344 - (view) Author: Tom Christiansen (tchrist) Date: 2012-03-10 18:57
>Martin v. Löwis <martin@v.loewis.de> added the comment:

>> Martin, I think you meant to write "if w == 'A':".
>> Some very common characters have ambiguous widths though (e.g. the Greek alphabet), so you can't just raise an error for them.

>That's precisely why I don't think this should be in the library, but
>in the application. Application developers who need that also need
>to concern themselves with the border cases, and decide on how
>they need to resolve them.

The column-width of a string is not an application issue.  It is
well-defined by Unicode.  Again, please see how we've done it in 
Perl, where tr11 is fully implemented.  The columns() method from 
Unicode::GCString always gives the right answer per the Standard for
any string, even what you are calling ambiguous ones.

This is not an applications issue -- at all.

--tom
msg155345 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-10 18:59
> Don't we need a function that actually reports how curses is going to
> print a given string, rather than just reporting what the unicode
> standard says?

That may be useful, but

a) this patch doesn't provide that, and
b) it may not actually be possible to implement such a change in a portable
    way as there may be no function exposed by the curses implementation
    that provides this information.

To put my closing this issue differently: I rejected the patch that
Victor initially submitted. If anybody wants to contribute a different
patch that uses a different strategy, please submit a new issue.
msg155346 - (view) Author: Tom Christiansen (tchrist) Date: 2012-03-10 19:03
>Martin v. Löwis <martin@v.loewis.de> added the comment:

>> I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
>> which fully implements tr11.

>Thanks for the pointer!

>> If you'd like, I can show you a program that uses these, a rewrite of the
>> standard Unix fmt(1) filter that works properly on Unicode column widths.

>I believe there can't be any truly "proper" implementation, as you
>can't be certain how the terminal will handle these itself. 

Hm.  I think we may not be talking about the same thing after all.

If we're talking about the Curses library, or something similar,
this is not the same.  I do not think Curses has support for 
combining characters, right to left text, wide characters, etc.

However, Unicode does, and defines the column width for those.

I have an illustration of what this looks like in the picture
in the very last recipe, #44, in 

    http://training.perl.com/scripts/perlunicook.html

That is what I have been talking about by print widths.  It's running
in a Mac terminal emulator, and unlike the HTML which grabs from too
many fonts, the terminal program does the right thing with the widths.

Are we talking about different things?

--tom
msg155361 - (view) Author: (poq) Date: 2012-03-11 00:32
It seems this is a bit of a minefield...

GNOME Terminal/libvte has an environment variable (VTE_CJK_WIDTH) to override the handling of ambiguous width characters. It bases its default on the locale (with the comment 'This is basically what GNU libc does').

urxvt just uses system wcwidth.

Xterm uses some voodoo to decide between system wcwidth and mk_wcwidth(_cjk): http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

I think the simplest solution is to just expose libc's wc(s)width. It is widely used and is most likely to match the behaviour of the terminal.
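
For experimenting without a patch, libc's function can be reached through ctypes. This is only a sketch under assumptions: it relies on a POSIX system where ctypes.CDLL(None) exposes libc symbols, the wrapper name libc_wcswidth is made up, and the result depends on the current LC_CTYPE locale (call locale.setlocale(locale.LC_CTYPE, '') first if needed).

```python
import ctypes


def libc_wcswidth(text):
    """Return libc's wcswidth() for `text`, or None if it cannot be called.

    A result of -1 means libc considers some character non-printable
    in the current locale.
    """
    try:
        libc = ctypes.CDLL(None)  # POSIX: symbols of the running process
        func = libc.wcswidth
    except (OSError, AttributeError, TypeError):
        return None  # e.g. Windows, or a libc without wcswidth
    func.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
    func.restype = ctypes.c_int
    return func(text, len(text))


print(libc_wcswidth('abc'))
```

Returning None instead of raising keeps the helper usable on platforms without wcswidth(), which is part of the availability concern raised in this issue.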

FWIW I wrote a little script to test the widths of all Unicode characters, and came up with the following logic to match libvte behaviour:

import unicodedata

def wcwidth(c, legacy_cjk=False):
    # tab, CR, LF, backspace, VT and FF move the cursor instead of occupying columns
    if c in u'\t\r\n\10\13\14': raise ValueError('character %r has no intrinsic width' % c)
    # NUL, ENQ, BEL, SO and SI are not displayed
    if c in u'\0\5\7\16\17': return 0
    if u'\u1160' <= c <= u'\u11ff': return 0 # hangul jamo
    if unicodedata.category(c) in ('Mn', 'Me', 'Cf') and c != u'\u00ad': return 0 # 00ad = soft hyphen
    eaw = unicodedata.east_asian_width(c)
    if eaw in ('F', 'W'): return 2
    if legacy_cjk and eaw == 'A': return 2
    return 1
msg155370 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-11 03:14
Tom: I don't think Unicode::GCString implements UAX#11 correctly (but this is really out of scope of this issue). In particular, it contains an ad-hoc decision to introduce the EA_Z east-asian width that UAX#11 doesn't talk about.

In most cases, it's probably reasonable to introduce this EA_Z feature. However, there are some significant deviations from UAX#11 here:
- combining characters are given EA_Z in sombok/data/custom.pl, even though UAX#11 assigns A or N. UAX#11 points out that the advance width depends on whether or not the terminal performs character combination or not. It's not clear whether Unicode::GCString aims for "strict" UAX#11, or "advance width".
- control characters are also given EA_Z, even though UAX#11 gives them EA_N. In this case, it's neither UAX#11 width nor advance width since control characters will have various effects on the terminal (in particular for the tab character)
msg155373 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-03-11 03:26
poq: I still remain opposed to exposing wcswidth, since it is just as incorrect as any of the other solutions that people have circulated. I could agree to it if it was called "wcswidth", making it clear that it does whatever the C library does, with whatever semantics the C library wants to give to it (and an availability that depends on whether the C library supports it or not). 

That would probably cover the curses use cases, except that it is not only incorrect with respect to Unicode, but also incorrect with respect to what the terminal may be doing. I guess users would use it anyway.

For Python's internal use, I could accept using the sombok algorithm. I wouldn't expose it, since it again would trick people into believing that it was correct in some sense. Perhaps calling it sombok_width might allow for exposing it.
msg155379 - (view) Author: (poq) Date: 2012-03-11 10:52
Martin,

I agree that wcswidth is incorrect with respect to Unicode. However I don't think that's relevant at all. Python should only try to match the behaviour of the terminal.

Since terminals do slightly different things, trying to match them exactly - in all cases, on all systems - is virtually impossible. But AFAICT wcwidth should match the terminal behaviour on nearly all modern systems, so it makes sense to expose it.
msg155382 - (view) Author: Nicholas Cole (Nicholas.Cole) Date: 2012-03-11 11:43
Poq: I agree.  Guessing from the Unicode standard is going to lead to users having to write some complicated code that people are going to have to reinvent over and over, and is not going to be accurate with respect to curses.  I'd favour exposing wcwidth.

Martin: I agree that there are going to be cases where it is not correct because the terminal does something strange, but what we need is something that gets as close as possible to what the terminal is likely to be doing (the Unicode standard itself is not really the issue for curses stuff).  So whether it is called wcwidth or wcswidth I don't really mind, but I think it would be useful.

The other alternative is to include one of the other ideas that have been mentioned in this thread as part of the library, I suppose, so that people don't have to keep reinventing the wheel for themselves.  

The one thing I really don't favour is shipping something that supports wide characters, but gives the users no way of guessing whether or not that is what they are printing, because that is surely going to break a lot of applications.
msg156337 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-03-19 12:59
> Martin: I agree that there are going to be cases where it is not
> correct because the terminal does something strange, but what we
> need is something that gets as close as possible to what the
> terminal is likely to be doing

Can't we expose wcswidth() as locale.strwidth() with a recipe explaining how to use unicodedata to get a "correct" result? At least until everyone implements Unicode correctly and Unicode stops evolving? :-)

--

For unicodedata, a function to get the width of a string would be more convenient than unicodedata.east_asian_width():

>>> import unicodedata
>>> unicodedata.east_asian_width('abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: need a single Unicode character as parameter
>>> 'abc'.ljust(unicodedata.east_asian_width(' '))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer

The function posted in msg155361 suggests that east_asian_width() alone is not enough to get the width in columns of a single character.
msg156348 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-03-19 16:00
Has anyone tested wcswidth on FreeBSD, old Solaris? With non-utf8 locales?
msg181149 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-02-02 09:05
In this part of width.py,
        w = unicodedata.east_asian_width(c)
        if c == 'A':
            # ambiguous
            raise ValueError("ambiguous character %x" % (ord(c)))

I presume that 'c' should be 'w'.
History
Date User Action Args
2013-02-02 09:05:19 terry.reedy set nosy: + terry.reedy; messages: + msg181149
2013-02-02 08:53:53 terry.reedy set type: enhancement; stage: patch review
2013-01-27 09:01:24 serhiy.storchaka link issue17048 dependencies
2012-03-19 16:00:42 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg156348
2012-03-19 12:59:06 haypo set messages: + msg156337
2012-03-16 17:16:13 eric.araujo set messages: - msg156059
2012-03-16 17:15:56 eric.araujo set nosy: + lemburg, benjamin.peterson, eric.araujo; messages: + msg156059
2012-03-11 11:43:10 Nicholas.Cole set messages: + msg155382
2012-03-11 10:52:20 poq set messages: + msg155379
2012-03-11 03:26:11 loewis set messages: + msg155373
2012-03-11 03:14:52 loewis set messages: + msg155370
2012-03-11 00:32:14 poq set messages: + msg155361
2012-03-10 19:03:32 tchrist set messages: + msg155346
2012-03-10 18:59:52 pitrou set status: closed -> open; resolution: works for me ->
2012-03-10 18:59:14 loewis set messages: + msg155345
2012-03-10 18:57:37 tchrist set messages: + msg155344
2012-03-10 18:56:43 loewis set messages: + msg155343
2012-03-10 18:53:56 loewis set messages: + msg155342
2012-03-10 18:24:03 Nicholas.Cole set messages: + msg155337
2012-03-10 16:58:29 tchrist set messages: + msg155324
2012-03-10 16:52:54 poq set nosy: + poq; messages: + msg155323
2012-03-10 14:33:00 loewis set files: + width.py; messages: + msg155313
2012-03-10 12:56:18 Nicholas.Cole set messages: + msg155307
2012-03-09 15:05:41 loewis set status: open -> closed; resolution: works for me; messages: + msg155236
2012-03-09 11:14:18 Nicholas.Cole set messages: + msg155223
2011-10-18 14:48:03 Arfrever set nosy: + Arfrever
2011-10-18 07:49:15 loewis set messages: + msg145778
2011-10-17 18:37:01 haypo set messages: + msg145748
2011-10-14 15:33:44 tchrist set messages: + msg145535
2011-10-14 14:56:58 loewis set messages: + msg145523
2011-10-14 01:32:57 haypo set messages: + msg145498
2011-10-14 01:28:34 haypo set files: + locale_width.patch; keywords: + patch; messages: + msg145497
2011-08-12 02:28:02 tchrist set nosy: + tchrist; messages: + msg141936
2011-07-21 06:46:43 ezio.melotti set nosy: + ezio.melotti
2011-07-16 10:17:24 loewis set nosy: + loewis; messages: + msg140488
2011-07-15 13:31:25 zeha set nosy: + zeha
2011-07-15 13:17:09 Nicholas.Cole set nosy: + Nicholas.Cole
2011-07-14 22:43:56 haypo create