Message 145535 - Python tracker


Author	tchrist
Recipients	Nicholas.Cole, ezio.melotti, inigoserna, loewis, tchrist, vstinner, zeha
Date	2011-10-14.15:33:43
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<634.1318606411@chthon>
In-reply-to	<1318604219.18.0.9022990154.issue12568@psf.upfronthosting.co.za>

Content

> Martin v. Löwis <martin@v.loewis.de> added the comment:

> I think the WideCharToMultibyte approach is just incorrect.

> I'm -1 on using wcswidth, though. 

Like you, I too seriously question using wcswidth() for this at all:

    The wcswidth() function either shall return 0 (if pwcs points to a
    null wide-character code), or return the number of column positions
    to be occupied by the wide-character string pointed to by pwcs, or
    return -1 (if any of the first n wide-character codes in the wide-
    character string pointed to by pwcs is not a printable wide-
    character code).

I would be willing to bet (a small amount of) money it does not correctly
implement Unicode print widths, even though one would certainly *think* it
does according to this:

     The wcswidth() function determines the number of column positions
     required for the first n characters of pwcs, or until a null wide
     character (L'\0') is encountered.

There are a bunch of "interesting" cases I would want it tested against.

> We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/ 

> The outcomes of this function are these:
> - F: full-width, width 2, compatibility character for a narrow char
> - H: half-width, width 1, compatibility character for a narrow char
> - W: wide, width 2
> - Na: narrow, width 1
> - A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
> - N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1

Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
And EA=N cannot be considered 1, either.

For example, some of the Marks are EA=A and some are EA=N, yet how many
print columns they take varies.  It is usually 0, but can be 1 at the start
of the file/string or immediately after a linebreak sequence.  Then there
are things like the variation selectors, which never take up any columns.
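
A minimal sketch of the point above, using only the standard unicodedata
module.  The EAW_COLUMNS table and the naive_width() name are hypothetical;
they just apply the quoted mapping literally, treating both A and N as
width 1, to show where it overcounts combining marks.

```python
import unicodedata

# Hypothetical table: the quoted East_Asian_Width mapping, taken literally,
# with A (ambiguous) and N (neutral) both treated as one column.
EAW_COLUMNS = {"F": 2, "W": 2, "H": 1, "Na": 1, "A": 1, "N": 1}


def naive_width(text):
    """Sum per-code-point widths using East_Asian_Width alone."""
    return sum(EAW_COLUMNS[unicodedata.east_asian_width(ch)] for ch in text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible column on a
# typical terminal, but the mark is EA=A, so the naive sum counts it.
accented = "e\u0301"

for ch in accented:
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),
          unicodedata.east_asian_width(ch))

print("naive width:", naive_width(accented))
```

The mark's general category (Mn) carries the information that the
East_Asian_Width value alone does not.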

Now consider the many \pC code points, like 

    U+0009  CHARACTER TABULATION
    U+00AD  SOFT HYPHEN 
    U+200C  ZERO WIDTH NON-JOINER
    U+FEFF  ZERO WIDTH NO-BREAK SPACE
    U+2062  INVISIBLE TIMES

A TAB is its own problem but SHY we know is only width=1 immediately
before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
width=0.  So are the INVISIBLE * code points.
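
As a rough illustration, the code points listed above can be inspected
with unicodedata.  The describe() helper is a hypothetical name introduced
here; it only reports the general category and East_Asian_Width value of
each code point and makes no claim about how any terminal renders them.

```python
import unicodedata

# The five \pC code points listed above.
CODE_POINTS = ["\u0009", "\u00ad", "\u200c", "\ufeff", "\u2062"]


def describe(ch):
    """Return (code point, name, general category, East_Asian_Width)."""
    return (
        f"U+{ord(ch):04X}",
        # Control characters such as TAB have no formal name in
        # unicodedata, so supply a default instead of raising ValueError.
        unicodedata.name(ch, "<no name>"),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in CODE_POINTS:
    print(*describe(ch), sep="\t")
```

Every one of these has a general category beginning with C, and none of
their East_Asian_Width values says anything about a width of 0.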

Context:

Imagine you're trying to format a string so that it takes up exactly 20
columns: you need to know how many spaces to pad it with based on the
print width.  That is what #12568 needs to do, and you have to do much
more than look at East Asian Width properties.
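
To make the padding problem concrete, here is a sketch under stated
assumptions.  approx_columns() and pad_to() are hypothetical names, not an
existing Python API, and the rule used (skip Mn, Me and Cf code points,
count F and W as 2, everything else as 1) is only a crude approximation.
It ignores the context-sensitive cases mentioned above, such as SHY before
a linebreak or a mark at the start of a string, and it does nothing
sensible with TAB.

```python
import unicodedata

# Assumption: code points in these general categories occupy no column.
ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")


def approx_columns(text):
    """Crude column count: not equivalent to a full columns() method."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # nonspacing/enclosing marks and format characters
        if unicodedata.east_asian_width(ch) in ("F", "W"):
            total += 2  # fullwidth and wide characters
        else:
            total += 1  # everything else, including controls like TAB
    return total


def pad_to(text, columns=20):
    """Right-pad text with spaces based on approx_columns(), not len()."""
    return text + " " * max(0, columns - approx_columns(text))


for sample in ["abc", "e\u0301", "\u65e5\u672c\u8a9e", "a\u200cb"]:
    print(repr(pad_to(sample)), len(sample), approx_columns(sample))
```

Padding with len() would put too few spaces after the accented "e" and too
many after the three wide characters; even this approximation shows that
East Asian Width is only one of the inputs needed.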

I really do think that what #12568 is asking for is the equivalent of the
Perl Unicode::GCString's columns() method, and that you aren't going to be
able to handle text alignment of Unicode with anything much less than
that.  After all, #12568's title is "Add functions to get the width in
columns of a character".  I would very much like to compare what columns()
thinks with what wcswidth() thinks.  I bet wcswidth() is very
simple-minded at best.

I may of course be wrong.

--tom

History
Date	User	Action	Args
2011-10-14 15:33:45	tchrist	set	recipients: + tchrist, loewis, vstinner, ezio.melotti, inigoserna, zeha, Nicholas.Cole
2011-10-14 15:33:44	tchrist	link	issue12568 messages
2011-10-14 15:33:43	tchrist	create