Message145535
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> I think the WideCharToMultibyte approach is just incorrect.
> I'm -1 on using wcswidth, though.
Like you, I too seriously question using wcswidth() for this at all:
The wcswidth() function either shall return 0 (if pwcs points to a
null wide-character code), or return the number of column positions
to be occupied by the wide-character string pointed to by pwcs, or
return -1 (if any of the first n wide-character codes in the wide-
character string pointed to by pwcs is not a printable wide-
character code).
I would be willing to bet (a small amount of) money it does not correctly
implement Unicode print widths, even though one would certainly *think* it
does according to this:
The wcswidth() function determines the number of column positions
required for the first n characters of pwcs, or until a null wide
character (L'\0') is encountered.
There are a bunch of "interesting" cases I would want it tested against.
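One way to try such cases is to call the C library's wcswidth() directly through ctypes. This is only a sketch: it assumes a POSIX system where the C library is already loaded into the Python process, the helper name libc_wcswidth and the CASES list are made up for illustration, and the numbers it prints depend on the platform and the active locale.

```python
import ctypes
import locale

# wcswidth() results depend on the C locale; try to adopt the user's locale.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # keep whatever locale is already active

# Assumption: POSIX, so CDLL(None) exposes the symbols of the loaded C library.
_libc = ctypes.CDLL(None)
_libc.wcswidth.argtypes = [ctypes.c_wchar_p, ctypes.c_size_t]
_libc.wcswidth.restype = ctypes.c_int


def libc_wcswidth(text):
    """Return the C library's wcswidth() for text (-1 if it sees a non-printable)."""
    return _libc.wcswidth(text, len(text))


# Hypothetical test strings: plain ASCII, a combining mark, CJK, SHY, ZWNJ, TAB.
CASES = ["abc", "e\u0301", "\u65e5\u672c", "a\u00adb", "a\u200cb", "a\tb"]

for s in CASES:
    print(ascii(s), libc_wcswidth(s))
```

The output for everything other than plain ASCII is exactly what is in question here, so no expected values are given.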
> We already have unicodedata.east_asian_width, which implements http://unicode.org/reports/tr11/
> The outcomes of this function are these:
> - F: full-width, width 2, compatibility character for a narrow char
> - H: half-width, width 1, compatibility character for a narrow char
> - W: wide, width 2
> - Na: narrow, width 1
> - A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
> - N: neutral; not used in Asian text, so has no width. Practically, width can be considered as 1
Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
And EA=N cannot be considered width 1, either.
For example, some of the Marks are EA=A and some are EA=N, yet how many
print columns they take varies. It is usually 0, but can be 1 at the start
of the file/string or immediately after a linebreak sequence. Then there
are things like the variation selectors, which never take up any columns.
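To make the point about Marks concrete, here is a small sketch that prints what unicodedata reports for three sample code points. The choice of samples and the helper name describe are mine, purely for illustration.

```python
import unicodedata

# Sample marks: a Latin combining accent, a Hebrew accent, a variation selector.
MARKS = ["\u0301", "\u0591", "\ufe0f"]


def describe(ch):
    """Return (name, general category, East Asian Width) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in MARKS:
    print("U+%04X" % ord(ch), *describe(ch))
```

All three are nonspacing marks (Mn), yet U+0301 is EA=A while U+0591 is EA=N, so the East Asian Width value alone says nothing about whether a code point takes up a column.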
Now consider the many \pC code points, like
U+0009 CHARACTER TABULATION
U+00AD SOFT HYPHEN
U+200C ZERO WIDTH NON-JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE
U+2062 INVISIBLE TIMES
A TAB is its own problem, but we know SHY is width=1 only immediately
before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
width=0. So are the INVISIBLE * code points.
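As a rough illustration, the sketch below applies the simple mapping from the quoted list (F and W are 2, everything else is 1, taking A in its non-Asian sense) to those five code points. The function name naive_width is hypothetical, and the mapping is the one being criticized, not a recommendation.

```python
import unicodedata

# The \pC code points listed above.
CODE_POINTS = (0x0009, 0x00AD, 0x200C, 0xFEFF, 0x2062)


def naive_width(ch):
    """Width derived from East Asian Width alone: 2 for F/W, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1


for cp in CODE_POINTS:
    ch = chr(cp)
    print(
        "U+%04X" % cp,
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
        naive_width(ch),
    )
```

Every one of these comes out as width 1 under that mapping, including the ones that should certainly be width 0.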
Context:
Imagine you're trying to format a string so that it takes up exactly 20
columns: you need to know how many spaces to pad it with based on the
print width. That is what #12568 needs to do, and for that you have to
do much more than look at East Asian Width properties.
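Here is a minimal sketch of that padding problem. The names approx_columns and pad_to are hypothetical, and the width rule is a simplifying assumption of mine: Mn, Me and Cf count as 0 columns, F and W as 2, and everything else as 1. It ignores the context-dependent cases described above (a mark at the start of a line, SHY before a linebreak, TAB), so it is an approximation and nowhere near what Unicode::GCString's columns() does.

```python
import unicodedata


def approx_columns(text):
    """Rough print width of text; context-dependent cases are ignored."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # assumption: marks and format characters take no column
        if unicodedata.east_asian_width(ch) in ("F", "W"):
            width += 2  # full-width and wide characters
        else:
            width += 1  # everything else, including EA=A and EA=N
    return width


def pad_to(text, columns):
    """Right-pad text with spaces until it occupies at least `columns` columns."""
    return text + " " * max(0, columns - approx_columns(text))


for s in ["abc", "e\u0301", "\u65e5\u672c\u8a9e", "a\u200cb"]:
    print("[%s]" % pad_to(s, 20))
```

Note that padding by len() instead would come out one space short for "e\u0301" and three columns too wide for the three CJK characters.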
I really do think that what #12568 is asking for is the equivalent of the
Perl Unicode::GCString's columns() method, and that you aren't going to be
able to handle text alignment of Unicode with anything much less than
that. After all, #12568's title is "Add functions to get the width
in columns of a character". I would very much like to compare what
columns() thinks with what wcswidth() thinks. I bet wcswidth() is
very simple-minded at best.
I may of course be wrong.
--tom

Date | User | Action | Args
2011-10-14 15:33:45 | tchrist | set | recipients: + tchrist, loewis, vstinner, ezio.melotti, inigoserna, zeha, Nicholas.Cole
2011-10-14 15:33:44 | tchrist | link | issue12568 messages
2011-10-14 15:33:43 | tchrist | create |