Issue1067294
Created on 2004-11-16 11:58 by edschofield, last changed 2004-11-16 12:12 by lemburg.
| File name | Uploaded | Description |
| python unicode char length bug.txt | edschofield, 2004-11-16 11:58 | Code example exposing a bug in determining the length of utf-8 encoded strings |

msg23167
Author: Ed Schofield (edschofield)
Date: 2004-11-16 11:58

Python 2.3.4 and Python 2.4b2:
print "x = %-15s" %(x.encode('utf-8'),) + " more text"
gives an incorrect number of spaces when x is a
character whose UTF-8 encoding takes two bytes, like à.
There is no such problem if x itself is formatted rather
than the result of x.encode(...).
The reason seems to be this: if x = u'\u00e0' (the
character à) and s = x.encode('utf-8'), then len(s) = 2,
so %-15s pads s with one space too few, which breaks the
alignment of the print command above on a UTF-8 terminal.
A slightly longer example is attached.
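
The length mismatch described above can be reproduced in a few lines. This is only a sketch in Python 3 syntax (the report concerns Python 2.3.4 and 2.4b2, where the encoded value is a plain str), and it assumes Python 3.5 or later so that %-formatting works on bytes. The variable names text_padded and bytes_padded are illustrative and not taken from the attached file.

```python
# Sketch: once text is encoded, padding is counted in bytes, not characters.
x = "\u00e0"                  # the character à
s = x.encode("utf-8")         # b'\xc3\xa0', two bytes

text_padded = "%-15s" % x     # padded to 15 characters
bytes_padded = b"%-15s" % s   # padded to 15 bytes (needs Python 3.5+)

print(len(x), len(s))                                        # 1 2
print(len(text_padded), len(bytes_padded.decode("utf-8")))   # 15 14
```

On a UTF-8 terminal the byte-padded field therefore shows up one column narrower than the text-padded one.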

msg23168
Author: Marc-Andre Lemburg (lemburg)
Date: 2004-11-16 12:12

As you already noted, the problem is that you are mixing Unicode
and 8-bit strings in a way which is bound to fail.
You should use:
print (u"x = %-15s" %x + u" more text").encode('utf-8')
i.e. stay with Unicode as long as you can and call encode only
when doing I/O, as the last step before passing the string
to an 8-bit stream.
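
A rough Python 3 rendering of the same advice, for illustration only: do all padding while the value is still text, then encode the finished line once. The helper name format_line is hypothetical and does not appear in the original report or reply.

```python
def format_line(x: str) -> bytes:
    """Hypothetical helper: pad while still text, encode as the last step."""
    line = "x = %-15s" % x + " more text"   # padding counts characters here
    return line.encode("utf-8")             # single encode at the I/O boundary


encoded = format_line("\u00e0")
print(len(encoded.decode("utf-8")))  # 29 characters, field is a full 15 wide
print(len(encoded))                  # 30 bytes, since à takes two bytes
```

The byte count differs from the character count, but the padding was computed before encoding, so the columns line up on a UTF-8 terminal.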

| Date | User | Action | Args |
| 2004-11-16 11:58:42 | edschofield | create | |