Message 29287 - Python tracker

Author	georg.brandl
Date	2006-08-17.15:03:33

Content

sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just printing out a byte string consisting of two
bytes, which your console displays as accent-a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.
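To make the byte count concrete, here is a small sketch. It assumes Python 3, where bytes plays the role of the Python 2 byte string discussed in this message; the variable names are only illustrative.

```python
# Assumption: Python 3, where bytes corresponds to the Python 2 str type.
text = "\u00e1"                   # LATIN SMALL LETTER A WITH ACUTE
encoded = text.encode("utf-8")    # the bytes a UTF-8 console would send
print(len(encoded))               # 2
print(list(encoded))              # [195, 161], i.e. 0xC3 0xA1
```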

>>> print "á".upper()
á

str.upper() works on individual bytes and, in the default "C"
locale, only maps ASCII letters, so the two bytes of the
accented a are left unchanged.
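The same effect can be reproduced roughly as follows. This is a sketch assuming Python 3, whose bytes.upper() only converts ASCII letters; it is not the Python 2 session shown above, and the sample string is made up for illustration.

```python
# Assumption: Python 3; bytes.upper() only touches ASCII letters a-z.
encoded = "\u00e1bc".encode("utf-8")   # accented a followed by "bc"
upper_bytes = encoded.upper()
print(upper_bytes)                      # b'\xc3\xa1BC'
```

The two bytes of the accented a pass through untouched, while the ASCII letters are uppercased.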

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string.
Since return values from functions are printed by the
interactive interpreter using repr() first, you get this
representation (which you could also get from
"print repr('á')").
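A minimal sketch of the repr() behaviour, again assuming Python 3 (so the result carries a b prefix that a Python 2 byte string would not have):

```python
# Assumption: Python 3; repr() of bytes escapes every non-ASCII byte.
encoded = "\u00e1".encode("utf-8")
shown = repr(encoded)
print(shown)    # b'\xc3\xa1'
```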

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and
properly translates the byte string to a unicode string of
one character. On printout, it converts it to a UTF-8 string
again, which your terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte UTF-8 sequence is converted to a single
unicode character, the length of this unicode string is 1.
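The decode and re-encode round trip described above can be sketched like this. It assumes Python 3 and hard-codes "utf-8" as the terminal encoding, which the interpreter would normally determine itself.

```python
# Assumption: Python 3, with "utf-8" standing in for the terminal encoding.
raw = b"\xc3\xa1"                      # two bytes as typed on a UTF-8 console
decoded = raw.decode("utf-8")          # one unicode character
print(len(raw), len(decoded))          # 2 1
round_trip = decoded.encode("utf-8")   # what gets written back on printout
```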

>>> print u"á".upper()
Á

There are comprehensive capitalization tables available for
unicode.
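As a quick check of the unicode case mapping, here is a sketch assuming Python 3, where every str is a unicode string:

```python
import unicodedata

# Assumption: Python 3, where str is always a unicode string.
lower = "\u00e1"
upper = lower.upper()
print(unicodedata.name(upper))   # LATIN CAPITAL LETTER A WITH ACUTE
```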

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte
string. If you don't specify an encoding, the default
encoding is "ascii", which can't encode the accented a. Use
u"á".encode("utf-8") instead.

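The difference between the two encodings can be sketched as below. This assumes Python 3, where str() on a unicode string no longer encodes anything, so the ASCII failure is reproduced with an explicit encode("ascii") call instead.

```python
# Assumption: Python 3; the implicit ASCII encoding of Python 2's str()
# is emulated here with an explicit encode("ascii").
text = "\u00e1"
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)                      # explains that the ascii codec failed

utf8_bytes = text.encode("utf-8")   # naming the encoding explicitly works
print(utf8_bytes)                   # b'\xc3\xa1'
```
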
History
Date	User	Action	Args
2007-08-23 14:41:34	admin	link	issue1528802 messages
2007-08-23 14:41:34	admin	create