Author georg.brandl
Recipients
Date 2006-08-17.15:03:33
Content
sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just writing out a byte string consisting of two
bytes, which your console displays as an accented a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.
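
The byte count can be checked without depending on what the console
sends. This is only a sketch, and it assumes Python 3 syntax (the
session above is Python 2), where a byte string is written with a b
prefix. The name raw is just illustrative.

```python
# Python 3 sketch: the two bytes a UTF-8 console sends for an accented a.
raw = b"\xc3\xa1"

byte_count = len(raw)        # len() on a byte string counts bytes
byte_values = list(raw)      # the individual byte values as integers

print(byte_count)
print(byte_values)
```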

>>> print "á".upper()
á

str.upper() works on the individual bytes and doesn't take
locale into account, so the two-byte accented a has no
uppercase version defined.
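
To see the byte-wise behaviour in isolation, here is a small sketch. It
assumes Python 3, where the byte string type is called bytes and its
upper() only maps the ASCII letters a-z. The sample word is made up for
illustration.

```python
# Python 3 sketch: upper() on a byte string only touches ASCII letters.
raw = "\u00e1".encode("utf-8")        # accented a as UTF-8: b'\xc3\xa1'
upper_raw = raw.upper()               # neither byte is an ASCII letter

mixed = b"caf\xc3\xa1".upper()        # ASCII part changes, the rest does not

print(upper_raw == raw)
print(mixed)
```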

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string.
Since return values from functions are printed by the
interactive interpreter using repr() first, you get this
representation (which you could also get from
"print repr('á')").
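
The escaped form comes from repr(), not from the data itself. A rough
sketch, assuming Python 3, where the repr of a byte string carries an
extra b prefix compared to the Python 2 output above:

```python
# Python 3 sketch: repr() shows non-ASCII bytes as \x escapes.
raw = b"\xc3\xa1"

shown = repr(raw)    # what the interactive interpreter would echo
print(shown)
```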

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and
properly decodes the two input bytes into a unicode string of
one character. On printout, it encodes it as UTF-8
again, which your terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte UTF-8 sequence is converted to a single
unicode character, the length of this unicode string is 1.
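
The conversion can also be done by hand. A minimal sketch, assuming
Python 3 (where the unicode type is simply str) and assuming the
console really sends UTF-8:

```python
# Python 3 sketch: decoding two UTF-8 bytes yields one character.
raw = b"\xc3\xa1"
text = raw.decode("utf-8")   # explicit version of what the interpreter does

print(len(raw))              # bytes before decoding
print(len(text))             # characters after decoding
print(hex(ord(text)))        # code point of the single character
```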

>>> print u"á".upper()
Á

Unicode strings use the comprehensive capitalization tables
that are available for unicode, so upper() works here.
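
As a quick check of the character-level mapping, here is a sketch
assuming Python 3, where every string literal is a unicode string:

```python
# Python 3 sketch: upper() on a unicode string maps U+00E1 to U+00C1.
text = "\u00e1"                           # accented a
upper_text = text.upper()

print(hex(ord(upper_text)))               # code point of the result
print(upper_text.encode("utf-8"))         # its two UTF-8 bytes
```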

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte
string. If you don't specify an encoding, the default
encoding is "ascii", which can't encode the accented a. Use
u"á".encode("utf-8") instead.
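
The difference between the two codecs can be sketched as follows. This
assumes Python 3, where str() no longer encodes implicitly, so the
"ascii" codec is named explicitly to reproduce the error:

```python
# Python 3 sketch: "ascii" cannot encode U+00E1, "utf-8" can.
text = "\u00e1"              # accented a

error = None
try:
    text.encode("ascii")     # same codec Python 2's str() falls back to
except UnicodeEncodeError as exc:
    error = exc

encoded = text.encode("utf-8")   # explicit encoding gives the two bytes

print(error)
print(encoded)
```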
History
Date User Action Args
2007-08-23 14:41:34 admin link issue1528802 messages
2007-08-23 14:41:34 admin create