classification
Title: Unicode arguments in str.format()
Type: behavior Stage: test needed
Components: Interpreter Core Versions: Python 2.7, Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: eric.smith Nosy List: doerwalter, eric.smith, ezio.melotti, flox, haypo, pablomouzo
Priority: high Keywords: patch

Created on 2009-11-10 13:57 by doerwalter, last changed 2010-09-09 19:04 by flox.

Files
File name Uploaded Description
issue7300-trunk.patch haypo, 2010-03-09 23:44
Messages (5)
msg95114 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-11-10 13:57
str.format() doesn't handle unicode arguments:

Python 2.6.4 (r264:75706, Oct 27 2009, 15:18:04) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '{0}'.format(u'\u3042')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in
position 0: ordinal not in range(128)

Unicode arguments should be handled the same way the % operator handles
them: by promoting the format string to unicode:

>>> '%s' % u'\u3042'
u'\u3042'
msg100769 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-03-09 22:57
PyString_Format() uses a "goto unicode;" if a '%c' or '%s' argument is unicode. The unicode label converts the partially formatted result (a byte string) to unicode and uses PyUnicode_Format() to finish the formatting.
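
The mid-stream switch described above can be modelled roughly as follows. This is only a sketch, not the C implementation: it handles nothing but '%s', and it is written so that it runs on Python 3, with bytes standing in for the 2.x str type and str standing in for unicode. The function name percent_format is made up for the illustration.

```python
def percent_format(fmt, args):
    """Toy model of '%s'-only formatting with a switch to text.

    Assumption: bytes plays the role of the 2.x byte string and
    str the role of the 2.x unicode type.
    """
    pieces = fmt.split(b"%s")
    if len(pieces) != len(args) + 1:
        raise TypeError("wrong number of arguments for format string")

    result = pieces[0]
    for arg, literal in zip(args, pieces[1:]):
        if isinstance(arg, str) and isinstance(result, bytes):
            # The "goto unicode" step: promote what has been built so far.
            result = result.decode("ascii")
        if isinstance(result, str):
            # Text mode: everything still to come is decoded as ASCII too.
            if isinstance(arg, bytes):
                arg = arg.decode("ascii")
            result += arg + literal.decode("ascii")
        else:
            # Still in byte mode.
            result += arg + literal
    return result


print(repr(percent_format(b"%s and %s", (b"bytes", "\u3042"))))
```

This works for % because arguments are consumed strictly left to right, so the partial result and the remaining format can be promoted independently.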

I don't think the same algorithm (converting the partial result to unicode) can be applied here, because it would require rewriting the format string: arguments can be used twice or more, and in any order.

Example: "{0} {1}".format("bytes", u"unicode") => the switch to unicode occurs at result="bytes ", format=" {1}", arguments=(u"unicode"). Converting "bytes " to unicode is easy, but the format has to be rewritten as " {0}" or something else.
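
To make the ordering point concrete, here is a small sketch that lists the fields of a format string in the order the formatter meets them. It uses string.Formatter.parse() from the standard library; the helper name field_order is made up for the illustration.

```python
from string import Formatter


def field_order(format_string):
    """Return the field names in the order they appear in the format."""
    return [
        field_name
        for _literal, field_name, _spec, _conversion in Formatter().parse(format_string)
        if field_name is not None  # trailing literal text has no field
    ]


print(field_order("{0} {1}"))      # fields appear in argument order
print(field_order("{1} {0} {1}"))  # out of order, and one argument reused
```

In the second case, arguments already consumed cannot simply be dropped from the argument tuple after a switch to unicode, which is why the remaining format would need renumbering.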

Call trace of str.format(): do_string_format() -> build_string() -> output_markup() -> render_field(). The argument type is processed in render_field().
msg100770 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-03-09 23:44
*Draft* patch fixing the issue: render_field() raises an error if the argument is a unicode string; string_format() catches this error, converts self to unicode, and calls unicode.format(*args, **kw).

Pseudo-code:

 try:
    # self.format() raises an error if any argument is
    # a unicode string
    return self.format(*args, **kw)
 except UnicodeError:
    unicode_self = self.decode(default_encoding)
    return unicode_self.format(*args, **kw)

The patch changes the result type of '{}'.format(u'ascii'): it was str and it becomes unicode. The new behaviour is consistent with "%s" % u"ascii" => u"ascii" (unicode).

I'm not sure that catching *any* unicode error is a good idea. I think that it would be better to use a new exception type dedicated to this issue, but defining a new exception looks complex. I may do it in the next version of the patch ;-)
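
For illustration, the retry idea with a dedicated exception could look roughly like the sketch below. It is a model only, not the patch: it runs on Python 3 with bytes standing in for the 2.x str and str for unicode, it checks all arguments up front rather than inside render_field(), and every name in it (UnicodeArgumentError, format_bytes, format_with_fallback, the encoding parameter) is hypothetical.

```python
class UnicodeArgumentError(Exception):
    """Hypothetical dedicated signal: a text argument reached byte formatting."""


def _as_text(value, encoding):
    # Decode byte strings; leave every other value untouched.
    return value.decode(encoding) if isinstance(value, bytes) else value


def format_bytes(fmt, *args, **kwargs):
    """Byte-only formatting that refuses text arguments (simplified model)."""
    if any(isinstance(v, str) for v in list(args) + list(kwargs.values())):
        raise UnicodeArgumentError("text argument in byte-string format")
    # latin-1 maps each byte to one code point, so this round trip is lossless.
    text_args = [_as_text(a, "latin-1") for a in args]
    text_kwargs = {k: _as_text(v, "latin-1") for k, v in kwargs.items()}
    return fmt.decode("latin-1").format(*text_args, **text_kwargs).encode("latin-1")


def format_with_fallback(fmt, *args, encoding="ascii", **kwargs):
    """Try byte formatting first; on a text argument, redo the work as text."""
    try:
        return format_bytes(fmt, *args, **kwargs)
    except UnicodeArgumentError:
        text_args = [_as_text(a, encoding) for a in args]
        text_kwargs = {k: _as_text(v, encoding) for k, v in kwargs.items()}
        return fmt.decode(encoding).format(*text_args, **text_kwargs)


print(repr(format_with_fallback(b"{0} {1}", b"bytes", "\u3042")))
```

Catching only the dedicated exception means an unrelated UnicodeError raised by an argument's own formatting code is not silently turned into a second formatting attempt.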
msg100771 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-03-09 23:50
My patch converts the format string to unicode using the default encoding. That is inconsistent with str%args, which converts str to unicode using the ASCII charset (if at least one argument is a unicode string), not the default encoding.

>>> "\xff%s" % u'\xe9'
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
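
The difference only shows up when the default encoding is not ASCII. A small sketch of the two promotion rules, written for Python 3 with bytes standing in for the 2.x str: "latin-1" is just an assumed stand-in for a non-ASCII default encoding, and the helper name promote is made up.

```python
FORMAT = b"\xff{0}"


def promote(fmt, encoding):
    """Decode the byte format with the given encoding, then format as text."""
    return fmt.decode(encoding).format("\xe9")


# The rule str%args follows: ASCII only, so a 0xff byte is rejected.
try:
    promote(FORMAT, "ascii")
    ascii_outcome = "decoded"
except UnicodeDecodeError:
    ascii_outcome = "UnicodeDecodeError"

# The rule in the draft patch, assuming the default encoding were latin-1.
latin1_result = promote(FORMAT, "latin-1")

print(ascii_outcome)
print(repr(latin1_result))
```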
msg100861 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-03-11 14:52
I'm not sure I'm wild about doing the work twice, once as string and once as unicode if need be. But I'll consider it, especially since this is only a 2.7 issue.

There could be side effects of evaluating the replacement strings, but I'm not sure it's worth worrying about. Attribute (or index) access having side effects isn't something I think we need to cater to.
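
A contrived sketch of the kind of side effect meant here: an attribute whose value changes each time it is read, so formatting the same field twice observes two different values. The class name CountingProbe is made up for the illustration.

```python
class CountingProbe(object):
    """Object whose 'value' attribute counts how often it has been read."""

    def __init__(self):
        self.reads = 0

    @property
    def value(self):
        self.reads += 1
        return self.reads


probe = CountingProbe()

# A format-then-retry scheme evaluates "{0.value}" once per attempt.
first = "{0.value}".format(probe)
second = "{0.value}".format(probe)

print(first, second, probe.reads)
```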
History
Date User Action Args
2010-09-09 19:04:19 flox set nosy: + flox
2010-03-11 14:52:28 eric.smith set messages: + msg100861
2010-03-09 23:50:03 haypo set messages: + msg100771
2010-03-09 23:44:34 haypo set files: + issue7300-trunk.patch
keywords: + patch
messages: + msg100770
2010-03-09 22:57:46 haypo set messages: + msg100769
2010-03-08 01:18:41 pablomouzo set nosy: + pablomouzo
2010-03-07 21:32:49 haypo set nosy: + haypo
2010-01-14 00:12:14 ezio.melotti set nosy: + ezio.melotti
2009-11-14 02:09:07 ezio.melotti set priority: high
stage: test needed
versions: + Python 2.7
2009-11-10 13:57:33 doerwalter create