classification
Title: Unicode arguments in str.format()
Type: behavior
Stage: test needed
Components: Interpreter Core
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: eric.smith
Nosy List: Pedro.Algarvio, chris.jerdonek, doerwalter, eric.smith, ezio.melotti, flox, georg.brandl, gkcn, haypo, pablomouzo
Priority: high
Keywords: patch

Created on 2009-11-10 13:57 by doerwalter, last changed 2013-03-28 10:33 by georg.brandl. This issue is now closed.

Files
File name: issue7300-trunk.patch
Uploaded: haypo, 2010-03-09 23:44
Messages (10)
msg95114 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-11-10 13:57
str.format() doesn't handle unicode arguments:

Python 2.6.4 (r264:75706, Oct 27 2009, 15:18:04) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '{0}'.format(u'\u3042')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in
position 0: ordinal not in range(128)

Unicode arguments should be treated the same way the % operator
treats them: by promoting the format string to unicode:

>>> '%s' % u'\u3042'
u'\u3042'
msg100769 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-03-09 22:57
PyString_Format() uses a "goto unicode;" if a '%c' or '%s' argument is unicode. The unicode label converts the partially formatted result (a byte string) to unicode, and uses PyUnicode_Format() to finish the formatting.

I don't think that the same algorithm (converting the partial result to unicode) can be applied here, because it would require rewriting the format string: arguments can be used twice or more, and in any order.

Example: "{0} {1}".format("bytes", u"unicode") => the switch to unicode occurs at result="bytes ", format=" {1}", arguments=(u"unicode"). Converting "bytes " to unicode is easy, but the format has to be rewritten to " {0}" or something else.

Call trace of str.format(): do_string_format() -> build_string() -> output_markup() -> render_field(). The argument type is processed in render_field().
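
The ordering problem described above can be seen by listing the replacement fields of a format string. This is only a sketch using the standard library's string.Formatter.parse(); the helper name field_names is hypothetical and has nothing to do with the C implementation in render_field().

```python
from string import Formatter


def field_names(format_string):
    """Return the replacement-field names in the order they appear."""
    parsed = Formatter().parse(format_string)
    return [name
            for _literal, name, _spec, _conversion in parsed
            if name is not None]


# Fields may repeat and appear in any order, so a converter that switched
# to unicode partway through could not simply drop the arguments it had
# already consumed and renumber the rest.
print(field_names("{1} {0} {1}"))
```

With positional fields that repeat or run out of order, there is no simple way to rewrite the remaining format string against a shortened argument tuple.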
msg100770 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-03-09 23:44
*Draft* patch fixing the issue: render_field() raises an error if the argument is a unicode string; string_format() catches this error, converts self to unicode and calls unicode.format(*args, **kw).

Pseudo-code:

 try:
    # self.format() raises an error if any argument is
    # a unicode string
    return self.format(*args, **kw)
 except UnicodeError:
    # retry with the format string promoted to unicode
    unicode_format = self.decode(default_encoding)
    return unicode_format.format(*args, **kw)

The patch changes the result type of '{}'.format(u'ascii'): it was str and it becomes unicode. The new behaviour is consistent with "%s" % u"ascii" => u"ascii" (unicode).

I'm not sure that catching *any* unicode error is a good idea. I think that it would be better to use a new exception type dedicated to this issue, but defining a new exception looks complex. I may do it in the next version of the patch ;-)
msg100771 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-03-09 23:50
My patch converts the format string to unicode using the default encoding. This is inconsistent with str%args: str%args converts str to unicode using the ASCII charset (if at least one argument is a unicode string), not the default encoding.

>>> "\xff%s" % u'\xe9'
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
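
The difference between the two promotion rules can be illustrated as follows. This is a sketch, not the patch itself: promote_format_string is a hypothetical helper, and latin-1 merely stands in for a default encoding that has been changed away from ASCII.

```python
def promote_format_string(fmt_bytes, encoding="ascii"):
    """Decode a byte format string before formatting unicode arguments."""
    return fmt_bytes.decode(encoding)


fmt = b"\xff{0}"

# ASCII promotion (what str % args does) rejects the non-ASCII byte.
try:
    promote_format_string(fmt)
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A more permissive encoding silently accepts the same byte.
promoted = promote_format_string(fmt, "latin-1")
print(ascii_ok, promoted.format(u"\xe9") == u"\xff\xe9")
```

The same format string therefore either fails loudly or produces a result, depending only on which encoding is used for the promotion.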
msg100861 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2010-03-11 14:52
I'm not sure I'm wild about doing the work twice, once as string and once as unicode if need be. But I'll consider it, especially since this is only a 2.7 issue.

There could be side effects of evaluating the replacement strings, but I'm not sure it's worth worrying about. Attribute (or index) access having side effects isn't something I think we need to cater to.
msg178596 - Author: Pedro Algarvio (Pedro.Algarvio) Date: 2012-12-30 18:35
This is not a 2.7 issue only:

>>> import sys
>>> sys.version_info
(2, 6, 5, 'final', 0)
>>> 'Foo {0}'.format(u'bár')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 1: ordinal not in range(128)
>>>
msg178599 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-12-30 18:45
2.6 only gets security fixes.

> My patch converts the format string to unicode using the default
> encoding. It's inconsistent with str%args: str%args converts str to
> unicode using the ASCII charset (if at least one argument is a unicode
> string), not the default encoding.

I think it's better to be consistent and use ASCII.
msg178617 - Author: STINNER Victor (haypo) * (Python committer) Date: 2012-12-30 21:49
Another option is to decide that this issue will *not* be fixed in Python 2, and that Python 3 *is* the right solution if you have this issue.

Doing the work twice can cause new problems: formatting an argument twice may return two different values :-( It may also hurt performance and introduce regressions.
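
A minimal sketch of how a two-pass approach could yield two different values. The Ticket class and the format_two_passes helper are hypothetical; the second call only stands in for the unicode retry that the draft patch would perform.

```python
class Ticket(object):
    """Object whose formatted value changes on every formatting call."""

    def __init__(self):
        self.calls = 0

    def __format__(self, spec):
        self.calls += 1
        return str(self.calls)


def format_two_passes(fmt, *args):
    # First pass: stands in for the byte-string attempt.
    first = fmt.format(*args)
    # Second pass: stands in for the retry with a unicode format string.
    second = fmt.format(*args)
    return first, second


first, second = format_two_passes("{0}", Ticket())
print(first, second)
```

Because __format__ runs once per pass, the value seen by the retry is not the value the first attempt computed.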

Oh, by the way, it's trivial to work around this issue in Python 2: just use a Unicode format string. For example, replace '{0}'.format(u'\u3042') with u'{0}'.format(u'\u3042').

I hate implicit conversion from bytes to Unicode in Python 2, so maybe it's better not to add a new special case?
msg178618 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2012-12-30 21:52
I agree that we should close this as "won't fix" in 2.7.
msg185426 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-03-28 10:33
Agreed with Eric.
History
Date User Action Args
2013-03-28 10:33:27 georg.brandl set status: open -> closed; nosy: + georg.brandl; messages: + msg185426; resolution: wont fix
2012-12-30 21:52:06 eric.smith set messages: + msg178618
2012-12-30 21:49:10 haypo set messages: + msg178617
2012-12-30 18:45:16 ezio.melotti set messages: + msg178599
2012-12-30 18:35:48 Pedro.Algarvio set nosy: + Pedro.Algarvio; messages: + msg178596
2012-09-26 18:50:11 ezio.melotti set nosy: + chris.jerdonek; versions: - Python 2.6
2012-06-11 11:40:34 gkcn set nosy: + gkcn
2010-09-09 19:04:19 flox set nosy: + flox
2010-03-11 14:52:28 eric.smith set messages: + msg100861
2010-03-09 23:50:03 haypo set messages: + msg100771
2010-03-09 23:44:34 haypo set files: + issue7300-trunk.patch; keywords: + patch; messages: + msg100770
2010-03-09 22:57:46 haypo set messages: + msg100769
2010-03-08 01:18:41 pablomouzo set nosy: + pablomouzo
2010-03-07 21:32:49 haypo set nosy: + haypo
2010-01-14 00:12:14 ezio.melotti set nosy: + ezio.melotti
2009-11-14 02:09:07 ezio.melotti set priority: high; stage: test needed; versions: + Python 2.7
2009-11-10 13:57:33 doerwalter create