classification
Title: unicode(exception) and str(exception) should return the same message on Py2.6
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: barry, cvrebert, exarkun, ezio.melotti, ncoghlan, pitrou, rbcollins
Priority: release blocker Keywords: patch

Created on 2009-05-26 04:53 by ezio.melotti, last changed 2009-12-24 23:03 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
issue6108_testcase.diff ezio.melotti, 2009-11-12 05:22 Testcase that checks if str() and unicode() return the same message
output_on_py26.txt ezio.melotti, 2009-11-12 05:24 Output of unicode_exceptions.py on Python 2.6
unicode_exceptions.py ezio.melotti, 2009-11-12 05:25 Example script that shows the str() and unicode() of some exceptions
issue6108.diff ezio.melotti, 2009-12-12 03:21 Proof of concept that fixes UnicodeDecodeError
issue6108-2.patch ezio.melotti, 2009-12-13 04:39 Patch that makes all the tests in issue6108_testcase pass (except for KeyError).
issue6108-3.patch ezio.melotti, 2009-12-13 06:01 Patch that makes all the tests in issue6108_testcase pass.
issue6108-4.patch ezio.melotti, 2009-12-20 19:29 Patch + unittests against trunk
issue6108-5.patch ezio.melotti, 2009-12-21 16:00 Patch + unittests against trunk
issue6108-6.patch ezio.melotti, 2009-12-21 17:40 Patch + unittests against trunk
Messages (25)
msg88330 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-05-26 04:52
On Python 2.5 str(exception) and unicode(exception) return the same text:
>>> err
UnicodeDecodeError('ascii', '\xc3\xa0', 0, 1, 'ordinal not in range(128)')
>>> str(err)
"'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"
>>> unicode(err)
u"'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"

On Python 2.6 unicode(exception) returns unicode(exception.args):
>>> err
UnicodeDecodeError('ascii', '\xc3\xa0', 0, 1, 'ordinal not in range(128)')
>>> str(err)
"'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"
>>> unicode(err)
u"('ascii', '\\xc3\\xa0', 0, 1, 'ordinal not in range(128)')"

This seems to affect only exceptions with more than 1 arg (e.g.
UnicodeErrors and SyntaxErrors). KeyError is also different (the '' are
missing with unicode()).

Note that when an exception like ValueError() is instantiated with more
than 1 arg even str() returns str(exception.args) on both Py2.5 and Py2.6.
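
The two str() behaviours mentioned above (the quotes added by KeyError,
and the fallback to str(exception.args) for more than one arg) can be
checked directly. This is a minimal sketch in Python 3 syntax, so
unicode() is not involved, and the argument values are arbitrary:

```python
# KeyError with a single argument formats it with repr(), hence the quotes.
key_error_text = str(KeyError('foo'))

# A generic exception with more than one argument falls back to str(args).
value_error = ValueError('foo', 'bar')
value_error_text = str(value_error)

print(key_error_text)    # the message, wrapped in quotes
print(value_error_text)  # the args tuple as text
```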

Probably __str__() checks the number of args before returning a specific
message and if it doesn't match it returns str(self.args). __unicode__()
instead seems to always return unicode(self.args) on Py2.6.

Attached is a script that prints the repr(), str() and unicode() of
some exceptions; run it on Py2.5 and Py2.6 to see the differences.
msg92561 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2009-09-13 05:00
Perhaps also worth noting is that in Python 2.4 as well, str(exception)
and unicode(exception) returned the same thing.  Unlike some other
exception changes in 2.6, this doesn't seem to be a return to older
behavior, but just a new behavior.  (Or maybe no one cares about that;
just wanted to point it out, though.)
msg92568 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-09-13 14:36
Looks like a potentially annoying bug to me.
msg93313 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2009-09-29 18:31
Since we do not yet have a patch for this, I'm knocking it off the list
for 2.6.3.  It seems like an annoying loss of compatibility, but do we
have any reports of it breaking real-world code?
msg95158 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-11-12 05:22
I added the output of unicode_exceptions.py on Py2.6 and a testcase
(against the trunk) that fails for 5 different exceptions, including the
IOError mentioned in #6890 (also added to unicode_exceptions.py).
The problem has been introduced by #2517.
msg96281 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-12 03:21
In r64791, BaseException gained a new __unicode__ method that does the
equivalent of the following things:
 * if the number of args is 0, returns u''
 * if it's 1 returns unicode(self.args[0])
 * if it's >1 returns unicode(self.args)

Before this, BaseException only had a __str__ method, so unicode(e)
(with e being an exception derived from BaseException) called:
 * e.__str__().decode(), if e didn't implement __unicode__
 * e.__unicode__(), if e implemented an __unicode__ method

Now, all the derived exceptions that don't implement their own
__unicode__ method inherit the "generic" __unicode__ of BaseException,
and they use that instead of falling back on __str__.
This is generally ok if the number of args is 0 or 1, but if there are
more args, there's usually some specific formatting in the __str__
method that is lost when BaseException.__unicode__ returns
unicode(self.args).
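
The three rules above can be modelled in a few lines. This is only a
sketch, not the C code from r64791: it is written for Python 3, where
str plays the role of unicode, and base_unicode_model is a hypothetical
name chosen for illustration.

```python
def base_unicode_model(exc):
    """Model of the generic __unicode__ rules (str stands in for unicode)."""
    args = exc.args
    if len(args) == 0:
        return ''
    if len(args) == 1:
        return str(args[0])
    return str(args)


err = UnicodeDecodeError('ascii', b'\xc3\xa0', 0, 1, 'ordinal not in range(128)')

print(str(err))                 # specific message built by the subclass __str__
print(base_unicode_model(err))  # bare args tuple, the specific formatting is lost
```

The second line of output is the kind of result reported for
unicode(err) on Py2.6 at the top of this issue.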

Possible solutions:
 1) implement a __unicode__ method that does the equivalent of calling
unicode(str(self)) (i.e. converting to unicode the message returned by
__str__ instead of converting self.args);
 2) implement a __unicode__ method that formats the message as __str__
for all the exceptions with a __str__ that does some specific formatting;

Attached there's a proof of concept (issue6108.diff) where I tried to
implement the first method with UnicodeDecodeError. This method can be
used as long as __str__ always returns only ascii.

The patch seems to work fine for me (note: this is my first attempt to
use the C API). If the approach is correct I can do the same for the
other exceptions too and submit a proper patch.
msg96297 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-12 15:53
> In r64791, BaseException gained a new __unicode__ method that does the
> equivalent of the following things:

It remains to be seen why that behaviour was chosen. Apparently Nick
implemented it.
IMO __unicode__ should have the same behaviour as __str__. There's no
reason to implement two different formatting algorithms.
msg96313 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2009-12-13 01:09
Following this down the rabbit hole a little further: Issue #2517 (the
origin of my checkin) was just a restoration of the __unicode__ slot
implementation that had been ripped out in r51837 due to Issue #1551432.

At the time of the r64791 checkin, BaseException_str and
BaseException_unicode were identical aside from the type of object
returned (checking SVN head shows they're actually still identical).

However, it looks like several exceptions with __str__ overrides (i.e.
Unicode[Encode/Decode/Translate]Error_str, EnvironmentError_str,
WindowsError_str, SyntaxError_str, KeyError_str) are missing
corresponding __unicode__ overrides, so invoking unicode() on them falls
back to the BaseException_unicode implementation instead of using the
custom formatting behaviour of the subclass.
msg96314 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-13 01:39
> IMO __unicode__ should have the same behaviour as __str__. There's no
> reason to implement two different formatting algorithms.

If BaseException has both the methods they have to be both overridden by
derived exceptions in order to have the same behaviour. The simplest way
to do it is to convert the string returned by __str__ to unicode, as I
did in issue6108.diff.
If you have better suggestions let me know.
msg96315 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-13 01:52
Well the obvious problem with this approach is that it won't work if
__str__() returns a non-ascii string. The only working solution would be
to replicate the functioning of __str__() in each __unicode__()
implementation.
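
A small illustration of that limitation, as a sketch under assumptions:
it is written for Python 3, where a bytes object stands in for the
Python 2 str returned by __str__(), and unicode_from_str_result is a
hypothetical helper that models unicode(str(e)) as a decode with the
ASCII default encoding.

```python
def unicode_from_str_result(raw_message):
    """Model of unicode(str(e)): decode the byte string as ASCII."""
    return raw_message.decode('ascii')


# An ASCII-only message converts without problems.
print(unicode_from_str_result(b"can't decode byte 0xc3"))

# A message containing non-ASCII bytes cannot be converted this way.
try:
    unicode_from_str_result('f\xf6\xf6'.encode('utf-8'))
except UnicodeDecodeError as exc:
    print('failed:', exc)
```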
msg96318 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2009-12-13 04:24
As Antoine said, there's a reason BaseException now implements both
__str__ and __unicode__ and doesn't implement the latter in terms of the
former - it's the only way to consistently support Unicode arguments
that can't be encoded to an 8-bit ASCII string:

>>> str(Exception(u"\xc3\xa0"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> unicode(Exception(u"\xc3\xa0"))
u'\xc3\xa0'

For some of the exception subclasses that will always return ASCII (e.g.
KeyError, which calls repr() on its arguments) then defining __unicode__
in terms of __str__ as Ezio suggests will work.

For others (as happened with BaseException itself), the __unicode__
method will need to be a reimplementation that avoids trying to encode
potentially non-ASCII characters into an 8-bit ASCII string.
msg96319 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-13 04:39
What you said is only a special case, and I agree that the solution
introduced with r64791 is correct for that. However, that fix has the
side effect of breaking the code in other situations.


To summarize the possible cases and the behaviours, I prepared the
following list (odd numbers -> BaseException; even numbers -> any
exception with overridden __str__ and no __unicode__):
1) 0 args, e = Exception():
   py2.5  : str(e) -> ''; unicode(e) -> u''
   py2.6  : str(e) -> ''; unicode(e) -> u''
   desired: str(e) -> ''; unicode(e) -> u''
Note: this is OK

2) 0 args, e = MyException(), with overridden __str__:
   py2.5  : str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
   py2.6  : str(e) -> 'ascii' or error; unicode(e) -> u''
   desired: str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
Note: py2.5 behaviour is better: if __str__ returns an ascii string
(including ''), unicode(e) should return the same string decoded, if
__str__ returns a non-ascii string, both should raise an error.

3a) 1 str arg, e = Exception('foo'):
   py2.5  : str(e) -> 'foo'; unicode(e) -> u'foo'
   py2.6  : str(e) -> 'foo'; unicode(e) -> u'foo'
   desired: str(e) -> 'foo'; unicode(e) -> u'foo'
Note: this is OK

3b) 1 non-ascii unicode arg, e = Exception(u'föö'):
   py2.5  : str(e) -> error; unicode(e) -> error
   py2.6  : str(e) -> error; unicode(e) -> u'föö'
   desired: str(e) -> error; unicode(e) -> u'föö'
Note: py2.6 behaviour is better: unicode(e) should return u'föö'

4) 1 unicode arg, e = MyException(u'föö'), with overridden __str__:
   py2.5  : str(e) -> error or 'ascii'; unicode(e) -> error or u'ascii'
   py2.6  : str(e) -> error or 'ascii'; unicode(e) -> u'föö'
   desired: str(e) -> error or 'ascii'; unicode(e) -> error or u'ascii'
Note: py2.5 behaviour is better: if __str__ returns an ascii string
str(e) should work, otherwise it should raise an error. unicode(e)
should return the ascii string decoded or an error, not the arg.

5) >1 args of any type, e = Exception('foo', u'föö', 5):
   py2.5  : str(e) ->  "('foo', u'f\\xf6\\xf6', 5)";
        unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
   py2.6  : str(e) ->  "('foo', u'f\\xf6\\xf6', 5)";
        unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
   desired: str(e) ->  "('foo', u'f\\xf6\\xf6', 5)";
        unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
Note: this is OK

6) >1 args of any type, e = MyException('foo', u'föö', 5), with
overridden __str__:
   py2.5  : str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
   py2.6  : str(e) -> 'ascii' or error; unicode(e) -> u"('foo', u'f\\xf6\\xf6', 5)";
   desired: str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
Note: py2.5 behaviour is better: if __str__ returns an ascii string,
unicode(e) should return the same string decoded, if __str__ returns a
non-ascii string, both should raise an error.

As you can see, your example corresponds just to the case 3b) (now
fixed), but cases 2, 4, 6 are now broken.

Making this list allowed me to come up with a new patch that seems to
solve all the problems (2, 4 and 6, while leaving 3b as it is now). The
only exception is KeyError: if we want it to print the repr, then
KeyError_unicode should be implemented, but I think that Python only
calls str() so it's probably not necessary.

Attached is a new patch that passes all the tests in issue6108_testcase
except for KeyError. Unless you disagree with the 'desired behaviours'
that I listed, this patch should fix the issue.
msg96321 - (view) Author: Robert Collins (rbcollins) * Date: 2009-12-13 04:44
"2) 0 args, e = MyException(), with overridden __str__:
   py2.5  : str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
   py2.6  : str(e) -> 'ascii' or error; unicode(e) -> u''
   desired: str(e) -> 'ascii' or error; unicode(e) -> u'ascii' or error;
Note: py2.5 behaviour is better: if __str__ returns an ascii string
(including ''), unicode(e) should return the same string decoded, if
__str__ returns a non-ascii string, both should raise an error.
"

I'm not sure how you justify raising an unnecessary error when trying to
stringify an exception as being 'better'.

__str__ should not decode its arguments if they are already strings:
they may be valid data for the user even if they are not decodable (and
note that an implicit decode may try to decode('ascii'), which is totally
useless).

__str__ and __unicode__ are /different/ things, claiming they have to
behave the same is equivalent to claiming either that we don't need
unicode, or that we don't need binary data.

Surely there is space for both things, which does imply that
unicode(str(e)) != unicode(e).

Why _should_ that be the same anyway?
msg96322 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2009-12-13 04:49
I agree the 2.6 implementation creates backwards compatibility problems,
which we didn't recognise at the time, for subclasses that only override
__str__.

An alternative approach that should work even for the KeyError case is
for BaseException_unicode to check explicitly for the situation where
the __str__ slot has been overridden but __unicode__ is still the
BaseException version and invoke "PyObject_Unicode(PyObject_Str(self))"
when it detects that situation.

That way subclasses that only override __str__ would continue to see the
old behaviour, while subclasses that don't override either would
continue to benefit from the new behaviour.
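
The check described above can be sketched in pure Python. This is a
model under assumptions, not the C implementation: it targets Python 3
(str stands in for unicode), and ModelBaseException, ModelKeyError and
to_unicode are hypothetical names standing in for BaseException,
KeyError and the __unicode__ slot.

```python
class ModelBaseException(object):
    """Hypothetical stand-in for BaseException; str plays the role of unicode."""

    def __init__(self, *args):
        self.args = args

    def __str__(self):
        if len(self.args) == 0:
            return ''
        if len(self.args) == 1:
            return str(self.args[0])
        return str(self.args)

    def to_unicode(self):
        # Stand-in for BaseException_unicode. A subclass that defines its
        # own to_unicode never reaches this code, so only the __str__
        # override has to be checked here.
        if type(self).__str__ is not ModelBaseException.__str__:
            # __str__ was overridden: reuse the subclass formatting.
            return str(self.__str__())
        # Neither method was overridden: keep the generic behaviour.
        return ModelBaseException.__str__(self)


class ModelKeyError(ModelBaseException):
    def __str__(self):
        # Like KeyError, show the repr of a single argument.
        if len(self.args) == 1:
            return repr(self.args[0])
        return ModelBaseException.__str__(self)


print(ModelKeyError('foo').to_unicode())          # uses the overridden __str__
print(ModelBaseException('a', 'b').to_unicode())  # generic args-based text
```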
msg96323 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-13 04:51
Assume the case of e = MyException() (note: 0 args) with a __str__ that
returns a default message. Now, if the message is ascii, str(e) works
and the user sees the default message but unicode(e) returns a
not-so-useful empty string.
On the other hand, if __str__ returns a non-ascii string, then it's
wrong in the first place, because str(e) will fail and returning an
empty string with unicode(e) is not going to help.
msg96324 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-13 04:56
> An alternative approach that should work even for the KeyError case is
> for BaseException_unicode to check explicitly for the situation where
> the __str__ slot has been overridden but __unicode__ is still the
> BaseException version and invoke "PyObject_Unicode(PyObject_Str(self))"
> when it detects that situation.

This is even better, I'll try to do it.
msg96325 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-13 06:01
Here is a new patch (issue6108-3.patch) that checks if __str__ has been
overridden and calls PyObject_Unicode(PyObject_Str(self)).

All the tests (including the one with KeyError) in
issue6108_testcase.diff now pass.

If the patch is OK I'll make sure that the tests cover all the possible
cases that I listed and possibly add a few more before the commit.
msg96331 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-13 12:52
You should check the return value from PyObject_Str().
msg96717 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-20 19:29
I created a comprehensive set of tests to check all the possibilities
that I listed in msg96319 and updated the patch for Objects/exceptions.c.
Without the patch all the test_*_with_overridden__str__ and
test_builtin_exceptions tests fail, both on 2.6 and on trunk; with the
patch all the tests pass.
The code in exceptions.c now does the equivalent of unicode(e.__str__())
instead of unicode(str(e)). If e.__str__() returns a non-ascii unicode
string, unicode() now shows the message instead of raising an error.
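
The difference between unicode(str(e)) and unicode(e.__str__()) can be
modelled as follows. This is a rough sketch under assumptions: it runs
on Python 3, with bytes standing in for the Python 2 str type and str
standing in for unicode, and py2_str, py2_unicode and MyException are
hypothetical names used only for illustration.

```python
def py2_str(obj):
    """Model of Python 2 str(): the result must be a byte string."""
    result = obj.__str__()
    if isinstance(result, str):          # "unicode" in this model
        result = result.encode('ascii')  # implicit ASCII encode, may fail
    return result


def py2_unicode(value):
    """Model of Python 2 unicode() applied to a str or unicode object."""
    if isinstance(value, bytes):
        return value.decode('ascii')
    return value


class MyException(Exception):
    def __str__(self):
        return 'f\xf6\xf6'  # a non-ASCII "unicode" message


exc = MyException()

# Equivalent of unicode(e.__str__()): the message is returned unchanged.
via_method = py2_unicode(exc.__str__())
print(ascii(via_method))

# Equivalent of unicode(str(e)): the implicit ASCII encode fails first.
try:
    py2_unicode(py2_str(exc))
except UnicodeEncodeError as error:
    print('failed:', error)
```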
msg96742 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-21 11:54
> I created a comprehensive set of tests to check all the possibilities
> that I listed in msg96319 and updated the patch for Object/exceptions.c.

Great!
Small thing: in tests, you should use setUp() to initialize test data
rather than __init__().
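
A minimal sketch of that pattern, assuming nothing about the actual
test file in the patch (the class name, test names and test data below
are hypothetical, and the code is written for Python 3):

```python
import unittest


class ExceptionMessageTests(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, so each test gets fresh data.
        self.exceptions = [KeyError('foo'), ValueError('foo', 'bar')]

    def test_str_is_text(self):
        for exc in self.exceptions:
            self.assertIsInstance(str(exc), str)

    def test_args_preserved(self):
        self.assertEqual(self.exceptions[1].args, ('foo', 'bar'))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExceptionMessageTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Unlike __init__(), setUp() is called by the test runner once per test
method, so the data cannot leak from one test to the next.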
msg96755 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-21 16:00
I updated the patch and moved the helper class outside the __init__.
msg96756 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-21 16:05
This looks fine, modulo the slight style issue mentioned on IRC. Please
commit after you fix it.
(this is assuming all tests pass, of course!)
msg96761 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-21 17:40
This should be the final patch (issue6108-6.patch). I updated the
comments, checked that (some of) the tests fail without the patch, that
they (all) pass with it and that there are no leaks.
I plan to backport this to 2.6 and possibly port the tests to py3k and 3.1.
msg96762 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-12-21 18:34
It's ok for me.
msg96872 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-12-24 23:03
Fixed in r77045 (trunk) and r77046 (release26-maint). No need to port it
to py3k since unicode() is gone.
History
Date User Action Args
2009-12-24 23:03:08 ezio.melotti set status: open -> closed
messages: + msg96872
keywords: - needs review
resolution: accepted -> fixed
stage: commit review -> resolved
2009-12-21 18:34:06 pitrou set resolution: accepted
messages: + msg96762
stage: patch review -> commit review
2009-12-21 17:40:36 ezio.melotti set files: + issue6108-6.patch
resolution: accepted -> (no value)
messages: + msg96761
stage: commit review -> patch review
2009-12-21 16:05:58 pitrou set resolution: accepted
messages: + msg96756
stage: patch review -> commit review
2009-12-21 16:00:29 ezio.melotti set files: + issue6108-5.patch
messages: + msg96755
2009-12-21 11:54:50 pitrou set messages: + msg96742
2009-12-20 19:29:52 ezio.melotti set files: + issue6108-4.patch
messages: + msg96717
2009-12-13 12:52:51 pitrou set messages: + msg96331
2009-12-13 06:01:54 ezio.melotti set files: + issue6108-3.patch
messages: + msg96325
2009-12-13 04:56:44 ezio.melotti set messages: + msg96324
2009-12-13 04:51:43 ezio.melotti set messages: + msg96323
2009-12-13 04:49:12 ncoghlan set messages: + msg96322
2009-12-13 04:44:57 rbcollins set nosy: + rbcollins
messages: + msg96321
2009-12-13 04:39:31 ezio.melotti set keywords: + needs review
files: + issue6108-2.patch
messages: + msg96319
stage: needs patch -> patch review
2009-12-13 04:24:37 ncoghlan set messages: + msg96318
2009-12-13 01:52:48 pitrou set messages: + msg96315
2009-12-13 01:39:56 ezio.melotti set messages: + msg96314
2009-12-13 01:09:45 ncoghlan set messages: + msg96313
2009-12-12 15:53:07 pitrou set nosy: + ncoghlan
messages: + msg96297
2009-12-12 03:21:58 ezio.melotti set files: + issue6108.diff
assignee: ezio.melotti
messages: + msg96281
keywords: + patch
2009-11-12 05:25:54 ezio.melotti set files: + unicode_exceptions.py
2009-11-12 05:24:18 ezio.melotti set keywords: - patch
files: + output_on_py26.txt
2009-11-12 05:23:02 ezio.melotti set files: - unicode_exceptions.py
2009-11-12 05:22:50 ezio.melotti set files: + issue6108_testcase.diff
priority: high -> release blocker
title: unicode(exception) behaves differently on Py2.6 when len(exception.args) > 1 -> unicode(exception) and str(exception) should return the same message on Py2.6
messages: + msg95158
keywords: + patch
2009-09-29 18:31:08 barry set priority: release blocker -> high
nosy: + barry
messages: + msg93313
2009-09-16 19:59:17 georg.brandl link issue6890 superseder
2009-09-16 19:41:34 georg.brandl set priority: high -> release blocker
2009-09-13 14:36:34 pitrou set priority: high
versions: + Python 2.7
nosy: + pitrou
messages: + msg92568
stage: needs patch
2009-09-13 05:00:41 exarkun set nosy: + exarkun
messages: + msg92561
2009-09-12 08:45:38 cvrebert set nosy: + cvrebert
2009-05-29 07:37:10 georg.brandl link issue5274 superseder
2009-05-26 04:53:13 ezio.melotti create