Error when printing an exception containing a Unicode string #46769

Closed
christoph mannequin opened this issue Mar 30, 2008 · 36 comments

@christoph
Mannequin

christoph mannequin commented Mar 30, 2008

BPO 2517
Nosy @malemburg, @birkenfeld, @amauryfa, @ncoghlan, @pitrou, @benjaminp, @ezio-melotti
Files
  • unicode_exception_warning.patch
  • exception-unicode.diff: Patch implementing BaseException.unicode
  • tp_unicode_exception.patch: Patch to provide tp_unicode slot, and implementation for Exception
  • exception-unicode-with-type-fetch.diff: Implement Nick Coghlan's suggestion from 67944.
  • exception-unicode-with-type-fetch-no-whitespace-changes.diff: Simon's patch with unneeded whitespace changes removed.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ncoghlan'
    closed_at = <Date 2008-07-08.14:15:53.841>
    created_at = <Date 2008-03-30.23:13:52.000>
    labels = ['type-bug', 'expert-unicode']
    title = 'Error when printing an exception containing a Unicode string'
    updated_at = <Date 2019-01-10.21:18:29.713>
    user = 'https://bugs.python.org/christoph'

    bugs.python.org fields:

    activity = <Date 2019-01-10.21:18:29.713>
    actor = 'piotr.dobrogost'
    assignee = 'ncoghlan'
    closed = True
    closed_date = <Date 2008-07-08.14:15:53.841>
    closer = 'ncoghlan'
    components = ['Unicode']
    creation = <Date 2008-03-30.23:13:52.000>
    creator = 'christoph'
    dependencies = []
    files = ['9915', '10559', '10562', '10580', '10585']
    hgrepos = []
    issue_num = 2517
    keywords = ['patch']
    message_count = 36.0
    messages = ['64770', '64771', '64779', '64781', '64782', '64786', '64793', '64794', '64795', '64797', '64798', '64802', '64807', '64866', '64876', '67863', '67865', '67867', '67868', '67869', '67870', '67874', '67875', '67944', '67946', '67947', '67950', '67974', '67980', '67984', '67985', '67994', '68394', '69384', '69436', '333419']
    nosy_count = 12.0
    nosy_names = ['lemburg', 'georg.brandl', 'amaury.forgeotdarc', 'ncoghlan', 'davidfraser', 'ggenellina', 'pitrou', 'benjamin.peterson', 'christoph', 'ezio.melotti', 'hodgestar', 'piotr.dobrogost']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2517'
    versions = ['Python 2.6', 'Python 2.5']

    @christoph
    Mannequin Author

    christoph mannequin commented Mar 30, 2008

    Python seems to have problems when an exception is thrown that
    contains non-ASCII text as a message and is converted to a string.

    >>> try:
    ...     raise Exception(u'Error when printing ü')
    ... except Exception, e:
    ...     print e
    ...
    Traceback (most recent call last):
      File "<stdin>", line 4, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 20: ordinal not in range(128)

    See
    http://www.stud.uni-karlsruhe.de/~uyhc/de/content/python-and-exceptions-containing-unicode-messages

    @christoph christoph mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Mar 30, 2008
    @benjaminp
    Contributor

    That is because Python encodes its error messages as ASCII by default,
    and "ü" is not in ASCII. You can fix this with
    print unicode_msg.encode("utf-8") or something similar.
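
For illustration, here is a minimal sketch of the failure and of the suggested workaround. It is written for Python 3 so that it runs on current interpreters; the thread itself is about Python 2, where the ASCII encoding happens implicitly. The helper name `try_ascii` is made up for this example.

```python
message = "Error when printing \u00fc"  # the message from the report


def try_ascii(text):
    """Return the error raised by a strict ASCII encode, or None on success."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


error = try_ascii(message)
# The error records the codec and the index of the offending character.
print(error.encoding, error.start, ascii(error.object[error.start]))

# Encoding explicitly to UTF-8 succeeds; this is the suggested workaround.
utf8_bytes = message.encode("utf-8")
print(utf8_bytes)
```

The index stored in the error is the same "position 20" that the traceback in the report mentions.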

    @christoph
    Mannequin Author

    christoph mannequin commented Mar 31, 2008

    To be more precise: I see no easy way to get the encapsulated non-ASCII
    message back out of the exception.
    Taking e from my last post, none of the following will work:
    str(e) # UnicodeEncodeError
    e.__str__() # UnicodeEncodeError
    e.__unicode__() # AttributeError
    unicode(e) # UnicodeEncodeError
    unicode(e, 'utf8') # TypeError

    My workaround right now is to raise the exception with an
    already converted string (see the link I provided).

    But since the tutorials simply use "print e", I guess the behaviour
    described above is some kind of bug.

    @benjaminp
    Contributor

    Use: print unicode(e.message).encode("utf-8")

    @christoph
    Mannequin Author

    christoph mannequin commented Mar 31, 2008

    Thanks, this does work.

    But, where can I find the piece of information you just gave to me in
    the docs? I couldn't find any interface definition for Exceptions.

    Furthermore, will this be regarded as a bug?
    From [1] I understand that "unicode(e)" and "unicode(e, 'utf8')" are
    supposed to work. No limitations are placed on the type of the object.
    And I suppose that unicode() is the exact equivalent of str(), except
    that it copes with unicode strings. Simple string conversion of an
    Exception is expected to work for ASCII text, so it seems needlessly
    cumbersome that it does not return a Unicode string when the content
    is non-ASCII.

    Please reopen if my report does have a point.

    [1] http://docs.python.org/lib/built-in-funcs.html

    @amauryfa
    Member

    Note the interpreter cannot print the exception either:

    >>> raise Exception(u'Error when printing ü')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    Exception>>>

    @benjaminp
    Contributor

    I am going to reopen this issue for Py3k. The recommended encoding for
    Python source files in 2.x is ASCII; I wouldn't say correctly dealing
    with non-ASCII exceptions is fully supported. In 3.x, however, the
    recommended encoding is UTF-8, so this should work.

    In Py3k,
    str(e) # str is unicode in Py3k
    does work correctly, and that will have to be used because the message
    attribute is gone in 3.x.
    However, the problem Amaury pointed out is not fixed. Exceptions that
    cannot be encoded into ASCII are silently not printed. I think a warning
    should at least be printed.
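
A short Python 3 sketch of the two points about 3.x made here: str(e) returns the Unicode message unchanged, and the 2.x `message` attribute no longer exists. The variable names are chosen only for this example.

```python
try:
    raise Exception("Error when printing \u00fc")
except Exception as exc:
    caught = exc

# str is a Unicode string in Python 3, so no encoding step is involved.
text = str(caught)

# The `message` attribute from Python 2.5/2.6 is gone; use str(e) or e.args.
has_message = hasattr(caught, "message")

print(text == "Error when printing \u00fc", has_message)
```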

    @benjaminp benjaminp reopened this Mar 31, 2008
    @benjaminp benjaminp removed the invalid label Mar 31, 2008
    @christoph
    Mannequin Author

    christoph mannequin commented Mar 31, 2008

    Though I welcome the reopening of the bug for Python 3.0, I must say
    that the plan not to fix a core element rather surprises me.

    I never believed Python to be a programming language with good Unicode
    integration. Several points were missing that would've been nice or
    even essential to have for good development with Unicode, most ignored
    for the sake of maintaining backward compatibility. This though is not
    the fault of the Unicode class itself and supporting packages.

    Some modules like the one for CSV are lacking full Unicode support.
    But nevertheless the basic Python would always give you the
    possibility to use Unicode in (at least) a consistent way. For me
    raising exceptions does count as basic support like this.

    So I still hope to see this solved for the 2.x versions which I read
    will be maintained even after the release of 3.0.

    @benjaminp
    Contributor

    > I never believed Python to be a programming language with good Unicode
    > integration. Several points were missing that would've been nice or
    > even essential to have for good development with Unicode, most ignored
    > for the sake of maintaining backward compatibility. This though is not
    > the fault of the Unicode class itself and supporting packages.

    Many (including myself) agree with you. That's pretty much the whole
    point of Py3k. We want to fix the Python "warts" which can only be fixed
    by breaking backwards compatibility.

    @amauryfa
    Member

    Even in 2.5, __str__ is allowed to return a Unicode object;
    we could change BaseException_str this way:

    Index: exceptions.c
    ===================================================================

    --- exceptions.c	(revision 61957)
    +++ exceptions.c	(working copy)
    @@ -108,6 +104,11 @@
             break;
         case 1:
             out = PyObject_Str(PyTuple_GET_ITEM(self->args, 0));
    +        if (out == NULL && PyErr_ExceptionMatches(PyExc_UnicodeEncodeError))
    +        {
    +            PyErr_Clear();
    +            out = PyObject_Unicode(PyTuple_GET_ITEM(self->args, 0));
    +        }
             break;
         default:
             out = PyObject_Str(self->args);

    Then str(e) still raises UnicodeEncodeError,
    but unicode(e) returns the original message.

    But I would like the opinion of an experienced core developer...

    @benjaminp
    Contributor

    After thinking some more, I'm going to add 2.6 to this. I'm attaching a
    patch for the trunk (it can be merged in Py3k, and maybe 2.5) which
    displays a UnicodeWarning when an Exception cannot be displayed due to
    encoding issues.

    Georg, can you review Amaury's and my patches? Also, would mine be a
    candidate for 2.5 backporting?

    @pitrou
    Member

    pitrou commented Apr 1, 2008

    Shouldn't it be an exception rather than a warning? The fact that an
    exception can be downgraded to a warning (and thus involuntarily
    silenced) is a bit disturbing IMHO.

    Another possibility would be to display the warning, and *then* to
    encode the exception message again in "replace" or "ignore" mode rather
    than "strict" mode. That way exception messages are always displayed,
    but not always properly. The ASCII part of the message is generally
    useful, since it gives the exception name and most often the reason too.
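
To make the "replace" and "ignore" modes concrete, here is a small sketch using the standard codec error handlers (Python 3 syntax). The "backslashreplace" handler is not mentioned in this thread; it is added here only as a third option that keeps the information about the original character.

```python
message = "Error when printing \u00fc"

# "strict" (the default) would raise UnicodeEncodeError for this message.
replaced = message.encode("ascii", "replace")           # non-ASCII becomes "?"
ignored = message.encode("ascii", "ignore")             # non-ASCII is dropped
escaped = message.encode("ascii", "backslashreplace")   # non-ASCII becomes an escape

print(replaced)
print(ignored)
print(escaped)
```

In all three cases the ASCII part of the message survives, which is the property argued for above.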

    @benjaminp
    Contributor

    Have you looked at PyErr_Display? There are many, many possible
    exceptions, and it ignores them all because "too many callers rely on
    this." So, I think all we can do is warn. I will look into encoding the
    message differently.

    @christoph
    Mannequin Author

    christoph mannequin commented Apr 2, 2008

    JFTR:

    print unicode(e.message).encode("utf-8")
    only works on Python 2.5, not on earlier versions.

    @benjaminp
    Contributor

    We can't do much about that because only security fixes are backported
    to versions < 2.5.

    @hodgestar
    Mannequin

    hodgestar mannequin commented Jun 9, 2008

    One of the examples Christoph tried was

      unicode(Exception(u'\xe1'))

    which fails quite oddly with:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in
    position 0: ordinal not in range(128)

    The reason for this is Exception lacks an __unicode__ method
    implementation so that unicode(e) does something like unicode(str(e))
    which attempts to convert the exception arguments to the default
    encoding (almost always ASCII) and fails.
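
The fallback described in this paragraph can be modelled roughly as follows. This is only a sketch in Python 3, not the real PyObject_Unicode code: the function `py2_style_unicode` and the class `UnicodeAwareError` are hypothetical names, and the explicit ASCII round trip stands in for Python 2's default-encoding conversion.

```python
def py2_style_unicode(obj, default_encoding="ascii"):
    """Rough model of Python 2's unicode(obj); not the real implementation."""
    method = getattr(obj, "__unicode__", None)
    if method is not None:
        return method()
    # No __unicode__: go through str(), which in Python 2 has to produce a
    # byte string in the default encoding. This step fails for non-ASCII text.
    as_bytes = str(obj).encode(default_encoding)
    return as_bytes.decode(default_encoding)


class UnicodeAwareError(Exception):
    """Hypothetical exception that supplies its own __unicode__."""

    def __unicode__(self):
        return self.args[0]


try:
    py2_style_unicode(Exception("\u00e1"))
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start

recovered = py2_style_unicode(UnicodeAwareError("\u00e1"))
print(failed_at, recovered == "\u00e1")
```

In this model the plain Exception fails at position 0, as in the error quoted above, while the class with its own `__unicode__` returns the message untouched.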

    Fixing this seems quite important. It's common to want to raise errors
    with non-ASCII characters (e.g. when the data which caused the error
    contains such characters). Usually the code raising the error has no way
    of knowing how the characters should be encoded (exceptions can end up
    being written to log files, displayed in web interfaces, that sort of
    thing). This means raising exceptions with unicode messages. Using
    unicode(e.message) is unattractive since it won't work in 3.0 and also
    does not duplicate str(e)'s handling of the other exception __init__
    arguments.
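
As a reminder of what "str(e)'s handling of the other exception __init__ arguments" means, here is a small sketch. It runs on Python 3, where the three cases (no argument, one argument, several arguments) behave the same way; the example values are arbitrary.

```python
no_args = str(Exception())                    # empty string
one_arg = str(Exception("disk full"))         # the single argument itself
two_args = str(Exception("disk full", 28))    # string form of the args tuple

print(repr(no_args), repr(one_arg), repr(two_args))
```

Code that only looks at e.message loses the second and later arguments, which is why a working unicode(e) is the more attractive fix.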

    I'm attaching a patch which implements __unicode__ for BaseException.
    Because of the lack of a tp_unicode slot to mirror tp_str slot, this
    breaks the test that calls unicode(Exception). The existing test for
    unicode(e) does unicode(Exception(u"Foo")) which is a bit of a non-test.
    My patch adds a test of unicode(Exception(u'\xe1')) which fails without
    the patch.

    A quick look through trunk suggests implementing tp_unicode actually
    wouldn't be a huge job. My worry is that this would constitute a change
    to the C API for PyObjects and has little chance of acceptance into 2.6
    (and in 3.0 all these issues disappear anyway). If there is some chance
    of acceptance, I'm willing to write a patch that adds tp_unicode.

    @davidfraser
    Mannequin

    davidfraser mannequin commented Jun 9, 2008

    Aha - the __unicode__ method was previously there in Python 2.5, and was
    ripped out because of the unicode(Exception) problem. See
    http://bugs.python.org/issue1551432.

    The reversion is in
    http://svn.python.org/view/python/trunk/Objects/exceptions.c?rev=51837&r1=51770&r2=51837

    @benjaminp
    Contributor

    On Mon, Jun 9, 2008 at 8:40 AM, Simon Cross <report@bugs.python.org> wrote:

    > Simon Cross <hodgestar@gmail.com> added the comment:
    >
    > One of the examples Christoph tried was
    >
    > unicode(Exception(u'\xe1'))
    >
    > which fails quite oddly with:
    >
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in
    > position 0: ordinal not in range(128)
    >
    > The reason for this is Exception lacks an __unicode__ method
    > implementation so that unicode(e) does something like unicode(str(e))
    > which attempts to convert the exception arguments to the default
    > encoding (almost always ASCII) and fails.

    What version are you using? In Py3k, str is unicode so __str__ can
    return a unicode string.

    > Fixing this seems quite important. It's common to want to raise errors
    > with non-ASCII characters (e.g. when the data which caused the error
    > contains such characters). Usually the code raising the error has no way
    > of knowing how the characters should be encoded (exceptions can end up
    > being written to log files, displayed in web interfaces, that sort of
    > thing). This means raising exceptions with unicode messages. Using
    > unicode(e.message) is unattractive since it won't work in 3.0 and also
    > does not duplicate str(e)'s handling of the other exception __init__
    > arguments.
    >
    > I'm attaching a patch which implements __unicode__ for BaseException.
    > Because of the lack of a tp_unicode slot to mirror tp_str slot, this
    > breaks the test that calls unicode(Exception). The existing test for
    > unicode(e) does unicode(Exception(u"Foo")) which is a bit of a non-test.
    > My patch adds a test of unicode(Exception(u'\xe1')) which fails without
    > the patch.
    >
    > A quick look through trunk suggests implementing tp_unicode actually
    > wouldn't be a huge job. My worry is that this would constitute a change
    > to the C API for PyObjects and has little chance of acceptance into 2.6
    > (and in 3.0 all these issues disappear anyway). If there is some chance
    > of acceptance, I'm willing to write a patch that adds tp_unicode.

    Email Python-dev for permission.

    @hodgestar
    Mannequin

    hodgestar mannequin commented Jun 9, 2008

    Concerning http://bugs.python.org/issue1551432:

    I'd much rather have working unicode(e) than working unicode(Exception).
    Calling unicode(C) on any class C which overrides __unicode__ is broken
    without tp_unicode anyway.

    @hodgestar
    Mannequin

    hodgestar mannequin commented Jun 9, 2008

    Benjamin Peterson wrote:

    > What version are you using? In Py3k, str is unicode so __str__ can
    > return a unicode string.

    I'm sorry it wasn't clear. I'm aware that this issue doesn't apply to
    Python 3.0. I'm testing on both Python 2.5 and Python 2.6 for the
    purposes of the bug.

    The code I'm developing hits these issues with database exceptions that
    carry unicode messages and are raised inside MySQLdb on Python 2.5.

    The patch I submitted is against trunk.

    @malemburg
    Member

    Removing 3.0 from the versions list.

    @davidfraser
    Mannequin

    davidfraser mannequin commented Jun 9, 2008

    So I've got a follow-up patch that adds tp_unicode.
    Caveat that I've never done anything like this before and it's almost
    certain to be wrong.

    It does however generate the desired result in this case :-)

    @benjaminp
    Contributor

    On Mon, Jun 9, 2008 at 2:04 PM, David Fraser <report@bugs.python.org> wrote:

    > David Fraser <davidf@sjsoft.com> added the comment:
    >
    > So I've got a follow-up patch that adds tp_unicode.
    > Caveat that I've never done anything like this before and it's almost
    > certain to be wrong.

    Unfortunately, adding a slot is a bit more complicated. You have to
    deal with inheritance and such. Have a look in typeobject.c for all
    the gory details. I'd recommend you write to python-dev before going
    on the undertaking, though.

    > It does however generate the desired result in this case :-)
    >
    > Added file: http://bugs.python.org/file10562/tp_unicode_exception.patch
    >
    > Python tracker <report@bugs.python.org>
    > <http://bugs.python.org/issue2517>


    @ncoghlan
    Contributor

    As far as I am concerned, the implementation of PyObject_Unicode in
    object.c has a bug in it: it should NEVER be retrieving __unicode__ from
    the instance object. The implementation of PyObject_Format in abstract.c
    shows the correct way to retrieve a pseudo-slot method like __unicode__
    from an arbitrary object.

    Line 482 in object.c is the offending line:
    func = PyObject_GetAttr(v, unicodestr);

    Fix that bug, then add a __unicode__ method back to Exception objects
    and you will have the best of both worlds.

    @malemburg
    Member

    On 2008-06-11 11:32, Nick Coghlan wrote:

    > Nick Coghlan <ncoghlan@gmail.com> added the comment:
    >
    > As far as I am concerned, the implementation of PyObject_Unicode in
    > object.c has a bug in it: it should NEVER be retrieving __unicode__ from
    > the instance object. The implementation of PyObject_Format in abstract.c
    > shows the correct way to retrieve a pseudo-slot method like __unicode__
    > from an arbitrary object.

    The only difference I can spot is that the PyObject_Format() code
    special cases non-instance objects.

    > Line 482 in object.c is the offending line:
    > func = PyObject_GetAttr(v, unicodestr);
    >
    > Fix that bug, then add a __unicode__ method back to Exception objects
    > and you will have the best of both worlds.

    I'm not sure whether that would really solve anything.

    IMHO, it's better to implement the tp_unicode slot and then
    check that before trying .__unicode__ (as mentioned in the comment
    in PyObject_Unicode()).

    @ncoghlan
    Contributor

    Here's the key difference with the way PyObject_Format looks up the
    pseudo-slot method:

    PyObject *method = _PyType_Lookup(Py_TYPE(obj),
                                      str__format__);

    _PyType_Lookup instead of PyObject_GetAttr - so unicode(Exception) would
    only look for type.__unicode__ and avoid getting confused by the utterly
    irrelevant Exception.__unicode__ method (which is intended only for
    printing Exception instances, not for printing the Exception type itself).

    You then need the PyInstance_Check/PyObject_GetAttr special case for
    retrieving the bound method because _PyType_Lookup won't work on classic
    class instances.
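
The difference between the two lookups can be sketched in pure Python (Python 3 syntax). This is an analogy rather than the C code: `lookup_on_type` is a hypothetical stand-in for what _PyType_Lookup does, and `DemoError` is an example class invented here.

```python
def lookup_on_type(obj, name):
    """Find `name` on type(obj) by walking the MRO, ignoring obj itself."""
    for klass in type(obj).__mro__:
        if name in vars(klass):
            return vars(klass)[name]
    return None


class DemoError(Exception):
    """Hypothetical exception with a __unicode__ method."""

    def __unicode__(self):
        return "message for an instance"


# Plain attribute lookup on the class finds the function meant for
# instances; calling it without an instance fails. This corresponds to
# the unicode(Exception) problem.
try:
    getattr(DemoError, "__unicode__")()
    attribute_lookup_ok = True
except TypeError:
    attribute_lookup_ok = False

# Type-based lookup: for the class object itself, type(DemoError) is `type`,
# which has no __unicode__, so a caller can fall back to str().
found_for_class = lookup_on_type(DemoError, "__unicode__")
found_for_instance = lookup_on_type(DemoError("x"), "__unicode__")

print(attribute_lookup_ok, found_for_class, found_for_instance is not None)
```

With the type-based lookup, the method is found for instances and skipped for the class, which is the "best of both worlds" outcome described earlier in the thread.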

    @hodgestar
    Mannequin

    hodgestar mannequin commented Jun 11, 2008

    Attached a patch which implements Nick Coghlan's suggestion. All
    existing tests in test_exceptions.py and test_unicode.py pass as does
    the new unicode(Exception(u"\xe1")) test.

    @ncoghlan
    Contributor

    Minor cleanup of Simon's patch attached - aside from a couple of
    unneeded whitespace changes, it all looks good to me.

    Not checking it in yet, since it isn't critical for this week's beta
    release - I'd prefer to leave it until after that has been dealt with.

    @malemburg
    Member

    On 2008-06-11 16:15, Nick Coghlan wrote:

    > Nick Coghlan <ncoghlan@gmail.com> added the comment:
    >
    > Minor cleanup of Simon's patch attached - aside from a couple of
    > unneeded whitespace changes, it all looks good to me.
    >
    > Not checking it in yet, since it isn't critical for this week's beta
    > release - I'd prefer to leave it until after that has been dealt with.
    >
    > Added file: http://bugs.python.org/file10585/exception-unicode-with-type-fetch-no-whitespace-changes.diff

    That approach is fine as well.

    I still like the idea to add a tp_unicode slot, though, since that's
    still missing for C extension types to benefit from.

    Perhaps we can have both ?!

    @ncoghlan
    Contributor

    I'm not sure adding a dedicated method slot would be worth the hassle
    involved - Py3k drops back to just the tp_str slot anyway, and the only
    thing you gain with a tp_unicode slot over _PyType_Lookup of a
    __unicode__ attribute is a small reduction in memory usage and a slight
    speed increase.

    @hodgestar
    Mannequin

    hodgestar mannequin commented Jun 11, 2008

    Re msg67974:

    > Minor cleanup of Simon's patch attached - aside from a couple of
    > unneeded whitespace changes, it all looks good to me.
    >
    > Not checking it in yet, since it isn't critical for this week's beta
    > release - I'd prefer to leave it until after that has been dealt with.

    Thanks for the clean-up, Nick. The mixture of tabs and spaces in the
    current object.c was unpleasant :/.

    @malemburg
    Member

    On 2008-06-11 16:49, Nick Coghlan wrote:

    > Nick Coghlan <ncoghlan@gmail.com> added the comment:
    >
    > I'm not sure adding a dedicated method slot would be worth the hassle
    > involved - Py3k drops back to just the tp_str slot anyway, and the only
    > thing you gain with a tp_unicode slot over _PyType_Lookup of a
    > __unicode__ attribute is a small reduction in memory usage and a slight
    > speed increase.

    AFAIK, _PyType_Lookup will only work for base types, i.e. objects
    subclassing from object. C extension types often do not inherit from
    object, since the attribute access mechanisms and object creation
    are a lot simpler when not doing so.

    @hodgestar
    Mannequin

    hodgestar mannequin commented Jun 19, 2008

    Just prodding the issue again now that the betas are out. What's the
    next step?

    @ncoghlan
    Contributor

    ncoghlan commented Jul 7, 2008

    Adding this to my personal to-do list for the next beta release.

    @ncoghlan ncoghlan assigned ncoghlan and unassigned birkenfeld Jul 7, 2008
    @ncoghlan
    Contributor

    ncoghlan commented Jul 8, 2008

    Fixed in 64791.

    Blocked from being merged to Py3k (since there is no longer a
    __unicode__ special method).

    For MAL: the PyInstance_Check included in the patch for the benefit of
    classic classes defined in Python code also covers all of the classic C
    extension classes which are not instances of object.

    @ncoghlan ncoghlan closed this as completed Jul 8, 2008
    @piotrdobrogost
    Mannequin

    piotrdobrogost mannequin commented Jan 10, 2019

    Benjamin Peterson in comment https://bugs.python.org/issue2517#msg64771 wrote:

    "That is because Python encodes it's error messages as ASCII by default…"

    Could somebody please point where in the source code of Python 2 this happens?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022