Printing Unicode chars from the interpreter in a non-UTF8 terminal raises an error (Py3) #49360

Closed
ezio-melotti opened this issue Jan 30, 2009 · 15 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@ezio-melotti
Member

BPO 5110
Nosy @loewis, @atsuoishimoto, @giampaolo, @ezio-melotti, @merwok
Files
  • display_hook_ascii.patch
  • issue5110.txt: Some tests to show the effects of the patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2009-03-27.01:34:13.130>
    created_at = <Date 2009-01-30.15:26:47.830>
    labels = ['type-bug', 'invalid', 'expert-unicode']
    title = 'Printing Unicode chars from the interpreter in a non-UTF8 terminal raises an error (Py3)'
    updated_at = <Date 2010-08-07.11:44:42.424>
    user = 'https://github.com/ezio-melotti'

    bugs.python.org fields:

    activity = <Date 2010-08-07.11:44:42.424>
    actor = 'eric.araujo'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-03-27.01:34:13.130>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2009-01-30.15:26:47.830>
    creator = 'ezio.melotti'
    dependencies = []
    files = ['12894', '12900']
    hgrepos = []
    issue_num = 5110
    keywords = ['patch']
    message_count = 15.0
    messages = ['80820', '80822', '80823', '80824', '80825', '80826', '80845', '80852', '81056', '81059', '81841', '84248', '84965', '84986', '85372']
    nosy_count = 6.0
    nosy_names = ['loewis', 'atsuoi', 'ishimoto', 'giampaolo.rodola', 'ezio.melotti', 'eric.araujo']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue5110'
    versions = ['Python 3.0']

    @ezio-melotti
    Member Author

    In Py2.x
    >>> u'\u2620'
    outputs u'\u2620' whereas
    >>> print u'\u2620'
    raises an error.
    
    Instead, in Py3.x, both
    >>> '\u2620'
    and
    >>> print('\u2620')
    raise an error if the terminal doesn't use an encoding able to display
    the character (e.g. the windows terminal used for these examples).

    This is caused by the new string representation defined in PEP-3138[1].

    Consider also the following example:
    Py2:
    >>> [u'\u2620']
    [u'\u2620']
    Py3:
    >>> ['\u2620']
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2620' in
    position 9: character maps to <undefined>
    
    This means that there is no way to print lists (or other objects) that
    contain characters that can't be encoded.
    Two workarounds may be:
    1) encode all the elements of the list, but it's not practical;
    2) use ascii(), but it adds extra "" around the output and escapes
    backslashes and apostrophes (and it won't be possible to use _[0] in the
    next line).
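
The drawback of workaround 2 can be reproduced without a prompt. The following is a minimal sketch: it assumes that `>>> ascii(value)` displays the repr() of the string that ascii() returns, and the variable names are illustrative only.

```python
value = ["\u2620"]

# ascii() returns a str in which non-ASCII characters are already escaped.
escaped = ascii(value)

# At the prompt, ">>> ascii(value)" shows repr(escaped): the text gains
# surrounding double quotes and its backslash is escaped a second time.
shown_at_prompt = repr(escaped)

# Both strings are pure ASCII, so printing them is safe on any terminal.
print(escaped)
print(shown_at_prompt)
```

Because the prompt then stores the escaped string in `_`, `_[0]` would be the character `[` rather than the original list element.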
     
    Also note that in Py3
    >>> ['\ud800']
    ['\ud800']
    >>> _[0]
    '\ud800'
    works, because U+D800 belongs to the category "Cs (Other, Surrogate)"
    and it is escaped[2].
    
    The best solution is probably to change the default error-handler of the
    Python3 interactive interpreter to 'backslashreplace' in order to avoid
    this behavior, but I don't know if it's possible only for ">>> foo" and
    not for ">>> print(foo)" (print() should still raise an error as it does
    in Py2).
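
The effect of the 'backslashreplace' error handler can be tried without a Windows console. This is a minimal sketch, assuming cp437 stands in for the terminal encoding; `write_with_handler` is a hypothetical helper introduced for illustration, not part of Python.

```python
import io

def write_with_handler(text, encoding="cp437", errors="strict"):
    """Write `text` to an in-memory text stream and return the encoded bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# With the default 'strict' handler the write fails, as in the report.
try:
    write_with_handler("\u2620")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With 'backslashreplace' the character is written as an escape sequence.
escaped_bytes = write_with_handler("\u2620", errors="backslashreplace")
```

A stream configured this way never raises UnicodeEncodeError, which is why applying it to sys.stdout as a whole would also change what print() does.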

    This proposal has already been refused in PEP-3138[3] but there are
    no links to the discussion that led to this decision.

    I think this should be rediscussed and possibly changed, because, even
    if I can't see the "listOfJapaneseStrings"[4], I still prefer to see a
    sequence of escaped chars than a UnicodeEncodeError.

    @ezio-melotti ezio-melotti added topic-unicode type-bug An unexpected behavior, bug, or error labels Jan 30, 2009
    @vstinner
    Member

    To be clear, this issue only affects the interpreter.

    > 2) use ascii(), but it adds extra "" around the output

    It doesn't add extra "" if you replace repr() with ascii() in the
    interpreter code (sys.displayhook).

    > The best solution is probably to change the default error-handler
    > of the Python3 interactive interpreter to 'backslashreplace'
    > in order to avoid this behavior, (...)

    Hum, it implies that sys.stdout has a different behaviour in the
    interpreter and when running a script. We can expect many bug reports
    from newbies: "the example works in the terminal/IDLE, but not in my
    script, HELP!". So I prefer ascii().

    @vstinner
    Member

    You can change the display hook with a site.py script (which has
    to be in sys.path):
    ---------

    import sys
    
    def hook(message):
        print(ascii(message))
    
    sys.displayhook = hook

    Example (run python in an empty environment to get ASCII charset):
    ---------

    $ env -i PYTHONPATH=$PWD ./python
    Python 3.1a0 (py3k:69105M, Jan 30 2009, 10:36:27)
    >>> import sys
    >>> sys.stdout.encoding
    'ANSI_X3.4-1968'
    >>> "\xe9"
    '\xe9'
    >>> print("\xe9")
    Traceback (most recent call last):
      (...)
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' (...)

    @ezio-melotti
    Member Author

    This seems to solve the problem, but apparently the interactive "_"
    doesn't work anymore.

    @vstinner
    Member

    Oh yeah, the original sys.displayhook uses a special hack for the _ global
    variable:
    ---------

    import sys
    import builtins
    
    def hook(message):
        if message is None:
            return
        builtins._ = message
        print(ascii(message))
    
    sys.displayhook = hook

    @vstinner
    Member

    Here is a patch to use ascii() directly in sys_displayhook() (with a
    unit test!).

    @loewis
    Mannequin

    loewis mannequin commented Jan 31, 2009

    Victor, I'm not sure whether you are proposing that
    display_hook_ascii.patch is included into Python. IIUC, this patch
    breaks PEP-3138, so it clearly must be rejected.

    Overall, I fail to see the bug in this report. Python 3.0 works as
    designed as shown here.

    @ezio-melotti
    Member Author

    This seems to fix the problem:
    ------------------------------

    import sys
    import builtins
    
    def hook(message):
        if message is None:
            return
        builtins._ = message
        try:
            print(repr(message))
        except UnicodeEncodeError:
            print(ascii(message))
    
    sys.displayhook = hook

    Just to clarify:
    * The current Py3 behavior works fine in UTF8 terminals
    * It doesn't work on non-UTF8 terminals if they can't encode the chars
    (they raise an error)
    * It only affects the interactive interpreter
    * This new patch escapes the chars instead of raising an error only on
    non-UTF8 terminals and only when printed as ">>> foo" (without print())
    and leaves the other behaviors unchanged
    * This is related to Py3 only

    Apparently the patch provided by Victor always escapes the non-ascii
    chars. This new hook function prints the Unicode chars if possible and
    escapes them if not. On a UTF8 terminal the behavior is unchanged, on a
    non-UTF8 terminal all the chars that can not be encoded will now be escaped.
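
The fallback described above can be exercised without a Windows console by pointing sys.stdout at an in-memory stream. This is a sketch under stated assumptions: cp437 stands in for the terminal encoding, `hook` repeats the function from this message, and `displayed` is a hypothetical test helper.

```python
import builtins
import io
import sys

def hook(message):
    if message is None:
        return
    builtins._ = message
    try:
        print(repr(message))
    except UnicodeEncodeError:
        # The terminal encoding cannot represent the repr: escape it instead.
        print(ascii(message))

def displayed(value, encoding="cp437"):
    """Return the bytes hook() writes to a stdout that uses `encoding`."""
    raw = io.BytesIO()
    # newline="\n" disables newline translation so the bytes are portable.
    fake_stdout = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    saved = sys.stdout
    sys.stdout = fake_stdout
    try:
        hook(value)
        fake_stdout.flush()
    finally:
        sys.stdout = saved
    return raw.getvalue()

encodable = displayed("\u00e8")               # cp437 can encode this character
fallback = displayed("\u2620 and \u00e8")     # U+2620 forces the ascii() path
```

Note that the fallback escapes the whole value, so the encodable character U+00E8 is escaped as well once U+2620 is present.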

    This only changes the behavior of ">>> foo", so it can not lead to
    confusion ("It works in the interpreter but not in the script"). In a
    script one can't write "foo" alone but "print(foo)" and the behavior of
    "print(foo)" is the same in both the interpreter and the scripts (with
    the patch applied):
    
    >>> ['\u2620']
    ['\u2620']
    >>> print(['\u2620'])
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2620' in
    position 2: character maps to <undefined>

    I think that the PEP-3138 didn't consider this issue. Its purpose is to
    have a better output (Unicode chars instead of escaped chars), but it
    only works with UTF8 terminals, on non-UTF8 terminals the output is
    worse (UnicodeEncodeError instead of escaped chars).

    This is an improvement and I can't see any negative side-effect.

    Attached is a text file with more examples, on Py2 and Py3, on
    Windows (non-UTF8 terminal) and Linux (UTF8 terminal), with and without
    my patch.

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    > Victor, I'm not sure whether you are proposing that
    > display_hook_ascii.patch is included into Python. IIUC, this patch
    > breaks PEP-3138, so it clearly must be rejected.
    >
    > Overall, I fail to see the bug in this report. Python 3.0 works as
    > designed as shown here.

    The idea is to avoid Unicode errors (by replacing non-printable characters
    with their hexadecimal code) when the display hook tries to display a
    message which is not printable in the terminal charset.

    It's just to make Python3 interpreter a little bit more "user friendly" on
    Windows.

    Problem: using different (encoding) rules for the display hook and for
    print() may confuse new users (Why does ">>> chr(...)" work whereas
    ">>> print(chr(...))" fails?).

    @ezio-melotti
    Member Author

    > Problem: using different (encoding) rules for the display hook and for
    > print() may confuse new users (Why does ">>> chr(...)" work whereas
    > ">>> print(chr(...))" fails?).

    This is the same behavior that Python2.x has (with the only difference
    that Py2 always shows the char as u'\uXXXX' if >0x7F whereas Py3 /tries/
    to display it):
    >>> unichr(0x0100)
    u'\u0100'
    >>> print unichr(0x0100)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\u0100' in
    position 0: character maps to <undefined>

    @ezio-melotti ezio-melotti changed the title Printing Unicode chars from the interpreter in a non-UTF8 terminal (Py3) Printing Unicode chars from the interpreter in a non-UTF8 terminal raises an error (Py3) Feb 3, 2009
    @ezio-melotti
    Member Author

    I've also noticed that if an error contains non-encodable characters,
    they are escaped:
    >>> raise ValueError("\u2620 can't be printed here, but '\u00e8' works fine!")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: \u2620 can't be printed here, but 'è' works fine!
    
    but:
    >>> "\u2620 can't be printed here, but '\u00e8' works fine!"
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2620' in
    position 1: character maps to <undefined>
    
    The mechanism used to escape errors is even better than my patch,
    because it escapes only the chars that can't be encoded, whereas my patch
    escapes every non-ASCII char when at least one char can't be encoded:
    >>> "\u2620 can't be printed here, but '\u00e8' works fine!"
    "\u2620 can't be printed here, but '\xe8' works fine!"

    I wonder if we can reuse the same mechanism here.
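
One way to get the same per-character escaping for ">>> foo" is to encode the repr with the 'backslashreplace' handler and decode it again. This is only a sketch, not the mechanism the traceback code actually uses: `escape_unencodable` is a hypothetical helper and cp437 is an assumed terminal encoding.

```python
def escape_unencodable(text, encoding):
    """Escape only the characters of `text` that `encoding` cannot represent."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

message = "\u2620 can't be printed here, but '\u00e8' works fine!"

# U+2620 becomes the six characters \u2620, while U+00E8 is kept because
# cp437 can encode it.
shown = escape_unencodable(repr(message), "cp437")
```

The result can always be written to a stream that uses the same encoding, since every remaining character is encodable by construction.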

    By the way, the patch I proposed in msg80852 is just a proof of concept,
    if you think it's OK, someone will probably have to implement it in C.

    @vstinner
    Member

    martin> IIUC, this patch breaks PEP-3138,
    martin> so it clearly must be rejected.

    After reading PEP-3138, it's clear that this issue is not a bug, and
    that we cannot accept any patch fixing the issue without breaking the
    PEP.

    Windows users who want to get the Python2 behaviour can use my display
    hook proposed in Message80823.

    We can not fix this issue, so I choose to close it. If anyone wants to
    change the PEP, start a discussion on python-dev first.

    @ezio-melotti
    Member Author

    In the first message I said that this breaks the PEP-3138 because I
    thought that the solution was to change the default error-handler to
    'backslashreplace', but this was already proposed and refused.

    sys.displayhook provides a way to change the behavior of the interactive
    interpreter only when ">>> foo" is used. The PEP doesn't seem to say
    anything about how ">>> foo" should behave.

    Moreover, in the alternate solutions[1] they considered using
    sys.displayhook (and sys.excepthook) but they didn't because "these
    hooks are called only when printing the result of evaluating an
    expression entered in an interactive Python session, and doesn't work
    for the print() function, for non-interactive sessions or for
    logging.debug("%r", ...), etc."

    This is exactly the behavior I intended to have, and, being a unique
    feature of the interactive interpreter, it doesn't lead to inconsistency
    with other situations.
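
The scope of sys.displayhook can be checked from a script. Code compiled in 'single' mode (the mode used for one interactive statement) passes expression results to sys.displayhook, while code compiled in 'exec' mode, as for a script, does not. This is a minimal sketch; `run_and_record` is a hypothetical helper.

```python
import sys

def run_and_record(source, mode):
    """Run `source` compiled in `mode`; return the values given to the hook."""
    seen = []
    previous = sys.displayhook
    sys.displayhook = seen.append   # record the value instead of printing it
    try:
        exec(compile(source, "<demo>", mode), {})
    finally:
        sys.displayhook = previous  # always restore the real hook
    return seen

interactive = run_and_record("'\u2620'", "single")  # like typing it at >>>
script = run_and_record("'\u2620'", "exec")         # like a line in a script
```

Only the first call reaches the hook, so a change to the hook cannot affect scripts or print().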

    @atsuoishimoto
    Mannequin

    atsuoishimoto mannequin commented Apr 1, 2009

    My proposal to make backslashreplace the default error handler
    for interactive sessions was rejected by Guido[1].

    Does something like
    PYTHONIOENCODING=ascii:backslashreplace
    work for you? With PYTHONIOENCODING, you can effectively make
    backslashreplace a default error handler for your environment.
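
The PYTHONIOENCODING suggestion can be tried from a script by starting a child interpreter with the variable set. This is a sketch that assumes the current interpreter (sys.executable) is the one to test; the child program is a one-line illustration.

```python
import os
import subprocess
import sys

# Copy the current environment and add the setting suggested above.
env = dict(os.environ, PYTHONIOENCODING="ascii:backslashreplace")

# The child receives the source text print('\u2620') and prints U+2620.
result = subprocess.run(
    [sys.executable, "-c", "print('\\u2620')"],
    env=env,
    capture_output=True,
    check=True,
)

# stdout holds the escaped form instead of the child raising an error.
child_output = result.stdout.strip()
```

Unlike a display hook, this setting applies to everything written to the standard streams, including print().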

    @ezio-melotti
    Member Author

    What I'm proposing is not to change the default error handler to
    'backslashreplace', but just the behavior of sys.displayhook.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022