Idle shell crash on printing non-BMP unicode character #58408

Closed
vbr mannequin opened this issue Mar 5, 2012 · 31 comments
Labels
topic-IDLE topic-tkinter topic-unicode type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@vbr
Mannequin

vbr mannequin commented Mar 5, 2012

BPO 14200
Nosy @loewis, @terryjreedy, @vstinner, @ned-deily, @ezio-melotti, @serwy, @asvetlov
Files
  • unicodeerror.diff
  • rpc_marshal_exception.patch
  • unicodeerror_rev1.diff
  • issue14200.patch
  • issue14200_rev1.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/asvetlov'
    closed_at = <Date 2012-03-31.12:15:42.421>
    created_at = <Date 2012-03-05.12:39:35.965>
    labels = ['expert-IDLE', 'expert-tkinter', 'expert-unicode', 'type-crash']
    title = 'Idle shell crash on printing non-BMP unicode character'
    updated_at = <Date 2012-03-31.12:15:42.420>
    user = 'https://bugs.python.org/vbr'

    bugs.python.org fields:

    activity = <Date 2012-03-31.12:15:42.420>
    actor = 'asvetlov'
    assignee = 'asvetlov'
    closed = True
    closed_date = <Date 2012-03-31.12:15:42.421>
    closer = 'asvetlov'
    components = ['IDLE', 'Tkinter', 'Unicode']
    creation = <Date 2012-03-05.12:39:35.965>
    creator = 'vbr'
    dependencies = []
    files = ['24748', '24788', '24790', '24848', '24849']
    hgrepos = []
    issue_num = 14200
    keywords = ['patch']
    message_count = 31.0
    messages = ['154944', '154961', '154965', '154967', '154996', '155004', '155009', '155032', '155410', '155421', '155426', '155428', '155429', '155789', '155794', '155805', '155807', '155813', '155817', '155844', '155851', '155898', '155922', '155927', '155930', '155931', '155933', '155943', '156744', '156767', '157182']
    nosy_count = 9.0
    nosy_names = ['loewis', 'terry.reedy', 'vstinner', 'vbr', 'ned.deily', 'ezio.melotti', 'roger.serwy', 'asvetlov', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue14200'
    versions = ['Python 3.3']

    @vbr
    Mannequin Author

    vbr mannequin commented Mar 5, 2012

    Hi,
    while testing python 3.3a1 a bit, especially the new string handling of non-BMP characters, I noticed a problem in Idle in this regard:

    Python 3.3.0a1 (default, Mar 4 2012, 17:27:59) [MSC v.1500 32 bit (Intel)] on win32 ...
    [using Win XP SP3 Czech]

    >>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
    >>> len(got_ahsa)
    1
    >>> got_ahsa.encode("unicode-escape")
    b'\\U00010330'
    >>> got_ahsa

    [crash - idle shell window closes immediately without any visible error message or traceback]

    I realised later that tkinter probably won't be able to print wide-unicode characters anyway (according to
    http://bugs.python.org/issue12342 ), but Idle should probably just print the exception introduced there, e.g.
    ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl

    Regards
    vbr

    @ezio-melotti ezio-melotti added the type-crash A hard crash of the interpreter, possibly with a core dump label Mar 5, 2012
    @serwy
    Mannequin

    serwy mannequin commented Mar 5, 2012

    Hi Vlastimil,

    Can you repeat your test case while running IDLE from the command prompt and report the error you see?

    python -m idlelib.idle
    

    IDLE closes suddenly on Windows because IDLE uses pythonw.exe which has no stdout or stderr. When Tkinter encounters an error and tries to write to stderr, an error is raised in the Tkinter eventloop and the eventloop terminates.
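As a minimal sketch of one way a GUI program could guard against this, the snippet below assumes that the standard streams are `None` when no console is attached (as the Python documentation describes for `pythonw`) and substitutes in-memory buffers. The helper name `ensure_std_streams` is hypothetical and is not part of IDLE.

```python
import io
import sys


def ensure_std_streams():
    """Replace missing standard streams with in-memory buffers.

    Hypothetical helper: returns the names of the streams it replaced.
    """
    replaced = []
    for name in ("stdout", "stderr"):
        # Without a console, sys.stdout / sys.stderr may be None.
        if getattr(sys, name) is None:
            setattr(sys, name, io.StringIO())
            replaced.append(name)
    return replaced
```

Calling this once at startup means later writes to `sys.stderr` land in a buffer instead of failing; a real application would more likely redirect to a log file.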

    @vbr
    Mannequin Author

    vbr mannequin commented Mar 5, 2012

    Hi,
    thanks for the pointer. After invoking idle using python.exe, I don't see the crash mentioned in the report:

    Python 3.3.0a1 (default, Mar  4 2012, 17:27:59) [MSC v.1500 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
    >>> len(got_ahsa)
    1
    >>> got_ahsa.encode("unicode-escape")
    b'\\U00010330'
    >>> got_ahsa

    >>> print(got_ahsa)

    >>>

    I just get an empty line as the "answer", but no crash.

    The console indeed contains the traceback with the error I expected:

    vbr

    ============

    Microsoft Windows XP [Verze 5.1.2600]
    (C) Copyright 1985-2001 Microsoft Corp.

    C:\Python33>python.exe -m idlelib.idle
    *** Internal Error: rpc.py:SocketIO.localcall()

    Object: stdout
    Method: <bound method PseudoFile.write of <idlelib.PyShell.PseudoFile object at
    0x01CDDB50>>
    Args: ("'\U00010330'",)

    Traceback (most recent call last):
      File "C:\Python33\lib\idlelib\rpc.py", line 188, in localcall
        ret = method(*args, **kwargs)
      File "C:\Python33\lib\idlelib\PyShell.py", line 1244, in write
        self.shell.write(s, self.tags)
      File "C:\Python33\lib\idlelib\PyShell.py", line 1226, in write
        OutputWindow.write(self, s, tags, "iomark")
      File "C:\Python33\lib\idlelib\OutputWindow.py", line 40, in write
        self.text.insert(mark, s, tags)
      File "C:\Python33\lib\idlelib\Percolator.py", line 25, in insert
        self.top.insert(index, chars, tags)
      File "C:\Python33\lib\idlelib\ColorDelegator.py", line 80, in insert
        self.delegate.insert(index, chars, tags)
      File "C:\Python33\lib\idlelib\PyShell.py", line 322, in insert
        UndoDelegator.insert(self, index, chars, tags)
      File "C:\Python33\lib\idlelib\UndoDelegator.py", line 81, in insert
        self.addcmd(InsertCommand(index, chars, tags))
      File "C:\Python33\lib\idlelib\UndoDelegator.py", line 116, in addcmd
        cmd.do(self.delegate)
      File "C:\Python33\lib\idlelib\UndoDelegator.py", line 219, in do
        text.insert(self.index1, self.chars, self.tags)
      File "C:\Python33\lib\idlelib\ColorDelegator.py", line 80, in insert
        self.delegate.insert(index, chars, tags)
      File "C:\Python33\lib\idlelib\WidgetRedirector.py", line 104, in __call__
        return self.tk_call(self.orig_and_operation + args)
    ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl

    @terryjreedy
    Member

    On 3.2.2, Win7, the length is 2 and printing in Idle prints a square, as it usually does for chars it cannot print. I presume Tk recognizes surrogate pairs. Printing to the screen should not raise an exception, so the square would be better. Even better would be to do what the 3.2 and 3.3 Command Prompt Interpreters do, which is to print an evaluable representation:

    >>> c
    '\U00010330'

    I assume that this string is produced by python.exe rather than Windows. If so, neither of the two pythonw processes is currently doing the same conversion. My understanding is that the user pythonw process uses idlelib.rpc.RPCproxy objects to ship i/o calls to the idle pythonw process.

    I presume we could find the idle process window .write methods and change lines like
        self.text.insert(mark, s, tags)
    to
        try:
            self.text.insert(mark, s, tags)
        except SomeTkError:
            self.text.insert(mark, expand(s), tags)
    But it seems to me that the expansion should really be done in C in _tkinter, where the internal .kind attribute of strings is available.
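The fallback suggested above can be sketched in pure Python. This is only an illustration: `expand` is given one possible definition (backslash-U escapes), `BmpOnlyText` is a hypothetical stand-in for a Tk text widget, and `ValueError` is assumed as the error type because that is what the traceback earlier in this thread shows.

```python
def expand(s):
    """Replace each non-BMP character with an 8-digit backslash-U escape."""
    return "".join(
        ch if ord(ch) <= 0xFFFF else "\\U%08x" % ord(ch)
        for ch in s
    )


class BmpOnlyText:
    """Hypothetical stand-in for a Tk text widget limited to the BMP."""

    def __init__(self):
        self.chunks = []

    def insert(self, mark, s, tags=None):
        for ch in s:
            if ord(ch) > 0xFFFF:
                raise ValueError(
                    "character U+%x is above the range "
                    "(U+0000-U+FFFF) allowed by Tcl" % ord(ch)
                )
        self.chunks.append(s)


def safe_insert(text, mark, s, tags=None):
    """Try the plain insert first; on failure, insert the escaped form."""
    try:
        text.insert(mark, s, tags)
    except ValueError:
        text.insert(mark, expand(s), tags)
```

With this wrapper, strings that the widget accepts are inserted unchanged, and only rejected strings pay the cost of escaping.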

    ---
    There is also an input crash. On 3.2, I tried to cut the square char and paste it into "ord('')" (both shell and edit window) to see what unicode char it is and IDLE fades away as you describe. That puzzles me, as I am normally able to paste BMP chars into idle without problem. In any case, I presume the problem is not idle-specific and would best be handled in _tkinter. Or does the crash happen in Windows or tcl/tk code before _tkinter ever sees the input?

    When I paste the same into the 3.2 or 3.3 interpreter, it is converted to ascii '?'. I presume this is done by Windows Command Prompt before sending anything to python.

    @vbr
    Mannequin Author

    vbr mannequin commented Mar 6, 2012

    I'd like to add some further observations to the mentioned issue;
    it seems, that the crash is indeed not specific to idle.
    In a sample tkinter app, where I just display e.g. chr(66352) in an Entry widget, I also get the same immediate crash via pythonw.exe and the previously mentioned "proper" ValueError without a crash with python.exe.

    I also tried to explicitly display surrogate pairs, which were used automatically until python 3.2; these can be used in tkinter in 3.3, but there are limitations and discrepancies:

    >>> 
    >>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
    >>> def wide_char_to_surrog_pair(char):
        code_point = ord(char)
        if code_point <= 0xFFFF:
            return char
        else:
            high_surr = (code_point - 0x10000) // 0x400 + 0xD800
            low_surr = (code_point - 0x10000) % 0x400 + 0xDC00
            return chr(high_surr)+chr(low_surr)
    
    >>> ahsa_surrog = wide_char_to_surrog_pair(got_ahsa)
    >>> print(ahsa_surrog)
    𐌰
    >>> repr(ahsa_surrog)
    "'_ud800\x00udf30'"
    >>> ahsa_surrog
    'Pud800 udf30'

    [the space in the middle of the last item might be \x00, as it terminates the clipboard content, the rest is copied separately]

    the printed square corresponds with the given character and can be used in other programs etc. (whereas in py 3.2 the same value was used for repr and a direct "display" of the string in the interpreter, there are three different formats in py 3.3).

    I also noticed that a surrogate pair is not supported as input for unicodedata.name(...) anymore:
     
    >>> import unicodedata
    >>> unicodedata.name(ahsa_surrog)
    Traceback (most recent call last):
      File "<pyshell#60>", line 1, in <module>
        unicodedata.name(ahsa_surrog)
    TypeError: need a single Unicode character as parameter
    >>> 

    (in 3.2 and probably others it returns the expected 'GOTHIC LETTER AHSA')
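For code that still holds surrogate pairs, one possible workaround is to join them back into a single code point before calling unicodedata. The sketch below does so with a UTF-16 round trip using the `surrogatepass` error handler; the helper name `join_surrogates` is hypothetical, and a lone (unpaired) surrogate would make the decode step fail.

```python
import unicodedata


def join_surrogates(s):
    """Combine UTF-16 surrogate pairs in s into single code points."""
    # surrogatepass lets the encoder emit the surrogates as raw UTF-16
    # code units; decoding then reads each pair as one character.
    return s.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


pair = "\ud800\udf30"  # the surrogate pair for U+10330
name = unicodedata.name(join_surrogates(pair))
```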

    (I for my part would think that keeping the input possibilities for unicodedata a bit liberal (but still unambiguous) wouldn't hurt. Also, if tkinter is not going to support wide unicode natively any time soon, output conversion using surrogates, which other programs also understand, seems the most usable option in this regard.)

    Hopefully, this is somehow relevant for the original issue -
    I am somehow not sure, whether some parts would be better posted as separate issues, or whether this is the planned and expected behaviour anyway.

    regards,
    vbr

    @loewis
    Mannequin

    loewis mannequin commented Mar 6, 2012

    Vlastimil: you are mixing issues. Some of your observations are actually correct behaviour; please don't clutter the report with that, but report each separate behavior in a separate report. In Python 3.3, surrogate pairs do *not* substitute for the actual character, since the internal representation is not UTF-16 anymore.

    Also, when you run a Tkinter app in IDLE: while you get a "proper" traceback output, your conclusion that python.exe does not "crash" is incorrect: it crashes just in the very same way that IDLE crashes. Except when run inside IDLE, it is a subprocess that "crashes" (i.e. terminates with a traceback output), not IDLE itself.

    @loewis loewis mannequin closed this as completed Mar 6, 2012
    @loewis loewis mannequin reopened this Mar 6, 2012
    @vbr
    Mannequin Author

    vbr mannequin commented Mar 6, 2012

    Sorry for mixing the different problems, these were somehow things I noticed "at once" in the new python version, but I should have noticed the different domains myself.
    I still might not understand the term "crash" properly - I just meant to distinguish between a single appropriate exception on an invalid operation (while the app is staying alive and works on next valid input) - as is the case with calling through python.exe, and - on the other hand - the immediate termination on encountering the invalid input, which happens with pythonw.exe.

    Now I see, that with pythonw a tk app terminates with the first exception (in general) in py 3.3 and also 3.2 (as opposed to py 2.7, where it just swallows the exception and stays alive, as one would probably expect).

    Should this be reported in a separate issue, or is this what remains relevant in *this* report? (Sorry for the confusion.)

    vbr

    @loewis
    Mannequin

    loewis mannequin commented Mar 6, 2012

    That pythonw suddenly closes is a separate issue: if pythonw attempts to write to stderr, it crashes. To get your example to "run" in pythonw.exe,
    try

    pythonw.exe Lib\idlelib\idle.py 2> out.txt

    I think the behavior of pythonw terminating when it can't write to stderr is actually correct: an exception is raised on attempting to write to stderr, which then can't be printed (because there is no stderr).

    So the real fault here is the traceback that python.exe reports.

    To fix this, I think rpc.py should learn to marshal exceptions back to the subprocess. Then the initial sys.stdout.write should raise a UnicodeError (which it currently doesn't, either). This would get into the displayhook, which would then use sys_displayhook_unencodable to backslash-escape the unsupported character.

    I'll attach a patch that at least makes the exception UnicodeEncodeError.

    @serwy
    Mannequin

    serwy mannequin commented Mar 11, 2012

    Attached is a patch to have the rpc marshal exceptions. When used with Martin's patch, IDLE returns

    >>> '\U00010330'
    Traceback (most recent call last):
      File "<pyshell#3>", line 1, in <module>
        '\U00010330'
    ValueError: character U+10330 is above the range (U+0000-U+FFFF) allowed by Tcl
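The general idea of marshalling an exception across an RPC boundary can be sketched as follows. This is not the attached patch: the function names `call_and_marshal` and `unmarshal` and the `("OK", ...)` / `("EXCEPTION", ...)` tags are assumptions chosen for illustration, and pickle stands in for whatever wire format is actually used.

```python
import pickle


def call_and_marshal(func, *args, **kwargs):
    """Serving side: run the call and pickle its result or its exception."""
    try:
        return pickle.dumps(("OK", func(*args, **kwargs)))
    except Exception as exc:
        # Ship the exception object itself instead of printing it locally.
        return pickle.dumps(("EXCEPTION", exc))


def unmarshal(payload):
    """Calling side: return the result, or re-raise the shipped exception."""
    how, what = pickle.loads(payload)
    if how == "OK":
        return what
    if how == "EXCEPTION":
        raise what
    raise RuntimeError("unknown response tag: %r" % (how,))
```

The effect is that the error surfaces in the process that made the call, where the normal traceback and displayhook machinery can deal with it.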

    Martin: I disagree with the approach of raising a UnicodeEncodeError if IDLE can't render the output of a user's program, especially when the program would otherwise run without error if run from outside of IDLE.

    Would replacing these characters with "?" and documenting this limitation in IDLE's docs be an acceptable solution?

    @serwy
    Mannequin

    serwy mannequin commented Mar 11, 2012

    I made a mistake in msg155410. The results in the message are WITHOUT "unicodeerror.diff" applied. When it is applied, the IDLE shell gives:

    >>> '\U00010330'
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        '\U00010330'
    UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 1-1: Non-BMP character not supported in Tk
    Traceback (most recent call last):
    ** IDLE Internal Exception: 
      File "idlelib/run.py", line 98, in main
        ret = method(*args, **kwargs)
      File "idlelib/run.py", line 305, in runcode
        print_exception()
      File "idlelib/run.py", line 168, in print_exception
        print(line, end='', file=efile)
      File "idlelib/rpc.py", line 599, in __call__
        value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
      File "idlelib/rpc.py", line 214, in remotecall
        return self.asyncreturn(seq)
      File "idlelib/rpc.py", line 245, in asyncreturn
        return self.decoderesponse(response)
      File "idlelib/rpc.py", line 265, in decoderesponse
        raise what
    ValueError: max() arg is an empty sequence

    I will need to rework the rpc_marshal_exception patch.

    @loewis
    Mannequin

    loewis mannequin commented Mar 12, 2012

    > Martin: I disagree with the approach of raising a UnicodeEncodeError
    > if IDLE can't render the output of a user's program, especially when
    > the program would otherwise run without error if ran from outside of
    > IDLE.

    This is really an independent issue, and I'd appreciate if people would
    treat it as such. *This* issue is about IDLE crashing, not about how
    Tkinter deals with non-BMP characters.

    So if the RPC exception marshalling works, and can resolve this issue,
    I'll be ready to commit this and close this issue. Opening another issue
    dealing with the more general Tk problem would be fine with me.

    I don't *quite* understand what you are proposing. If it is that
    Tkinter always replaces non-BMP characters in string objects with
    question marks, then I'm opposed. Tkinter can't know whether the
    replacement is an acceptable loss or not; errors should never pass
    silently.

    If you are suggesting that IDLE's write function should write
    a question mark instead of raising an exception: perhaps, but
    a) I'd rather use REPLACEMENT CHARACTER instead of QUESTION MARK
    b) I'd really try to find out first whether Tcl unknowingly
    supports UTF-16, at least for rendering.

    @serwy
    Mannequin

    serwy mannequin commented Mar 12, 2012

    Having had some time to work on it, the bug is in the unicodeerror.diff patch. If the string is empty then max(s) will raise a ValueError. This is easy to trigger by generating an exception at the python prompt, like "1/0".
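The failure mode is easy to reproduce outside IDLE: `max()` on an empty string raises `ValueError`. Below is a minimal sketch of a guarded check; the function name `ensure_bmp` is hypothetical, and the error message is copied from the traceback earlier in this thread rather than from the revised patch itself.

```python
def ensure_bmp(s):
    """Return s unchanged, or raise if it contains a non-BMP character."""
    # Testing `s` first avoids calling max() on an empty string,
    # which would raise "ValueError: max() arg is an empty sequence".
    if s and max(s) > "\uffff":
        pos = next(i for i, ch in enumerate(s) if ch > "\uffff")
        raise UnicodeEncodeError(
            "UCS-2", s, pos, pos + 1,
            "Non-BMP character not supported in Tk",
        )
    return s
```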

    Attached is a revised version of Martin's patch.

    @serwy
    Mannequin

    serwy mannequin commented Mar 12, 2012

    Martin, I got your message after I submitted the last one.

    This issue does involve IDLE crashing, but it's not crashing due to non-BMP characters. That is a side-effect of a bigger issue with pythonw.exe. See bpo-13582 for more information.

    IDLE's shell output has a gross deficiency due to Tkinter's inability to handle Unicode properly. Why penalize a program for running in IDLE just because IDLE can't write something to the text widget? This is precisely what your approach is doing - making IDLE an even more restricted environment than it needs to be.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 14, 2012

    New changeset c06b94c5c609 by Andrew Svetlov in branch 'default':
    Issue bpo-14200: Idle shell crash on printing non-BMP unicode character.
    http://hg.python.org/cpython/rev/c06b94c5c609

    @asvetlov
    Contributor

    The patch escapes every non-ASCII char, while it would be better to escape only non-BMP ones.

    Will be done after bpo-14304

    @serwy
    Mannequin

    serwy mannequin commented Mar 14, 2012

    Andrew, please reopen this issue. Your committed patch does not work if IDLE is not using the subprocess.

        >>> got_ahsa = "\N{GOTHIC LETTER AHSA}"
        >>> got_ahsa
        Traceback (most recent call last):
          File "<pyshell#1>", line 1, in <module>
            got_ahsa
          File "idlelib/PyShell.py", line 1255, in write
            return self.shell.write(s, self.tags)
          File "idlelib/PyShell.py", line 1233, in write
            'Non-BMP character not supported in Tk')
    UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 1-1: Non-BMP character not supported in Tk

    However, it does work when IDLE uses a subprocess.

    @serwy
    Mannequin

    serwy mannequin commented Mar 14, 2012

    Attached is a patch that undoes Andrew's and fixes the issue in a simple manner. The tcl_unicode_range.patch from bpo-12342 has already been applied, so catching ValueError within IDLE is all that is now needed.

    @serwy
    Mannequin

    serwy mannequin commented Mar 14, 2012

    Attached is a better implementation of the patch. The Percolator which ultimately handles writing to the Text widget should intercept the ValueError due to non-BMP characters. The issue14200_rev1.patch fixes this issue and bpo-13153.

    @serwy serwy mannequin reopened this Mar 14, 2012
    @asvetlov
    Contributor

    Roger, you are missing the difference between calling print() and evaluating an expression in Python's interactive mode.
    While the latter should be unicode-escaped, the former should raise an error; we need to behave the same way as the console Python interactive session does.

    For the rest, I like your simplification. And IDLE should definitely work in both subprocess and embedded modes; thank you for that point.

    I'll make the final (I hope) patch a bit later.

    @asvetlov asvetlov self-assigned this Mar 14, 2012
    @serwy
    Mannequin

    serwy mannequin commented Mar 15, 2012

    Andrew, I do admit that I have a lot to learn about Unicode support in Python, for instance with its error-handling and its corner cases.

    On Windows Vista, I do see that print() behaves differently than evaluating the expression. An exception is raised for:
    print('\N{GOTHIC LETTER AHSA}')

    On Linux, I see the character print as ? in xterm and as a '?' when evaluated. In gnome-terminal (Ubuntu Mono font) it prints as a box containing the code point in hex. No exception is raised.

    I do see your point. The patch I provided always substitutes the unsupported character with its full expansion. Returning to a point earlier raised by Martin, using REPLACEMENT CHARACTER instead would be better. It would make the behavior of IDLE more consistent with xterm and gnome-terminal, although it would cause IDLE to hide errors if the program ran from a Windows console instead of IDLE.

    Given that Windows and Linux (Ubuntu) behave differently, I'd rather let IDLE mimic the behavior of a Linux console than a Windows console.

    @asvetlov
    Contributor

    I consulted with Martin at the PyCon sprint and he suggested the solution which I'm following: to split print and the REPL (read-eval-print loop).

    Output passed to the print() function is encoded with sys.stdout.encoding.

    UTF encodings were invented to support any character.
    Linux is usually set up to use the utf-8 encoding by default (see the LANG environment variable). There are no encoding issues with that.

    xterm (a rather old terminal), which you use, cannot print non-BMP characters and replaces them with question marks.
    The modern gnome-terminal prints those symbols very well.

    Let's return to non-UTF terminal encodings.
    If a character cannot be encoded, Python throws UnicodeEncodeError.
    Here's an example:

    andrew@tiktaalik ~/p/cpython> bash -c "LANG=C; ./python"
    Python 3.3.0a1+ (qbase qtip tip tk:c3ce8a8e6c9c+, Mar 14 2012, 15:54:55) 
    [GCC 4.6.1] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> '\U00010340'
    '\U00010340'
    >>> print('\U00010340')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character '\U00010340' in position 0: ordinal not in range(128)
    >>> 

    As you can see, I have switched LANG to the C (alias for ASCII) locale.

    Eval printed the value with unicode escaping, but the print call raises an error.
    This happens because Python's REPL calls sys.displayhook.
    See http://docs.python.org/dev/library/sys.html#sys.displayhook for details.
    That code escapes unicode if the terminal doesn't support it.

    The same for Windows, OS X and any other platform.
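The escaping behaviour described above can be approximated in a few lines. This is a simplified sketch modelled on the pseudo-code in the `sys.displayhook` documentation, not the interpreter's actual implementation: the name `escaping_displayhook` and its `stream` parameter are assumptions for illustration, and it leaves out the assignment to `builtins._`.

```python
import sys


def escaping_displayhook(value, stream=None):
    """Write repr(value), escaping what the stream's encoding cannot hold."""
    if value is None:
        return
    stream = sys.stdout if stream is None else stream
    text = repr(value)
    try:
        stream.write(text)
    except UnicodeEncodeError:
        # Fall back to backslash escapes for unencodable characters.
        encoding = getattr(stream, "encoding", None) or "ascii"
        escaped = text.encode(encoding, "backslashreplace").decode(encoding)
        stream.write(escaped)
    stream.write("\n")
```

A plain `print()` has no such fallback, which is why the same character raises `UnicodeEncodeError` there while evaluating it at the prompt does not.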

    @loewis
    Mannequin

    loewis mannequin commented Mar 15, 2012

    > On Windows Vista, I do see that print() behaves differently than
    > evaluating the expression. An exception is raised for:
    > print('\N{GOTHIC LETTER AHSA}')

    As is for most other characters not supported in your OEM code
    page, e.g. (likely) '\N{GREEK SMALL LETTER ALPHA}'

    > On Linux, I see the character print as ? in xterm and as a '?' when
    > evaluated. In gnome-terminal (Ubuntu Mono font) it prints as a box
    > containing the code point in hex. No exception is raised.

    That's because your terminal output encoding is UTF-8. If you change
    your locale to C, or any other locale that doesn't cover full Unicode
    (e.g. de_DE.ISO-8859-1, if supported on your Linux installation),
    you get the same behavior on Linux as you do on Windows.

    > Given that Windows and Linux (Ubuntu) behave differently

    That's not a given, see above.
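The dependence on the output encoding, rather than on the operating system, can be checked directly with `str.encode`. A small sketch follows; the helper name `can_encode` is hypothetical.

```python
def can_encode(text, encoding):
    """Report whether text is representable in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


alpha = "\N{GREEK SMALL LETTER ALPHA}"
ahsa = "\N{GOTHIC LETTER AHSA}"
results = {
    ("alpha", "iso-8859-1"): can_encode(alpha, "iso-8859-1"),
    ("alpha", "utf-8"): can_encode(alpha, "utf-8"),
    ("ahsa", "ascii"): can_encode(ahsa, "ascii"),
    ("ahsa", "utf-8"): can_encode(ahsa, "utf-8"),
}
```

Only the UTF-8 entries come out true, which matches the point that a UTF-8 terminal accepts both characters while a narrower locale rejects them on any platform.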

    @serwy
    Mannequin

    serwy mannequin commented Mar 15, 2012

    I stand corrected. Thank you for the information.

    The behavior of the console depends on its locale. IDLE has no facility for changing the locale of the PyShell window. Should this option be included somewhere?

    @asvetlov
    Contributor

    I think that doesn't make sense.

    @serwy
    Mannequin

    serwy mannequin commented Mar 15, 2012

    The Tkinter Text widget is the output for the IDLE shell and it has the limitation imposed by Tcl/Tk of not handling non-BMP unicode characters.

    Is the following reasonable: The IDLE shell console has a locale of "non-BMP utf8"?

    If so, would it be reasonable to add a menu item to switch locales for the shell? This amounts to adding some extra code to OutputWindow's write() to raise encoding errors if the string contains unsupported characters, and possibly replacing characters to work around Tcl/Tk's non-BMP limitation.

    @loewis
    Mannequin

    loewis mannequin commented Mar 15, 2012

    > The behavior of the console depends on its locale. IDLE has no
    > facility for changing the locale of the PyShell window. Should this
    > option be included somewhere?

    It may be remotely desirable to be able to set the terminal encoding
    in IDLE for debugging purposes. But it's unrelated to the issue at
    hand.

    @loewis
    Mannequin

    loewis mannequin commented Mar 15, 2012

    > Is the following reasonable: The IDLE shell console has a locale of
    > "non-BMP utf8"?

    [BMP utf8]
    That's indeed the approach that Andrew and I were discussing.
    Unfortunately, there is no codec for it yet. We were discussing
    to add a "utf8bom" encoding to Python. This is a medium-sized
    project, though (and again out of scope for this issue).

    > If so, would it be reasonable to add a menu item to switch locales
    > for the shell? This amounts to adding some extra code to
    > OutputWindow's write() to raise encoding errors if the string
    > contains unsupported characters, and possibly replacing characters to
    > work around Tcl/Tk's non-BMP limitation.

    Please open a separate issue for this.

    @serwy
    Mannequin

    serwy mannequin commented Mar 15, 2012

    Martin, you are right. I created a separate issue bpo-14326.

    Let me know what I can do to help.

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 25, 2012

    New changeset 89878808f4ce by Andrew Svetlov in branch 'default':
    Issue bpo-14200 — now displayhook for IDLE works in non-subprocess mode as well as subprecess.
    http://hg.python.org/cpython/rev/89878808f4ce

    @asvetlov
    Contributor

    After experiments with non-BMP characters I figured out:
    - non-BMP symbols are processed differently by the Tk text widgets (Entry, Text etc.). For example, Entry can display a non-BMP character with spaces after the glyph, while Text reduces the symbol to BMP. Editing is also weird.
    - it looks like the Tk event loop passes non-BMP input directly to tkinter as is.

    Obviously Tk does not support non-BMP chars by spec, yet it does not strictly reject them. Details are implementation-specific and depend not only on the Tcl/Tk version but also on the concrete widget class.

    After that my position is:
    - implement a utf8-bmp codec
    - a first implementation of utf8-bmp can be done in pure Python using the utf-8 codec plus checks. This is simple enough, though it may degrade performance. That doesn't matter if the codec is used only for converting relatively short strings in Tk widgets.
    - use it in the _tkinter AsObj/FromObj functions with 'replace' mode.
    - my approach is slightly incompatible in the dark corner of non-BMP chars (not supported, but silently passed to the low-level platform API with weird transformations on the way). I think this is not a problem at all.
    - with the utf-8-bmp codec IDLE can still use 'strict' mode in the .write function (print and displayhook, I mean) to keep the current behavior, or use escaping for displayhook and 'replace' for regular print. In the implementation of bpo-14326 we can use a directly specified encoding for print as well.

    I experimented on an Ubuntu box, but I'm pretty sure the same result can be reproduced on OS X and Windows as well. Also, we need Tk to be cross-platform, so replacing non-BMP characters is not bad; it is a good solution until Tcl/Tk processes non-BMP natively.
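The "utf-8 codec and checks" idea from the list above could look roughly like this in pure Python. This is a sketch under stated assumptions, not the codec proposed in bpo-14304: the function name `utf8_bmp_encode` is hypothetical, only the 'strict' and 'replace' modes are handled, and U+FFFD is assumed as the replacement character.

```python
def utf8_bmp_encode(text, errors="strict"):
    """Encode text as UTF-8 while allowing only BMP characters."""
    if errors == "strict":
        for i, ch in enumerate(text):
            if ord(ch) > 0xFFFF:
                raise UnicodeEncodeError(
                    "utf-8-bmp", text, i, i + 1,
                    "non-BMP character not supported",
                )
    elif errors == "replace":
        # Swap each non-BMP character for REPLACEMENT CHARACTER.
        text = "".join(
            ch if ord(ch) <= 0xFFFF else "\ufffd" for ch in text
        )
    else:
        raise ValueError("unsupported error mode: %r" % (errors,))
    return text.encode("utf-8")
```

The per-character scan is linear in the string length, which fits the remark that the cost is acceptable for the relatively short strings passed to Tk widgets.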

    @asvetlov
    Contributor

    Closing again. Now IDLE works fine in both subprocess and in-process mode.

    Further work on non-BMP support can continue after a codec for it is implemented: bpo-14304

    For now I'd like to close this as "good enough for now".
    At least IDLE doesn't crash on printing anything.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022