classification
Title: IDLE freezes when opening a file with astral characters
Type: behavior Stage: resolved
Components: IDLE Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: terry.reedy Nosy List: David E. Franco G., eryksun, serhiy.storchaka, terry.reedy
Priority: normal Keywords:

Created on 2017-04-07 20:15 by David E. Franco G., last changed 2019-10-04 12:07 by serhiy.storchaka. This issue is now closed.

Messages (6)
msg291289 - (view) Author: David E. Franco G. (David E. Franco G.) Date: 2017-04-07 20:15
While wandering the internet I found some Unicode characters in a random comment, and out of curiosity I wanted to use Python (3.6.1) to see their values. I copied those characters and pasted them into IDLE, and it just closed without warning or explanation.

The characters in question are: 🔫 🔪
(chr(128299) and chr(128298))

Then I put them in a script:

    text = "🔫 🔪"
    print(text)

and tried to load it, but instead IDLE opened a new empty script window, again without apparent reason. I could not close that window and had to kill the process.

I tried the same with IDLE in Python 2.7.13. For the first one I got

    Unsupported characters in input

which at least is something. After changing the script a little to

    # -*- coding: utf-8 -*-
    text = u"🔫 🔪"
    print(text)

it works without problems and prints correctly.

Also, running the script in interactive mode (python -i myscript.py) works as expected and I get their numbers (which I put above).

So why is that? Please fix it.
msg291313 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-04-08 02:57
Your report touches on four different issues.

1. IDLE uses tkinter for the GUI and tkinter wraps tcl/tk 8.6 or earlier, which only handles Basic Multilingual Plane chars (codepoints 0000 to FFFF). The gun and knife chars are out of that range.  I have read that 8.7 will handle all unicode chars and that it might be released this year.  This will fix multiple IDLE issues (#21084, #22742, #13153, #14304).
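
As a rough illustration of point 1 (a sketch, not IDLE or tkinter code), the characters that fall outside the BMP can be found by comparing each code point with 0xFFFF. The helper name `astral_chars` is made up for this example.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def astral_chars(text):
    """Return the characters in text whose code points lie above the BMP."""
    return [c for c in text if ord(c) > BMP_MAX]


sample = "\U0001f52b \U0001f52a"  # the gun and knife characters from the report
print([hex(ord(c)) for c in astral_chars(sample)])  # ['0x1f52b', '0x1f52a']
```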

2. Until 3.3, CPython came in two kinds of builds with different internal unicode representations: narrow and wide.  Narrow builds used pairs of surrogate chars in the BMP to represent chars not in the BMP.  So your '3 char' string is stored as 5 chars.

    # -*- coding: utf-8 -*-
    text = u"🔫 🔪"
    for c in text:
        print(ord(c))

prints, on 2.7.13 on Windows,

    55357
    56619
    32
    55357
    56618

Windows releases and a few *nix releases used narrow builds.  I have Windows, so I could copy and paste your string and run the above.  Most *nix releases used wide builds and could not.

In 3.3, CPython switched to a flexible representation used on all releases.  Like earlier wide builds, it does not use surrogate chars, and the above no longer loads into a text widget, let alone run.
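
The five numbers printed on the narrow build follow from the standard UTF-16 surrogate arithmetic. The sketch below reproduces them on any Python 3; the function name `surrogate_pair` is chosen for this example only.

```python
def surrogate_pair(code_point):
    """Split an astral code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


print(surrogate_pair(0x1F52B))  # (55357, 56619)
print(surrogate_pair(0x1F52A))  # (55357, 56618)
```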

3. Any exception not caught by IDLE generates a traceback that appears on the console used to start IDLE, if there is one.  Otherwise, it is lost.  Running "<python3> -m idlelib" in a console and loading the code above results in a traceback ending with "_tkinter.TclError: character U+1f52b is above the range (U+0000-U+FFFF) allowed by Tcl".

I have thought about (and previously commented about) trying to put such messages in a tk message box when IDLE and tk are still functioning well enough to do so before stopping.

4. It appears that IDLE opens a file that is not already open by creating a blank editor window and then loading the file into it.  In this case, when the load fails, a frozen window is left that blocks IDLE closing.  This nasty behavior, which I verified, is new to me, so I am leaving this issue open specifically for this bug.

A likely fix is to catch TclError in the appropriate place (not yet known to me), recover (delete the new EditorWindow and ???), display an explanatory error message, and continue when user hits [OK].

We already have a test string.  If needed, I will try to separate 'opening a file name' from 'reading the contents', so a test can use a simple mock file object.

If a file were opened and read *before* creating a new window, there would be less cleanup to do.
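
The "read before creating a window" idea could look roughly like the sketch below. It is not IDLE's code and does not touch tkinter: `read_for_editor` is a hypothetical helper that reads the whole file first and returns an error message instead of text when a character is outside the BMP, so a caller could show the message and skip creating an editor window. It uses an up-front check rather than catching TclError, which is a different technique from the fix proposed above.

```python
import os
import tempfile


def read_for_editor(path, encoding="utf-8"):
    """Read path completely; return (text, None) or (None, error message).

    Hypothetical helper: the error case is any char Tk 8.6 cannot display.
    """
    with open(path, encoding=encoding) as f:
        text = f.read()
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, char in enumerate(line):
            if ord(char) > 0xFFFF:
                return None, "%s: line %d, column %d: U+%04X is outside the BMP" % (
                    path, lineno, col, ord(char))
    return text, None


# Demo with the test string from this issue, written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "myscript.py")
    with open(script, "w", encoding="utf-8") as f:
        f.write('text = "\U0001f52b \U0001f52a"\nprint(text)\n')
    text, error = read_for_editor(script)
    print(error)
```

Because nothing is created until the read succeeds, there is no half-built window to clean up in the failure case.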
msg291317 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-04-08 04:23
In Windows IDLE 3.x, you should still be able to print a surrogate transcoding, which sneaks the native UTF-16LE encoding around tkinter:

    def transurrogate(s):
        b = s.encode('utf-16le')
        return ''.join(b[i:i+2].decode('utf-16le', 'surrogatepass') 
                       for i in range(0, len(b), 2))

    def print_surrogate(*args, **kwds):
        new_args = []
        for arg in args:
            if isinstance(arg, str):
                new_args.append(transurrogate(arg))
            else:
                new_args.append(arg)
        return print(*new_args, **kwds)


    >>> s = '\U0001f52b \U0001f52a'
    >>> print_surrogate(s)
    🔫 🔪

Pasting non-BMP text into IDLE fails on Windows for a similar reason. Tk naively encodes the surrogate codes in the native Windows UTF-16 text as invalid UTF-8, which I've seen referred to as WTF-8 (Wobbly). I see the following error when I run IDLE using python.exe (i.e. with a console) and paste "🔫 🔪" into the window:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 1: invalid continuation byte

This is the second byte of the WTF-8 encoding:

    >>> transurrogate('"\U0001f52b').encode('utf-8', 'surrogatepass')
    b'"\xed\xa0\xbd\xed\xb4\xab'

Hackiness aside, I don't think it's worth supporting this just for Windows.
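
The WTF-8 bytes shown above can be checked with the standard codecs alone. This is only a sketch of the round trip, with made-up variable names: strict UTF-8 decoding rejects the bytes, 'surrogatepass' accepts them as two lone surrogates, and a pass through UTF-16 recombines those into one astral character.

```python
wtf8 = b'"\xed\xa0\xbd\xed\xb4\xab'  # the bytes from the example above

try:
    wtf8.decode('utf-8')  # strict decoding rejects encoded surrogates
except UnicodeDecodeError as exc:
    print(exc)

# 'surrogatepass' lets the two encoded surrogates through as lone code points.
lone = wtf8.decode('utf-8', 'surrogatepass')

# Encoding the lone surrogates as UTF-16 forms a valid pair, which the
# strict UTF-16 decoder turns back into a single astral character.
fixed = lone.encode('utf-16le', 'surrogatepass').decode('utf-16le')

print(len(lone), len(fixed), hex(ord(fixed[1])))  # 3 2 0x1f52b
```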
msg292607 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-04-29 23:09
#30209 has a file that cannot be opened and another description of symptoms.
msg293518 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-05-11 22:16
Eryk> I tried print(transurrogate(s)) from the editor and it worked, printing '🔫 🔪' (though not nearly as pretty as here in Firefox).

I have previously thought of scanning strings before inserting into Text widgets and converting astral chars to \U000nnnnn form, with a color tag to distinguish such displays from the literal ten chars.   This should work fine for read-only output to Shell but not easily in Editor.  On Windows, I would just use transurrogate instead, and that would also work for loading files into the editor.
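
A minimal sketch of the \U000nnnnn conversion described above, without the color tag (which would need a Text widget). The name `escape_astral` is hypothetical and this is not code from IDLE.

```python
def escape_astral(text):
    """Replace each non-BMP char with its ten-char \\Uxxxxxxxx escape."""
    return ''.join(
        c if ord(c) <= 0xFFFF else '\\U%08x' % ord(c)
        for c in text
    )


print(escape_astral('\U0001f52b \U0001f52a'))  # \U0001f52b \U0001f52a
```

BMP characters pass through unchanged, so ordinary text is unaffected.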

I believe transurrogate could potentially solve pasting on Windows by intercepting <<Paste>> events.

Serhiy, are you aware of Eryk's astral char workaround for Windows?  Could it be incorporated into _tkinter, where a string's kind is accessible?
msg353934 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-10-04 12:07
This looks like a duplicate of issue21084. Fixed by PR 16545 (see issue13153). It is virtually Eryk's workaround, but at the Tkinter level.
History
Date User Action Args
2019-10-04 12:07:50 serhiy.storchaka set status: open -> closed
resolution: fixed
messages: + msg353934
stage: needs patch -> resolved
2017-05-11 22:16:55 terry.reedy set nosy: + serhiy.storchaka
messages: + msg293518
versions: + Python 3.7
2017-04-29 23:09:12 terry.reedy set messages: + msg292607
2017-04-29 23:08:39 terry.reedy link issue30209 superseder
2017-04-08 04:23:47 eryksun set nosy: + eryksun
messages: + msg291317
2017-04-08 02:57:35 terry.reedy set title: IDLE got unexpexted bahavior when trying to use some characters -> IDLE freezes when opening a file with astral characters
messages: + msg291313
stage: needs patch
2017-04-07 20:15:46 David E. Franco G. create