IDLE 3.x on Windows exits when pasting non-BMP unicode #57362

Closed
jbvsmo mannequin opened this issue Oct 11, 2011 · 77 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes topic-IDLE topic-tkinter topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@jbvsmo
Mannequin

jbvsmo mannequin commented Oct 11, 2011

BPO 13153
Nosy @terryjreedy, @vstinner, @taleinat, @ned-deily, @ezio-melotti, @aivarannamaa, @serhiy-storchaka, @animalize, @miss-islington
PRs
  • bpo-13153: Test IDLE Windows no-color astral paste fix #16363
  • bpo-13153: _tkinter part of tkinter_pythoncmd_args_2 for paste fix #16365
  • bpo-13153: Use OS native encoding for converting between Python and Tcl. #16545
  • [3.8] bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545) #16580
  • [3.7] bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545) #16581
Files
  • tkinter_nobmp_error.patch
  • tkinter_string_conv_3.patch
  • tkinter_pythoncmd_args.patch
  • tkinter_pythoncmd_args_2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2019-10-04.11:44:01.512>
    created_at = <Date 2011-10-11.20:01:32.297>
    labels = ['type-bug', 'expert-tkinter', '3.9', '3.8', 'expert-IDLE', '3.7', 'expert-unicode']
    title = 'IDLE 3.x on Windows exits when pasting non-BMP unicode'
    updated_at = <Date 2019-12-20.12:13:44.282>
    user = 'https://github.com/jbvsmo'

    bugs.python.org fields:

    activity = <Date 2019-12-20.12:13:44.282>
    actor = 'terry.reedy'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2019-10-04.11:44:01.512>
    closer = 'serhiy.storchaka'
    components = ['IDLE', 'Tkinter', 'Unicode']
    creation = <Date 2011-10-11.20:01:32.297>
    creator = 'JBernardo'
    dependencies = []
    files = ['28376', '31610', '33318', '34016']
    hgrepos = []
    issue_num = 13153
    keywords = ['patch']
    message_count = 77.0
    messages = ['145363', '145366', '145369', '145573', '145580', '145581', '145584', '145585', '145605', '145607', '145611', '145616', '145635', '155799', '155801', '155810', '155814', '155857', '177750', '177778', '177812', '179276', '182106', '182130', '182141', '182166', '182172', '182180', '182181', '182184', '182207', '182305', '182312', '192240', '192243', '192260', '193566', '194564', '194586', '194603', '194623', '196958', '196993', '201457', '207370', '207371', '207381', '210786', '254165', '254170', '318862', '319247', '348153', '352661', '353083', '353123', '353152', '353761', '353765', '353766', '353767', '353769', '353812', '353833', '353848', '353897', '353898', '353901', '353903', '353910', '353916', '353918', '353919', '353937', '353945', '358701', '358704']
    nosy_count = 9.0
    nosy_names = ['terry.reedy', 'vstinner', 'taleinat', 'ned.deily', 'ezio.melotti', 'Aivar.Annamaa', 'serhiy.storchaka', 'malin', 'miss-islington']
    pr_nums = ['16363', '16365', '16545', '16580', '16581']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13153'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @jbvsmo
    Mannequin Author

    jbvsmo mannequin commented Oct 11, 2011

    I was playing with some unicode chars on Python 3.2 (x64 on Windows 7), but when I pasted a char bigger than 0xFFFF, IDLE crashed without any error message.

    Example (works fine):
    >>> '\U000104a2'
    '𐒢'

    But, if I try to paste the above char, the window will instantly close.

    The interpreter uses 2 bytes per char (UTF-16) and I don't know if that's causing the problem (as a side note, why doesn't the default Windows build use 4-byte chars?).

    I can't check now with my Ubuntu install (UTF-32) if the problem persists.

    @ned-deily
    Member

    This is related to bpo-12342. The problem is that Tcl/Tk 8.5 (and earlier) do not support Unicode code points outside the BMP range as in this example. So IDLE will be unable to display such characters but it should not crash either.
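
    For readers unfamiliar with the BMP limit, here is a minimal sketch of how a code point above U+FFFF is split into two UTF-16 surrogates. The helper name `utf16_surrogate_pair` is hypothetical and is not part of IDLE or tkinter; it only illustrates the arithmetic.

```python
def utf16_surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogates for a non-BMP character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        raise ValueError("character is inside the BMP; no surrogates needed")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


high, low = utf16_surrogate_pair('\U000104a2')
print(hex(high), hex(low))
```

    A widget toolkit that only understands 16-bit code units sees these two values as separate "characters", which is where the trouble in this report starts.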

    @ned-deily ned-deily self-assigned this Oct 11, 2011
    @jbvsmo
    Mannequin Author

    jbvsmo mannequin commented Oct 11, 2011

    @ned

    That looks like a slightly different case. IDLE *can* print the char after you enter the '\Uxxxxxxxx' version of it.

    It just doesn't let you paste those characters...

    @ezio-melotti ezio-melotti added the type-bug An unexpected behavior, bug, or error label Oct 14, 2011
    @terryjreedy
    Member

    The current Windows build uses 2-byte unicode chars because that is what Windows does. In 3.3, all builds will use a new unicode implementation that uses 1, 2, or 4 bytes as needed. But I suspect we will still have the paste problem unless we can somehow bypass the tk limitation.
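
    On CPython 3.3 or later, the variable-width storage can be observed roughly as follows. This is a sketch that relies on `sys.getsizeof`, whose values are a CPython implementation detail; the helper name `bytes_per_char` is made up for illustration.

```python
import sys


def bytes_per_char(ch):
    """Estimate the per-character storage CPython uses for strings like ch."""
    # Growing the string by one character isolates the per-character cost
    # from the fixed size of the object header.
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


for ch in ('a', '\u20ac', '\U000104a2'):
    print(f'U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character')
```

    Other Python implementations may store strings differently, so the numbers should not be relied on outside CPython.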

    Printing a Python string to the screen does not seem to involve conversion to a tk string. Or else tk blindly copies surrogate pairs to Windows even though it cannot create them.

    In any case, true window-closing crashes (as opposed to an error traceback) are obnoxious bugs that we try to fix if possible. I verified this on my 64-bit Win 7 system. Thanks for the report. Feel free to look into the code if you can.

    @terryjreedy terryjreedy added type-crash A hard crash of the interpreter, possibly with a core dump and removed type-bug An unexpected behavior, bug, or error labels Oct 15, 2011
    @jbvsmo
    Mannequin Author

    jbvsmo mannequin commented Oct 15, 2011

    Just for comparison, on Python 2.7.1 (x32 on Windows 7) it's possible to paste the char (but can't use it) and a nice error is given.

    >>> u'𐒢'
    Unsupported characters in input

    So the problem was partially solved but something might have happened with the 3.x port...

    Searching both source trees, I can see the following block was commented out in Python 3.2 but not in Python 2.7 (maybe someone removed someone else's bug fix?) and an assert was added.

    #--- Lines 605 to 613 of PyShell.py

    assert isinstance(source, str)
    #                       v-- on Python2.7 it is types.UnicodeType instead
    #if isinstance(source, str):
    #    from idlelib import IOBinding
    #    try:
    #        source = source.encode(IOBinding.encoding)
    #    except UnicodeError:
    #        self.tkconsole.resetoutput()
    #        self.write("Unsupported characters in input\n")
    #        return

    I uncommented those lines, removed the assert and deleted pycache for fresh bytecode but the error keeps happening.

    This function runsource() is only called after the return key is pressed so the bug was introduced on another part of the program.

    I'll search further but it's hard to do that without traceback of the error.

    (Maybe runit() is the problem because it seems to build the line and call runsource(line))

    ------
    PS: @terry Reedy
    That looks nice to have different lengths for chars but what will be the impact on performance? Indexing will still be in constant time?

    @jbvsmo
    Mannequin Author

    jbvsmo mannequin commented Oct 15, 2011

    Just to complete my monologue:
    Here's the traceback from running IDLE in cmd line.

    C:\Python32\Lib\idlelib>python -i idle.py
    Traceback (most recent call last):
      File "idle.py", line 11, in <module>
        idlelib.PyShell.main()
      File "C:\Python32\Lib\idlelib\PyShell.py", line 1429, in main
        root.mainloop()
      File "C:\Python32\Lib\tkinter\__init__.py", line 1009, in mainloop
        self.tk.mainloop(n)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid continuation byte

    Not very meaningful, but better than nothing... It looks like some of the traceback is missing, and this one points to tkinter.

    @terryjreedy
    Member

    [Yes, indexing will still be O(1), though I personally consider that less important than most make it to be. Consistency across platforms and total time and space performance of typical apps should be the concern. There is ongoing work on improving the new implementation. Some operations already take less space and run faster.]

    The traceback may very well be helpful. It implies that copying a supplemental char does not produce proper utf-8 encoded bytes. Or if it does, tkinter (or tk underneath it) does not recognize them. But then the problem should be the initial byte, not the continuation bytes, which are the same for all chars and which all have 10 for their two high order bits. See
    https://secure.wikimedia.org/wikipedia/en/wiki/Utf-8
    for a fuller explanation.

    Line 1009 is the definition of Misc.mainloop(). I believe self.tk represents the embedded tcl interpreter, which is a black box from Python's viewpoint. Perhaps we should wrap the call with

    try:
        self.tk.mainloop(n)
    except Exception as e:
        <print error message with all info attached to e before exiting>

    This should catch any miscellaneous crashes which are not otherwise caught and maybe turn the crash issues into bug reports -- the same way that running from the command line did. (It will still be good to catch what we can at error sites and give better, more specific messages.) (What I am not familiar with is how the command line interpreter might turn a tcl error into a python exception and why IDLE does not.)
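
    A minimal sketch of such a wrapper, written as a standalone function so it can be tried without a Tk display. The name `run_with_safety_net` and the `stream` parameter are assumptions for illustration; nothing like this is claimed to exist in tkinter or idlelib.

```python
import sys
import traceback


def run_with_safety_net(mainloop, stream=None):
    """Call mainloop() and report any Python-level exception.

    Returns True if mainloop() finished normally, False otherwise.
    """
    if stream is None:
        stream = sys.stderr
    try:
        mainloop()
    except Exception:
        # Print a short notice plus the full traceback before giving up.
        print("IDLE has met an unexpected problem.", file=stream)
        traceback.print_exc(file=stream)
        return False
    return True


# Hypothetical usage with a real Tk root:
#     run_with_safety_net(root.mainloop)
```

    As noted later in the thread, this cannot help with hard crashes such as segfaults; it only covers exceptions that reach Python.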

    When I copy '𐒢' and paste into the command line interpreter or Notepad++, I get '??'. I am guessing that '??' represents a surrogate pair and that Windows encodes each surrogate separately. The result would be 'illegal' utf-8 with illegal continuation bytes. An application can choose to decode the 'illegal' utf-8 -- or not. Python can when errors='surrogateescape' (or something like that) is specified. It might be possible to access the raw undecoded bytes of the clipboard with the third-party pythonwin module. I do not know if there is any way to do so with tk.
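
    For reference, Python 3 has two error handlers that are relevant to this guess. The sketch below shows how each treats the three bytes that a separately encoded high surrogate would produce; the byte string is the one discussed later in this thread.

```python
data = b'\xed\xa0\x81'   # UTF-8-style encoding of the lone surrogate U+D801

# The strict default rejects the bytes.
try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print('strict:', exc.reason)

# 'surrogatepass' decodes the bytes to the surrogate code point itself.
passed = data.decode('utf-8', 'surrogatepass')
print(ascii(passed))

# 'surrogateescape' instead maps each undecodable byte to a lone low
# surrogate (U+DC80..U+DCFF), so the original bytes can be recovered later.
escaped = data.decode('utf-8', 'surrogateescape')
print(ascii(escaped))
print(escaped.encode('utf-8', 'surrogateescape') == data)
```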

    I wonder if tcl is calling back to Python for decoding and whether there was a change in the default for errors or the callback specification that would explain a change from 2.7 to 3.2.

    Ezio, do you know anything about these speculations?

    @ned-deily
    Member

    Thanks for the additional investigation. You don't see more in the traceback because the exception is occurring in the _tkinter C glue layer. I am able to reproduce the problem on some other platforms as well (e.g. Python 3.x on OS X with Carbon Tk 8.4). More later.

    @ezio-melotti
    Member

    Ezio, do you know anything about these speculations?

    Assuming that the non-BMP character is represented with two surrogates (\ud801\udca2) and that _tkinter tries to decode them independently, the error message ("invalid continuation byte") would be correct.

    Python 2 UTF-8 codec is more permissive and allows encoding/decoding of surrogates (this might also explain why it works on Python 2): 
    >>> u'\ud801'.encode('utf-8')
    '\xed\xa0\x81'
    >>> '\xed\xa0\x81'.decode('utf-8')
    u'\ud801'
    
    But on Python 3, trying to decode that results in an error:
    >>> b'\xed\xa0\x81'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

    But then the problem should be the initial byte, not the continuation
    bytes, which are the same for all chars and which all have 10 for
    their two high order bits.

    While it's true that all continuation bytes have the first two bits equal to '10', the converse is not always true. Some start bytes have additional restrictions on the continuation bytes. For example, even if the first two bits of 0xA0 (0b10100000) are '10', the valid continuation bytes for a sequence starting with 0xED are restricted to the range 80..9F.
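
    The restriction on the byte following 0xED can be checked directly with Python 3's strict decoder. A small sketch, with a hypothetical helper name:

```python
def decodes_as_utf8(data):
    """Return True if data is valid under Python 3's strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# 0xED followed by 0x80..0x9F encodes U+D000..U+D7FF, which is accepted.
print(decodes_as_utf8(b'\xed\x9f\xbf'))    # U+D7FF
# 0xED followed by 0xA0..0xBF would land in the surrogate range
# U+D800..U+DFFF, so the second byte is rejected as a continuation byte.
print(decodes_as_utf8(b'\xed\xa0\x80'))    # would be U+D800
print(decodes_as_utf8(b'\xed\xa0\x81'))    # would be U+D801
```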

    The fact that
    >>> '\U000104a2'
    '𐒢'
    works is because the input is all ASCII, so the decoding doesn't fail.

    [...]
    This should catch any miscellaneous crashes which are not otherwise
    caught and maybe turn the crash issues into bug reports -- the same
    way that running from the command line did.

    Having some "safety net" to catch all the unhandled exceptions seems like a good idea. This won't work in case of segfaults, but it's still better than nothing. I'm not sure what you mean by "turn them into bug reports" though.

    @ezio-melotti
    Member

    This can also be reproduced by doing:
    >>> print('\U000104a2'[0])
    ?
    and then copy/pasting the lone surrogate.
    The traceback is:
      [...]
      File "C:\Programs\Python32\Lib\tkinter\__init__.py", line 1009, in mainloop
        self.tk.mainloop(n)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

    @terryjreedy
    Member

    I'm not sure what you mean by "turn them into bug reports" though.

    In about the last month, there have been, I think, 4 reports about IDLE crashing (quitting unexpectedly with no error traceback). I would consider it preferable if it quit with an error traceback that gave as much info as available, or if there is none, just said "IDLE has met an unexpected problem.", perhaps followed by something like "Please note the circumstances and make a report on the tracker if there is none already."

    @ezio-melotti
    Member

    I would consider it preferable if it quit

    Note that if we catch the error there might be no reason for IDLE to quit (unless the error left IDLE in some invalid state).

    with an error traceback that gave as much info as available,

    That might scare newbies away.

    or if there is none, just said "IDLE has met an unexpected problem.",

    So this might be better for all the cases.

    perhaps followed by something like "Please note the circumstances and
    make a report on the tracker if there is none already."

    The first message could offer a "Report the problem" option that links to the tracker. In theory we could also have a way to auto-fill the tracker issue, but that might lead to duplicates.

    @ned-deily
    Member

    Just to be sure we're talking about the same thing here, my understanding is that the "missing traceback" issues referred to here are only an issue when IDLE is run as a stand-alone GUI program, such as can be done on Windows and with the OS X IDLE.app. In that case, the standard Python tracebacks from the interpreter written to stderr are not readily visible to the user. In the OS X IDLE.app case it does get captured in a system log. I'm not sure if that happens anywhere in the Windows cases. If IDLE is started from a terminal window or console window where stderr is displayed, this is not an issue.

    Further discussion about proposed improvements to IDLE diagnostics could be useful, but it is not germane to the specific bug here. It should be carried out elsewhere, possibly resulting in a feature request.

    @ned-deily
    Member

    Reassigning to Andrew to investigate solution similar to the one used in bpo-14200.

    @ned-deily ned-deily assigned astrand and unassigned ned-deily Mar 14, 2012
    @ned-deily ned-deily changed the title IDLE crash with unicode bigger than 0xFFFF IDLE crashes when pasting non-BMP unicode char on UCS-16 build Mar 14, 2012
    @ned-deily
    Member

    (Oops, wrong assignment!)

    @ned-deily ned-deily assigned asvetlov and unassigned astrand Mar 14, 2012
    @terryjreedy
    Member

    AFAIK, the big new feature of tcl/tk 9.0 is intended to be full unicode support. We can hope that 9.0 appears in time to be included in the 3.8 installers.

    Until then, I think filenames, user program output, and clipboard content should be checked for the presence of astral characters before being sent to a tk widget. For this issue, that means replacing the built-in <<Paste>> handler. Replace astral chars with \U000nnnnn escapes. If the widget is a Text, tag the escape as 'Astral' and color it with the code context colors to distinguish it from escapes originally in the string.

    Strings know their kind, but a request to expose that has been rejected. Pyshell currently compares the max codepoint to 'ffff'. But it appears that we can detect kind with an O(1) expression. For 3.6 and 3.7, "sys.getsizeof(s) == 76 + len(s)". For 3.8, "sys.getsizeof(s) == 48 + len(s)". Does anyone know why the difference?
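
    A portable sketch of the check-and-escape idea, independent of the `sys.getsizeof` constants quoted above. The names `ASTRAL`, `has_astral`, and `escape_astral` are hypothetical and do not refer to existing idlelib code; the check here is O(n) rather than O(1), and the tagging and coloring step is left out.

```python
import re

# Matches any code point above the Basic Multilingual Plane.
ASTRAL = re.compile('[\U00010000-\U0010ffff]')


def has_astral(s):
    """Return True if s contains a non-BMP character (O(n) scan)."""
    return max(s, default='') > '\uffff'


def escape_astral(s):
    """Replace each astral character with its \\UXXXXXXXX escape."""
    return ASTRAL.sub(lambda m: '\\U%08x' % ord(m.group()), s)


text = "cat = '\U0001f431'"
if has_astral(text):
    text = escape_astral(text)
print(text)
```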

    @terryjreedy
    Member

    Closed bpo-37614 in favor of this.

    We now have only Python with FSR and mostly only tcl 8.6 to worry about. But I presume the Windows clipboard still uses utf-16le. Experimenting with pasting 𐒢 or '𐒢', I usually get the 'ed' message as before, but with the quoted astral, IDLE sometimes hangs. If I wait before trying to close, I get a message from Windows about waiting or closing.

    Currently, an attempt to print an astral char, as opposed to paste, results in
    >>> print('\U00011111')
    Traceback (most recent call last):
      File "<pyshell#0>", line 1, in <module>
        print('\U00011111')
    UnicodeEncodeError: 'UCS-2' codec can't encode character '\U00011111' in position 0: Non-BMP character not supported in Tk
    Improving this is a separate issue, as is editing a .py file with an astral char in the name or text.

    @terryjreedy terryjreedy added the 3.9 only security fixes label Jul 19, 2019
    @terryjreedy terryjreedy changed the title IDLE 3.x on Windows crashes when pasting non-BMP unicode IDLE 3.x on Windows exits when pasting non-BMP unicode Jul 19, 2019
    @terryjreedy
    Member

    Another report today on idle-dev that pasting emoji exits IDLE.

    Serhiy, I applied the _tkinter part of your...args_2.patch to a branch of current master -- see serhiy_tkinter.patch. (Could push branch if helpful.)

    After recompiling _tkinter.c, pasting 🐱 still gives same error.

    @taleinat
    Contributor

    I can confirm that the crash from pasting these characters happens when trying to fetch the clipboard contents. We can override the built-in <<Paste>> event, but then we have to get the clipboard's contents directly, and the only portable way to do that in the stdlib is via Tkinter's clipboard_get(). (For a non-stdlib solution, check out pyperclip on PyPI.)

    clipboard_get(), which I assume calls what Tk uses internally to handle the <<Paste>> event, crashes in the C code with a UnicodeDecodeError. Here's a traceback from calling clipboard_get() with 🐱 in the clipboard (Windows 10, recent master branch, i.e. to be 3.9):

    Exception in Tkinter callback
    Traceback (most recent call last):
      File "C:\Users\Tal\dev\cpython\lib\tkinter\__init__.py", line 1885, in __call__
        return self.func(*args)
      File "C:\Users\Tal\dev\cpython\lib\idlelib\multicall.py", line 176, in handler
        r = l[i](event)
      File "C:\Users\Tal\dev\cpython\lib\idlelib\editor.py", line 618, in paste
        print(self.text.clipboard_get())
      File "C:\Users\Tal\dev\cpython\lib\tkinter\__init__.py", line 867, in clipboard_get
        return self.tk.call(('clipboard', 'get') + self._options(kw))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

    From a quick look, this appears to be happening in _tkinter.c, here:

    static PyObject *
    unicodeFromTclStringAndSize(const char *s, Py_ssize_t size)
    {
        PyObject *r = PyUnicode_DecodeUTF8(s, size, NULL);
        ...

    My guess is that Tk is passing the clipboard contents as-is, and we're simply not decoding it with the proper encoding (i.e. utf-16le on Windows).

    Is this something worth fixing / working around in Tkinter, e.g. by using a proper encoding depending on the platform for fetching clipboard contents? Or are we content to continue waiting for Tk to fix this?

    @terryjreedy
    Member

    Recap: IDLE 3.x on Windows exits with UnicodeDecodeError when pasting into editor, grep, or shell window a non-BMP astral character such as
    𐒢 '\U000104a2', 𝐇, 🐍 '\U0001F40D', or
    🐱 '\U0001F431' UTF-8 b'\xf0\x9f\x90\xb1', UTF-16LE b'\x3d\xd8\x31\xdc'. Display issues are not directly part of this issue.

    The exact error message has varied with the python version, but all likely result from the same error.

    3.2 msg145581: traceback PyShell.main(), root.mainloop(), tk.mainloop().

    'utf8' codec can't decode bytes in position 1-2: invalid continuation byte

    3.3 msg177750: traceback starts with two calls in new runpy module.
    'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

    3.6 to now: same traceback
    'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

    The initial byte is 0xed regardless of which astral char above is pasted. Tal, if the problem were utf-8 decoding utf-16le bytes, the initial byte in the error message for astral chars would (usually) be 0xd8, and there would be problems with BMP chars also.

    In msg145584, I speculated that the problem might be trying to decode a now illegal utf-8 encoding of a surrogate character. In msg145605, Ezio said that the first surrogate would be '\ud801' and showed that the 2.7 utf-8 'encoding' of that is b'\xed\xa0\x81' and that trying to decode that gives the 3.2 error above, but with '0-1' instead of '1-2'. (0xed is the utf-8 start byte for code points U+D000 through U+DFFF, and continuation bytes that map to the surrogate blocks, and some others, are now invalid.) Today, b'\xed\xa0\x81'.decode('utf-8') gives exactly the current message above.
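
    The 0xed versus 0xd8 argument can be reproduced in a few lines. The sketch below assumes, as the thread suggests, that each UTF-16 code unit is encoded to UTF-8 as if it were its own code point; the helper name `surrogate_by_surrogate_utf8` is hypothetical.

```python
def surrogate_by_surrogate_utf8(s):
    """Encode every UTF-16 code unit of s separately as UTF-8 (assumed behavior)."""
    utf16 = s.encode('utf-16-le')
    units = [int.from_bytes(utf16[i:i + 2], 'little')
             for i in range(0, len(utf16), 2)]
    return b''.join(chr(u).encode('utf-8', 'surrogatepass') for u in units)


for ch in ('\U000104a2', '\U0001f40d', '\U0001f431'):
    raw = surrogate_by_surrogate_utf8(ch)
    utf16le = ch.encode('utf-16-le')
    # The per-surrogate encoding always starts with 0xed, while raw
    # utf-16le has 0xd8 as its second byte for these characters.
    print(hex(ord(ch)), raw.hex(' '), '|', utf16le.hex(' '))
```

    Feeding `raw` to the strict decoder fails at position 0 on byte 0xed, which matches the message seen since 3.6.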

    In msg254165, I noted that pasting copied astral chars into a plain Text widget works in the sense that there is no error. (For me, 𐒢 is replaced by two replacement chars and the others are shown without colors, but this depends on OS and font.) I just verified the same for Entry widgets in IDLE dialogs and the Font settings sample text. As Serhiy said in msg254165, Left x 2 is needed to move back past the char and Backspace x 2 to delete it. (For me, only 1 Right is needed to move forward past the char.) But Serhiy also showed that once an astral char *is* displayed, it cannot be properly retrieved.

    So the question is, if Windows puts utf-16le surrogates on the clipboard, and they can be pasted and displayed to some extent in a Text, why is something trying to utf-8 decode the utf-8 encoding of each surrogate when pasting into IDLE's augmented text?

    In msg207381, Serhiy claimed "The root of issue is in converting strings when passed to Python-implemented callbacks. When a text is pasted in IDLE window, the callback is called (for highlighting). ...". He goes on to explain that tcl *does* encode surrogates to modified utf-8 before passing them to callbacks and claimed that tkinter_pythoncmd_args_2.patch should fix this.

    Disabling Colorizer is not enough to allow astral pasting. See PR 16363. Whatever Serhiy's patch did 5 years ago, my copy does not work now. See PR 16365.

    Tal, we augment the x11 paste callback in pyshell.fix_x11_paste. There is no unittest and we would have to not break this with further change.

    I have thought about replacing the paste callback with clipboard_get, but worried that we might not be able to replicate what the system-specific tcl/tk/C code does. That sometimes includes displaying the actual astral character. I presume that tcl just passes the clipboard bytes to the graphics system, which we cannot do from python.

    Anyway, you have shown that clipboard_get does not currently work as we might want. From what Serhiy has said, char *s points to invalid utf-8 bytes.

    @serhiy-storchaka
    Member

    I now have access to Windows (I did not have it 5 years ago), so I'm going to finish this issue if I have time.

    @serhiy-storchaka
    Member

    PR 16545 solves the problem by using OS specific methods for converting between Python and Tcl strings. It is not ideal, but is good enough for most real cases.

    Now you can paste, copy and print non-BMP characters. Code containing them may be displayed oddly, but the result of print looks OK.

    >>> '\N{PERSONAL COMPUTER}'
    '💻'
    >>> print('💻')
    💻

    As a side effect, printing '\udcf0\udc9f\udc92\udcbb' on Linux and '\ud83d\udcbb' on Windows should have the same effect as printing '\U0001f4bb'.
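
    Why both spellings can end up as the same character is easiest to see in pure Python. This is only a sketch of the equivalence using standard codecs; it does not reproduce what PR 16545 does in C, and the helper names are hypothetical. The escaped string used here is the surrogate-escaped form of the four UTF-8 bytes of U+1F4BB.

```python
ASTRAL_CHAR = '\U0001f4bb'   # PERSONAL COMPUTER


def from_surrogate_pair(s):
    """Join UTF-16 surrogates stored as separate code points."""
    return s.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


def from_escaped_bytes(s):
    """Recover text whose UTF-8 bytes were smuggled in via surrogateescape."""
    return s.encode('utf-8', 'surrogateescape').decode('utf-8')


print(from_surrogate_pair('\ud83d\udcbb') == ASTRAL_CHAR)
print(from_escaped_bytes('\udcf0\udc9f\udc92\udcbb') == ASTRAL_CHAR)
```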

    I do not know about macOS, but expect the same behavior as on Linux. Could anybody test please?

    @taleinat
    Contributor

    taleinat commented Oct 2, 2019

    Serhiy, this looks like a great step in the right direction!

    Tested on Win10 with PR python/issues-test-cpython#16545 (commit f4db0e7). Here is a copy/paste from an IDLE shell session:

    >>> '\N{PERSONAL COMPUTER}'
    'ð\x9f\x92»'
    >>> print('💻')
    SyntaxError: 'utf-8' codec can't encode characters in position 7-12: surrogates not allowed

    Note that in the first output, the second and third chars in the string aren't visible in IDLE; i.e. what is actually displayed is 'ð»'.

    @taleinat
    Contributor

    taleinat commented Oct 2, 2019

    Not sure if this helps, but a bit of experimentation brought this up:

    >>> '\N{PERSONAL COMPUTER}'.encode('utf-8')
    b'\xf0\x9f\x92\xbb'
    >>> 'ð\x9f\x92»'.encode('utf-16le')
    b'\xf0\x00\x9f\x00\x92\x00\xbb\x00'
    >>> 'ð\x9f\x92»'.encode('utf-16')
    b'\xff\xfe\xf0\x00\x9f\x00\x92\x00\xbb\x00'

    @taleinat
    Contributor

    taleinat commented Oct 2, 2019

    More info:

    >>> '\N{PERSONAL COMPUTER}'.encode('utf-8').decode('latin-1') == 'ð\x9f\x92»'
    True

    @serhiy-storchaka
    Member

    Sorry, I did not test the last version on Windows. There was a bug which caused using the Linux version on Windows. Now it should be fixed.

    @terryjreedy
    Member

    The revised PR appears to fix this and other issues, although the presence of astral chars in code being edited messes up tk's cursor positioning. Assuming that this cannot be changed, we could add the ability to replace astral chars with \U escapes.

    @serhiy-storchaka
    Member

    From the point of view of Tk, the astral character "💻" looks like either two invisible characters "\ud83d\udcbb" or as four characters "ð\x9f\x92»" (two of them are invisible). Thus this breaks editing the physical line past the astral character. We cannot do anything with this.

    It also breaks syntax highlighting up to 100 lines past the astral character. We can add a workaround for this, but I am not sure it is worth it. The solution could be complex and slow down the common case. In any case it is a different issue.

    File names with astral characters are now shown correctly in most cases. Astral characters are not shown in the title of the window; perhaps this is font dependent.

    Opening a file name with astral characters works in the command line, but it does not work via the file open dialog. This looks like a bug in Tk; we cannot work around it (or at least the possible workaround would be ugly and partial).

    @animalize
    Mannequin

    animalize mannequin commented Oct 3, 2019

    Thus this breaks editing the physical line past the astral character. We cannot do anything with this.

    I tried, it's sad the experience is not very good.

    @terryjreedy
    Member

    A week ago, I thought that the astral solution was to always replace with the \U escape. With this patch, we can and should send them to read-only text windows, and let the OS and font display it or a substitute. On Windows, at least, the emoji which beginners most often want to use get displayed.

    Elsewhere, we will have to check and do some follow-up patches. Using file names with astral chars results, on Windows, in six large boxes, and when the file is saved, it is saved in a new file with the boxes, not the original file. Such file names are not added to the recent files list, or maybe list boxes cannot handle them.

    Code is another issue. Astral chars in files could be replaced when read. Unfortunately, I believe some are legal identifier chars. On the clipboard, on Windows, astral chars become sequences of 6 surrogates.

    >>> r.clipboard_clear()
    >>> r.clipboard_append('🚀')
    >>> r.clipboard_get()
    '\udced\udca0\udcbd\udced\udcba\udc80'

    Perhaps we should try to intercept paste and replace such sequences with the \U escape.
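
    One way such a sequence could be turned back into the original character, sketched with standard codecs only. The function name `repair_clipboard_text` is hypothetical, and the sketch assumes the six code points are surrogate-escaped bytes of a per-surrogate UTF-8 encoding, as the clipboard_get output above suggests. It raises on input containing unpaired surrogates.

```python
def repair_clipboard_text(s):
    """Rebuild astral characters from surrogate-escaped per-surrogate UTF-8."""
    # 1. Undo surrogateescape to get the raw bytes back.
    raw = s.encode('utf-8', 'surrogateescape')
    # 2. Decode them while allowing surrogate code points.
    paired = raw.decode('utf-8', 'surrogatepass')
    # 3. Join each high/low surrogate pair into one astral character.
    return paired.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


fixed = repair_clipboard_text('\udced\udca0\udcbd\udced\udcba\udc80')
print(ascii(fixed))
# The result could then be shown as a \U escape if the widget cannot display it.
print('\\U%08x' % ord(fixed))
```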

    @serhiy-storchaka
    Member

    Thank you for your example Terry. There was one dubious place which I did not change because I did not know how to trigger the execution of it. Now the clipboard is fixed.

    @terryjreedy
    Member

    What do you mean by fixed? After deleting and remaking a pr_16545 branch, I see the same result for clipboard_get.

    @serhiy-storchaka
    Member

    What is the result of new tests?

    python.bat -m test -v -uall test_tk -m test_clipboard*
    

    @terryjreedy
    Member

    After remembering to recompile (sorry), the test passes and clipboard_get returns the rocket. Very nice, thank you.

    @serhiy-storchaka
    Member

    New changeset 06cb94b by Serhiy Storchaka in branch 'master':
    bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545)
    06cb94b

    @miss-islington
    Contributor

    New changeset 6c3fbbc by Miss Islington (bot) in branch '3.7':
    bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545)
    6c3fbbc

    @miss-islington
    Contributor

    New changeset dc19124 by Miss Islington (bot) in branch '3.8':
    bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545)
    dc19124

    @vstinner
    Member

    vstinner commented Oct 4, 2019

    bpo-13153: Use OS native encoding for converting between Python and Tcl. (GH-16545)

    WOW. That's huge. The issue with non-BMP characters has been fixed? Finally? The issue was haunting the bug tracker for at least 8 years!!!

    @taleinat
    Contributor

    taleinat commented Oct 4, 2019

    Indeed, Serhiy, you've done an amazing job with this change and it will greatly benefit many people.

    @aivarannamaa
    Mannequin

    aivarannamaa mannequin commented Dec 20, 2019

    >>> '\N{PERSONAL COMPUTER}'

    freezes IDLE 3.7.6 (64-bit, downloaded from python.org) on macOS 10.15

    Can it be because Tk 8.6.8 is still used there?

    @terryjreedy
    Member

    On Windows with 8.6.9, I see '\U0001f4bb' on 3.7.5 and '💻' on 3.8.0 and 3.9.0a0. I don't know why the difference as Serhiy's patch was backported. I will upgrade 3.7 and 3.8 and try again.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022