
Author terry.reedy
Recipients ezio.melotti, ned.deily, serhiy.storchaka, taleinat, terry.reedy
Date 2019-09-24.23:15:00
Recap: IDLE 3.x on Windows exits with UnicodeDecodeError when pasting into an editor, grep, or shell window a non-BMP (astral) character such as
𐒢 '\U000104a2', 𝐇, 🐍 '\U0001F40D', or 
🐱 '\U0001F431' UTF-8 b'\xf0\x9f\x90\xb1', UTF-16LE b'\x3d\xd8\x31\xdc'.  Display issues are not directly part of this issue.
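
For reference, the byte sequences quoted above can be reproduced in any Python 3. This is only an illustrative sketch; the helper name utf16_surrogates is made up for this example and is not part of IDLE or tkinter.

```python
# Code points taken from the recap above.
samples = ["\U000104a2", "\U0001F40D", "\U0001F431"]

def utf16_surrogates(ch):
    """Return the (high, low) UTF-16 code units for one astral character."""
    offset = ord(ch) - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

for ch in samples:
    high, low = utf16_surrogates(ch)
    # Show the code point, both encodings, and the surrogate pair in hex.
    print(f"U+{ord(ch):05X}",
          ch.encode("utf-8"),
          ch.encode("utf-16-le"),
          f"{high:04X} {low:04X}")
```

Each astral character needs four bytes in UTF-8 and one surrogate pair (four bytes) in UTF-16LE.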

The exact error message has varied with the python version, but all likely result from the same error.

3.2 msg145581: traceback PyShell.main(), root.mainloop(), tk.mainloop().

  'utf8' codec can't decode bytes in position 1-2: invalid continuation byte

3.3 msg177750: traceback starts with two calls in new runpy module.
  'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

3.6 to now: same traceback
  'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The initial byte is 0xed regardless of which astral char above is pasted.  Tal, if the problem were utf-8 decoding utf-16le bytes, the initial byte in the error message for astral chars would (usually) be 0xd8, and there would be problems with BMP chars also.
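
A quick check of that argument, as a sketch only: the helper utf8_decode_error is a name invented here. It decodes utf-16le bytes as utf-8 and reports what fails.

```python
def utf8_decode_error(data):
    """Return the UnicodeDecodeError raised by strict utf-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err
    return None

# Astral char: utf-16le bytes are 3d d8 31 dc; 0x3d is ASCII, 0xd8 fails.
astral_err = utf8_decode_error("\U0001F431".encode("utf-16-le"))
print(astral_err.object[astral_err.start] == 0xD8, astral_err.start)

# A non-ASCII BMP char fails too, which is not what IDLE users see.
bmp_err = utf8_decode_error("\u00e9".encode("utf-16-le"))
print(bmp_err is not None)
```

So a utf-16le mix-up would report 0xd8, not 0xed, and would not be limited to astral characters.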

In msg145584, I speculated that the problem might be trying to decode a now illegal utf-8 encoding of a surrogate character.  In msg145605, Ezio said that the first surrogate would be '\ud801' and showed that the 2.7 utf-8 'encoding' of that is b'\xed\xa0\x81' and that trying to decode that gives the 3.2 error above, but with '0-1' instead of '1-2'.  (0xed is the utf-8 start byte for code points U+D000 through U+DFFF, and the continuation bytes that map into the surrogate block, U+D800 through U+DFFF, are now invalid.)  Today, b'\xed\xa0\x81'.decode('utf-8') gives exactly the current message above.
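
The surrogate hypothesis can be reproduced in Python 3 without Python 2.7 by using the 'surrogatepass' error handler to produce the old-style bytes. This is a sketch of the check, not of anything tkinter itself does.

```python
high = "\ud801"  # first surrogate of '\U000104a2'

# Strict utf-8 refuses lone surrogates; 'surrogatepass' emits the
# 3-byte form that Python 2.7 produced.
old_style = high.encode("utf-8", "surrogatepass")
print(old_style)

try:
    old_style.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)
    print(message)
```

The printed message names byte 0xed at position 0, matching the error reported for 3.6 and later.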

In msg254165, I noted that pasting copied astral chars into a plain Text widget works in the sense that there is no error.  (For me, 𐒢 is replaced by two replacement chars and the others are shown without colors, but this depends on OS and font.) I just verified the same for Entry widgets in IDLE dialogs and the Font settings sample text.  As Serhiy said in msg254165, Left x 2 is needed to move back past the char and Backspace x 2 to delete it.  (For me, only 1 Right is needed to move forward past the char.)  But Serhiy also showed that once an astral char *is* displayed, it cannot be properly retrieved.

So the question is, if Windows puts utf-16le surrogates on the clipboard, and they can be pasted and, to some extent, displayed in a Text, why is something trying to utf-8 decode the utf-8 encoding of each surrogate when pasting into IDLE's augmented text?

In msg207381, Serhiy claimed "The root of issue is in converting strings when passed to Python-implemented callbacks. When a text is pasted in IDLE window, the callback is called (for highlighting). ...".  He goes on to explain that tcl *does* encode surrogates to modified utf-8 before passing them to callbacks and claimed that tkinter_pythoncmd_args_2.patch should fix this.
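
To illustrate what "modified utf-8" means here, the following pure-Python sketch turns bytes in which each surrogate is encoded separately back into a normal str. It is not what tkinter_pythoncmd_args_2.patch does (that patch is C code in _tkinter); the function name decode_tcl_utf8 is hypothetical.

```python
def decode_tcl_utf8(data):
    """Decode utf-8 bytes that may contain separately encoded surrogates.

    Assumption: surrogates always come in valid high/low pairs; an
    unpaired surrogate makes the final utf-16 decode raise an error.
    """
    # Step 1: accept the 3-byte surrogate sequences strict utf-8 rejects.
    text = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through utf-16 so each pair becomes one code point.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

# '\U0001F431' as two 3-byte sequences, one for each surrogate.
pasted = b"\xed\xa0\xbd\xed\xb0\xb1"
print(ascii(decode_tcl_utf8(pasted)))
```

Ordinary utf-8 text passes through this function unchanged.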

Disabling Colorizer is not enough to allow astral pasting.  Whatever Serhiy's patch did 5 years ago, my copy does not work now.  See PR 16365.

Tal, we augment the x11 paste callback in pyshell.fix_x11_paste.  There is no unittest for it, and any further change would have to avoid breaking it.

I have thought about replacing the paste callback with clipboard_get, but worried that we might not be able to replicate what the system-specific tcl/tk/C code does.  That sometimes includes displaying the actual astral character. I presume that tcl just passes the clipboard bytes to the graphics system, which we cannot do from python.

Anyway, you have shown that clipboard_get does not currently work as we might want.  From what Serhiy has said, char *s points to invalid utf-8 bytes.
Date                 User         Action  Args
2019-09-24 23:15:00  terry.reedy  set     recipients: + terry.reedy, taleinat, ned.deily, ezio.melotti, serhiy.storchaka
2019-09-24 23:15:00  terry.reedy  set     messageid: <>
2019-09-24 23:15:00  terry.reedy  link    issue13153 messages
2019-09-24 23:15:00  terry.reedy  create