classification
Title: [tkinter] surrogate pairs in Tcl/Tk string when pasting an emoji in a text widget
Type: behavior Stage: resolved
Components: macOS, Tkinter Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Pixmew, miss-islington, ned.deily, ronaldoussoren, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2020-11-10 21:22 by ronaldoussoren, last changed 2020-12-25 22:36 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 23281 merged serhiy.storchaka, 2020-11-14 11:48
PR 23784 merged serhiy.storchaka, 2020-12-15 16:55
PR 23787 merged miss-islington, 2020-12-15 18:45
Messages (18)
msg380715 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-10 21:22
As mentioned in msg380552: I get a SyntaxError with the message "'utf-8' codec can't encode characters in position 7-12: surrogates not allowed" when I paste a smiley emoji into an IDLE interactive shell and try to execute that line, for example using:

>>> print("😀")

The error is likely due to a surrogate pair being present in the UTF-8 representation of a Tcl/Tk string.

It should be possible to work around this in _tkinter.c:unicodeFromTclStringAndSize by merging surrogate pairs. 

This is with:
- Python 3.10
- macOS 11 (arm64)
- Tk 8.6.10

With Tk 8.6.8 (as included in the macOS installers on python.org) printing won't work at all, as mentioned in bpo-42225.
msg380879 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-11-13 12:51
Just to be sure, what is the result of pasting and executing the following code on Tk 8.6.8 and 8.6.10?

    print(ascii("😀"))
msg380881 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-11-13 12:56
Well, it is likely the same syntax error. Then what does the following print

    print(ascii(input()))

when you paste 😀 and press Enter?
msg380906 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-13 16:09
With 8.6.8 both "hang", in that the Shell window no longer accepts input. I've checked that ``print(input())`` works when I don't use an emoji. 

Interestingly enough, pasting ``print(ascii("😀"))`` into an edit window does work, I can continue editing, but the display is messed up. It looks like:

   print(ascii("😀"))print(ascii("

But with the first two identifiers coloured and the two other identifiers black. Saving the file results in the expected file contents.
msg380907 - (view) Author: Yash Shete (Pixmew) * Date: 2020-11-13 16:28
For me, on Python 3.9.0,
print("😀") prints 😀 and
print(ascii("😀")) prints '\U0001f600'.

It does not raise the error "'utf-8' codec can't encode characters in position 7-12: surrogates not allowed" that you describe.
msg380908 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-13 16:35
With 8.6.10:

>>> print(ascii("😀")) raises the SyntaxError mentioned earlier
>>> print(ascii(input())) works and prints:
'\udced\udca0\udcbd\udced\udcb8\udc84'

In an editor window I don't get spurious text, but syntax colouring is a bit off: the text after the closing quote is coloured as if it were inside the string literal. That continues for the characters on the next line.
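
The escaped output above can be reproduced without Tk. The following is a minimal sketch, assuming the bytes Tk handed back were the six bytes implied by the ascii() output (this is an inference from the escapes, not something verified against Tk itself). It shows how "surrogateescape" decoding yields that string and why encoding a source line containing it fails at positions 7-12.

```python
# Bytes assumed to come from Tk 8.6.10, reconstructed from the ascii() output above.
raw = b"\xed\xa0\xbd\xed\xb8\x84"

# Strict UTF-8 rejects encoded surrogates; "surrogateescape" maps each
# offending byte 0xNN to the lone surrogate U+DCNN.
escaped = raw.decode("utf-8", "surrogateescape")
print(ascii(escaped))

# Embedding that text in a source line and encoding it strictly fails
# over the six escaped characters that follow the 7-character prefix.
line = 'print("' + escaped + '")'
error = None
try:
    line.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    # exc.end is exclusive, so the last offending index is exc.end - 1.
    print(exc.start, exc.end - 1, exc.reason)
```

The two 3-byte groups are the UTF-8 encodings of the surrogates U+D83D and U+DE04, which is the UTF-16 pair for a single non-BMP code point.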
msg380909 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-13 16:38
@Pixmew: I get this error with Tk 8.6.10 on macOS 11. With Tk 8.6.8 on macOS 10.15 (from the python.org installer) I get the behaviour described in msg380906.

8.6.10 is the version of Tk we'd like to switch to for the "universal2" installer. It is the latest release in the 8.6.x branch and contains numerous bug fixes.

The "Intel" installers (the ones currently on Python.org) will continue to use Tk 8.6.8 due to build issues on macOS 10.9 with newer Tk versions.
msg380910 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-13 17:04
BTW, unicodeFromTclStringAndSize() basically undoes the special treatment of \0 in Modified UTF-8 [1]. That page says that all known implementations of MUTF-8 treat surrogate pairs the same way as CESU-8 [2], which is UTF-8 with characters outside of the BMP encoded as surrogate pairs which are then converted to UTF-8.

Neither encoding is currently supported by Python.

[1] https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
[2] https://en.wikipedia.org/wiki/CESU-8
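
To make the two encodings concrete, here is a pure-Python sketch of a decoder that handles both the Modified UTF-8 NUL and CESU-8 surrogate pairs. The function name decode_tcl_bytes is hypothetical and this is not the code in _tkinter.c; it only illustrates the idea of merging surrogate pairs using standard codecs and error handlers.

```python
def decode_tcl_bytes(data: bytes) -> str:
    """Decode Modified UTF-8 / CESU-8 bytes into an ordinary str (illustrative sketch)."""
    # Modified UTF-8 stores NUL as the overlong sequence C0 80. The byte C0
    # never occurs in well-formed UTF-8, so this replacement is unambiguous.
    data = data.replace(b"\xc0\x80", b"\x00")
    # "surrogatepass" accepts UTF-8-encoded surrogates as lone surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # A UTF-16 round trip joins each high/low surrogate pair into one
    # code point; unpaired surrogates are passed through unchanged.
    utf16 = text.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le", "surrogatepass")


print(ascii(decode_tcl_bytes(b"a\xc0\x80b")))
print(ascii(decode_tcl_bytes(b"\xed\xa0\xbd\xed\xb8\x84")))
```

The UTF-16 round trip is just one convenient way to merge pairs from Python; a C implementation would more likely combine the two surrogate values arithmetically.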
msg380917 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-11-13 18:27
Well, try copying 😀 (or other text with a color emoji) to the clipboard and run the following code:

import tkinter
root = tkinter.Tk()
print(ascii(root.clipboard_get()))
msg380918 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-13 18:37
When I assign root.clipboard_get() to "v" I get:

>>> print(ascii(v))
'\udced\udca0\udcbd\udced\udcb8\udc84'
>>> print(v)
??????

This is with Tk 8.6.10.
msg380919 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-11-13 18:38
You can ignore msg380917. It was written before I read msg380908. Now I have the needed information. Thank you.
msg380920 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-11-13 18:46
And yet one question. What do you see if you print '\udcf0\udc9f\udc98\udc80' in IDLE?
msg380924 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-13 18:50
> And yet one question. What do you see if you print '\udcf0\udc9f\udc98\udc80' in IDLE?

This prints a smiley emoji, as does printing chr(128516).
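
For reference, the string in the question is the "surrogateescape" form of an ordinary 4-byte UTF-8 sequence. The sketch below only shows which bytes and code points are involved; it makes no claim about how Tkinter or IDLE process the string internally.

```python
# Each U+DCNN escape stands for the raw byte 0xNN.
escaped = "\udcf0\udc9f\udc98\udc80"
raw = escaped.encode("utf-8", "surrogateescape")
print(raw)

# Those four bytes are the regular UTF-8 encoding of one non-BMP character.
decoded = raw.decode("utf-8")
print(ascii(decoded), ord(decoded))

# chr(128516) is a different, neighbouring emoji code point.
print(ascii(chr(128516)))
```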
msg380953 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-11-14 02:16
Yash, this is specifically a macOS issue.  Printing astral chars in tkinter/IDLE on Windows and Linux has 'worked' (details not important) for over a year.
msg381019 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-11-15 16:17
New changeset a26215db11cfcf7b5f55cab9e91396761a0e0bcf by Serhiy Storchaka in branch 'master':
bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281)
https://github.com/python/cpython/commit/a26215db11cfcf7b5f55cab9e91396761a0e0bcf
msg383077 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-15 16:45
Oh, the fix is not backported yet.

Automatic backporting does not work because of renames in the supporting test library.
msg383088 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-15 18:45
New changeset 28bf6ab61f77c69b732a211c398ac882bf3f65f4 by Serhiy Storchaka in branch '3.9':
[3.9] bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281). (GH-23784)
https://github.com/python/cpython/commit/28bf6ab61f77c69b732a211c398ac882bf3f65f4
msg383775 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-12-25 22:35
New changeset 4d840e428ab1a2712f219c5e4008658cbe15892e by Miss Islington (bot) in branch '3.8':
[3.8] bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281). (GH-23784) (GH-23787)
https://github.com/python/cpython/commit/4d840e428ab1a2712f219c5e4008658cbe15892e
History
Date User Action Args
2020-12-25 22:36:49 serhiy.storchaka set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2020-12-25 22:35:49 serhiy.storchaka set messages: + msg383775
2020-12-15 18:45:10 miss-islington set nosy: + miss-islington
pull_requests: + pull_request22644
2020-12-15 18:45:09 serhiy.storchaka set messages: + msg383088
2020-12-15 16:55:27 serhiy.storchaka set pull_requests: + pull_request22641
2020-12-15 16:45:41 serhiy.storchaka set messages: + msg383077
2020-11-15 16:17:03 serhiy.storchaka set messages: + msg381019
2020-11-14 11:48:37 serhiy.storchaka set keywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request22175
2020-11-14 02:16:10 terry.reedy set nosy: + terry.reedy
messages: + msg380953
2020-11-13 18:50:14 ronaldoussoren set messages: + msg380924
2020-11-13 18:46:32 serhiy.storchaka set messages: + msg380920
2020-11-13 18:38:19 serhiy.storchaka set messages: + msg380919
2020-11-13 18:37:15 ronaldoussoren set messages: + msg380918
2020-11-13 18:27:47 serhiy.storchaka set messages: + msg380917
2020-11-13 17:04:27 ronaldoussoren set messages: + msg380910
2020-11-13 16:38:46 ronaldoussoren set messages: + msg380909
2020-11-13 16:35:31 ronaldoussoren set messages: + msg380908
2020-11-13 16:28:24 Pixmew set nosy: + Pixmew
messages: + msg380907
2020-11-13 16:09:58 ronaldoussoren set messages: + msg380906
2020-11-13 12:56:40 serhiy.storchaka set messages: + msg380881
2020-11-13 12:51:33 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg380879
versions: + Python 3.8, Python 3.9, Python 3.10
2020-11-10 21:22:18 ronaldoussoren create