[tkinter] surrogate pairs in Tcl/Tk string when pasting an emoji in a text widget #86484
As mentioned in msg380552: I get a SyntaxError with the message "'utf-8' codec can't encode characters in position 7-12: surrogates not allowed" when I paste a smiley emoji into an IDLE interactive shell and try to execute that line, for example using:
The error is likely due to a surrogate pair being present in the UTF-8 representation of a Tcl/Tk string. It should be possible to work around this in _tkinter.c:unicodeFromTclStringAndSize by merging surrogate pairs. This is with:
With Tk 8.6.8 (as included in the macOS installers on python.org) printing won't work at all, as mentioned in bpo-42225.
Just to be sure, what is the result of pasting and executing the following code on Tk 8.6.8 and 8.6.10? print(ascii("😀"))
Well, it is likely the same syntax error. Then what does print(ascii(input())) print when you paste 😀 and press Enter?
With 8.6.8 both "hang", in that the Shell window no longer accepts input. I've checked that Interestingly enough, pasting print(ascii("😀"))print(ascii(" But with the first two identifiers coloured and the two other identifiers black. Saving the file results in the expected file contents.
Well, for me in Python 3.9.0 it does not raise the error "'utf-8' codec can't encode characters in position 7-12: surrogates not allowed" that you describe.
With 8.6.10: >>> print(ascii("😀")) raises the SyntaxError mentioned earlier
>>> print(ascii(input())) works and prints:
'\udced\udca0\udcbd\udced\udcb8\udc84'
In an editor window I don't get spurious text, but syntax colouring is a bit off: the text after the closing quote is coloured as if it were inside the string literal. That continues for the characters on the next line.
@Pixmew: I get this error with Tk 8.6.10 on macOS 11. With Tk 8.6.8 on macOS 10.15 (from the python.org installer) I get the behaviour described in msg380906. 8.6.10 is the version of Tk we'd like to switch to for the "universal2" installer; it is the latest release in the 8.6.x branch and contains numerous bug fixes. For the "Intel" installers (the ones currently on python.org) we'll continue to use Tk 8.6.8 due to build issues on macOS 10.9 with newer Tk versions.
BTW, unicodeFromTclStringAndSize() basically undoes the special treatment of \0 in Modified UTF-8 [1]. That page says that all known implementations of MUTF-8 treat surrogate pairs the same way as CESU-8 [2], which is UTF-8 with characters outside the BMP encoded as surrogate pairs that are then converted to UTF-8. Neither encoding is currently supported by Python. [1] https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
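To make the CESU-8 description above concrete, here is a minimal Python sketch. The helper name cesu8_encode is hypothetical (it is not part of Python or _tkinter); it only illustrates the encoding rule quoted above: split each non-BMP character into a UTF-16 surrogate pair, then encode each surrogate as if it were an ordinary three-byte UTF-8 sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder (hypothetical helper, not a stdlib API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the astral code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets the utf-8 codec emit the 3-byte
                # form for a lone surrogate instead of raising.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# U+1F604 is chr(128516), the code point mentioned later in this thread.
print(cesu8_encode("\U0001F604"))  # b'\xed\xa0\xbd\xed\xb8\x84'
```

These six bytes match the six escaped values ('\udced\udca0\udcbd\udced\udcb8\udc84') reported elsewhere in this thread, which is consistent with Tk handing over a CESU-8-style sequence.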
Well, try copying 😀 (or other text with a colour emoji) to the clipboard and run the following code:
import tkinter
root = tkinter.Tk()
print(ascii(root.clipboard_get()))
When I assign root.clipboard_get() to "v" I get:
>>> print(ascii(v))
'\udced\udca0\udcbd\udced\udcb8\udc84'
>>> print(v)
??????
This is with Tk 8.6.10.
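For illustration only, the value shown above can be turned back into the intended character in pure Python. This is a sketch of the "merge surrogate pairs" idea, not the actual fix in _tkinter.c; the helper name merge_escaped_surrogates is hypothetical, and it assumes the string contains only surrogateescape'd bytes and ordinary characters.

```python
def merge_escaped_surrogates(s: str) -> str:
    """Hypothetical helper: undo surrogateescape, then merge surrogate pairs."""
    # Step 1: recover the raw bytes that surrogateescape smuggled into the str.
    raw = s.encode("utf-8", "surrogateescape")
    # Step 2: decode, allowing the 3-byte encodings of lone surrogates.
    halves = raw.decode("utf-8", "surrogatepass")
    # Step 3: a round trip through UTF-16 joins each high/low pair
    # into a single astral code point.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


v = '\udced\udca0\udcbd\udced\udcb8\udc84'
fixed = merge_escaped_surrogates(v)
print(ascii(fixed))  # '\U0001f604'
```

An unpaired surrogate would still make the final decode raise, so a real implementation would need to decide how to handle that case.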
One more question: what do you see if you print '\udcf0\udc9f\udc98\udc80' in IDLE?
This prints a smiley emoji, likewise for printing chr(128516).
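As a side note, the escaped string in the question above corresponds to the ordinary four-byte UTF-8 encoding of U+1F600, which may be why it displays correctly. The short check below only demonstrates that byte relationship; it makes no claim about how _tkinter passes the string to Tk.

```python
escaped = '\udcf0\udc9f\udc98\udc80'

# surrogateescape maps each U+DCxx code point back to the byte 0xxx.
raw = escaped.encode("utf-8", "surrogateescape")
print(raw)  # b'\xf0\x9f\x98\x80'

# Those bytes are already valid standard UTF-8, so a strict decode works.
text = raw.decode("utf-8")
print(ascii(text))  # '\U0001f600'
```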
Yash, this is specifically a macOS issue. Printing astral chars in tkinter/IDLE on Windows and Linux has 'worked' (details not important) for over a year.
Oh, the fix is not backported yet. Automatic backporting does not work because of renames in the supporting test library.