[tkinter] surrogate pairs in Tcl/Tk string when pasting an emoji in a text widget #86484

Closed
ronaldoussoren opened this issue Nov 10, 2020 · 18 comments
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes OS-mac topic-tkinter type-bug An unexpected behavior, bug, or error

Comments

@ronaldoussoren
Contributor

BPO 42318
Nosy @terryjreedy, @ronaldoussoren, @ned-deily, @serhiy-storchaka, @miss-islington, @Pixmew
PRs
  • bpo-42318: Fix support of non-BMP characters in Tkinter on macOS #23281
  • [3.9] bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281). #23784
  • [3.8] [3.9] bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281). (GH-23784) #23787
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-12-25.22:36:49.380>
    created_at = <Date 2020-11-10.21:22:18.169>
    labels = ['OS-mac', 'type-bug', 'expert-tkinter', '3.9', '3.10', '3.8']
    title = '[tkinter] surrogate pairs in Tcl/Tk string when pasting an emoji in a text widget'
    updated_at = <Date 2020-12-25.22:36:49.380>
    user = 'https://github.com/ronaldoussoren'

    bugs.python.org fields:

    activity = <Date 2020-12-25.22:36:49.380>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-12-25.22:36:49.380>
    closer = 'serhiy.storchaka'
    components = ['macOS', 'Tkinter']
    creation = <Date 2020-11-10.21:22:18.169>
    creator = 'ronaldoussoren'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 42318
    keywords = ['patch']
    message_count = 18.0
    messages = ['380715', '380879', '380881', '380906', '380907', '380908', '380909', '380910', '380917', '380918', '380919', '380920', '380924', '380953', '381019', '383077', '383088', '383775']
    nosy_count = 6.0
    nosy_names = ['terry.reedy', 'ronaldoussoren', 'ned.deily', 'serhiy.storchaka', 'miss-islington', 'Pixmew']
    pr_nums = ['23281', '23784', '23787']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue42318'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @ronaldoussoren
    Contributor Author

    As mentioned in msg380552: I get a SyntaxError with the message "'utf-8' codec can't encode characters in position 7-12: surrogates not allowed" when I paste a smiley emoji into an IDLE interactive shell and try to execute that line, for example:

    >>> print("😀")

    The error is likely due to a surrogate pair being present in the UTF-8 representation of a Tcl/Tk string.

    It should be possible to work around this in _tkinter.c:unicodeFromTclStringAndSize by merging surrogate pairs.

    This is with:

    • Python 3.10
    • macOS 11 (arm64)
    • Tk 8.6.10

    With Tk 8.6.8 (as included in the macOS installers on python.org) printing won't work at all, as mentioned in bpo-42225.
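
To illustrate what merging surrogate pairs means, here is a minimal Python sketch. It is not the C code in _tkinter.c; the helper name merge_surrogate_pairs is hypothetical, and it only shows the arithmetic that combines a high and a low surrogate into one code point.

```python
def merge_surrogate_pairs(text: str) -> str:
    """Combine each high/low surrogate pair into a single non-BMP character.

    Hypothetical helper for illustration; lone surrogates are left untouched.
    """
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        # High surrogates are U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(code_point))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


print(ascii(merge_surrogate_pairs("\ud83d\ude00")))
```

Leaving lone surrogates alone is a design choice for this sketch; a real fix could just as well reject them.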

    @ronaldoussoren ronaldoussoren added OS-mac topic-tkinter type-bug An unexpected behavior, bug, or error labels Nov 10, 2020
    @serhiy-storchaka
    Member

    Just to be sure, what is the result of pasting and executing the following code on Tk 8.6.8 and 8.6.10?

        print(ascii("😀"))

    @serhiy-storchaka serhiy-storchaka added 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels Nov 13, 2020
    @serhiy-storchaka
    Member

    Well, it is likely the same syntax error. Then what will the following print

        print(ascii(input()))

    when you paste 😀 and press Enter?

    @ronaldoussoren
    Contributor Author

    With 8.6.8 both "hang", in that the Shell window no longer accepts input. I've checked that print(input()) works when I don't use an emoji.

    Interestingly enough, pasting print(ascii("😀")) into an edit window does work: I can continue editing, but the display is messed up. It looks like:

    print(ascii("😀"))print(ascii("

    But with the first two identifiers coloured and the two other identifiers black. Saving the file results in the expected file contents.

    @Pixmew
    Mannequin

    Pixmew mannequin commented Nov 13, 2020

    Well, for me in Python 3.9.0,
    print("😀") prints 😀 and
    print(ascii("😀")) prints '\U0001f600'

    It does not raise the error "'utf-8' codec can't encode characters in position 7-12: surrogates not allowed" as you are suggesting.

    @ronaldoussoren
    Contributor Author

    With 8.6.10:

    >>> print(ascii("😀")) raises the SyntaxError mentioned earlier
    >>> print(ascii(input())) works and prints:
    '\udced\udca0\udcbd\udced\udcb8\udc84'

    In an editor window I don't get spurious text, but syntax colouring is a bit off: the text after the closing quote is coloured as if it were inside the string literal. That continues for the characters on the next line.

    @ronaldoussoren
    Contributor Author

    @Pixmew: I get this error with Tk 8.6.10 on macOS 11. With Tk 8.6.8 on macOS 10.15 (from the python.org installer) I get the behaviour described in msg380906.

    8.6.10 is the version of Tk we'd like to switch to for the "universal2" installer; it is the latest release in the 8.6.x branch and contains numerous bug fixes.

    The "Intel" installers (the ones currently on python.org) will continue to use Tk 8.6.8 due to build issues on macOS 10.9 with newer Tk versions.

    @ronaldoussoren
    Contributor Author

    BTW, unicodeFromTclStringAndSize() basically undoes the special treatment of \0 in Modified UTF-8 [1]. That page says that all known implementations of MUTF-8 treat surrogate pairs the same way as CESU-8 [2], which is UTF-8 with characters outside the BMP encoded as surrogate pairs that are then converted to UTF-8.

    Neither encoding is currently supported by Python.

    [1] https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
    [2] https://en.wikipedia.org/wiki/CESU-8
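
Although Python has no CESU-8 codec, the encoding described above can be approximated with the existing "surrogatepass" error handler. The following is a rough sketch under that assumption; the function names cesu8_encode and cesu8_decode are hypothetical, and the sketch does not handle the MUTF-8 two-byte encoding of \0.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: UTF-16 code units, each encoded as UTF-8."""
    utf16 = text.encode("utf-16-be")
    # Treat every 16-bit code unit (including surrogates) as its own code point.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 by accepting encoded surrogates, then merging the pairs."""
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges each high/low surrogate pair.
    return with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


encoded = cesu8_encode("😀")
print(encoded)
print(ascii(cesu8_decode(encoded)))
```

The six bytes produced for a single emoji are two three-byte sequences, one per surrogate, instead of the four-byte sequence that standard UTF-8 uses.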

    @serhiy-storchaka
    Member

    Well, try copying 😀 (or other text with a color emoji) to the clipboard and run the following code:

    import tkinter
    root = tkinter.Tk()
    print(ascii(root.clipboard_get()))

    @ronaldoussoren
    Contributor Author

    When I assign root.clipboard_get() to "v" I get:

    >>> print(ascii(v))
    '\udced\udca0\udcbd\udced\udcb8\udc84'
    >>> print(v)
    ??????

    This is with Tk 8.6.10.
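
The escaped value above can be unpacked by hand. This sketch assumes the string came from decoding Tcl's bytes with the "surrogateescape" error handler, which is consistent with the \udcXX pattern but is an assumption here; the variable names are illustrative only.

```python
v = "\udced\udca0\udcbd\udced\udcb8\udc84"

# Undo surrogateescape to get back the raw bytes that Tcl/Tk handed over.
raw = v.encode("utf-8", "surrogateescape")
print(raw)

# The bytes are two 3-byte sequences, each encoding one UTF-16 surrogate.
pair = raw.decode("utf-8", "surrogatepass")
print(ascii(pair))

# A round trip through UTF-16 merges the pair into the real character.
emoji = pair.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
print(ascii(emoji))
```

The result is U+1F604, a single non-BMP character that Tk delivered as a UTF-8-encoded surrogate pair.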

    @serhiy-storchaka
    Member

    You can ignore msg380917. It was written before I read msg380908. Now I have the needed information. Thank you.

    @serhiy-storchaka
    Member

    One more question: what do you see if you print '\udcf0\udc9f\udc98\udc80' in IDLE?

    @ronaldoussoren
    Contributor Author

    > One more question: what do you see if you print '\udcf0\udc9f\udc98\udc80' in IDLE?

    This prints a smiley emoji, as does printing chr(128516).
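
For reference, the two values tested here are different characters. The sketch below assumes the \udcXX string is a "surrogateescape" rendering of raw UTF-8 bytes; presumably it displays as an emoji because those bytes reach Tk as valid UTF-8, but that is not confirmed in this thread.

```python
s = "\udcf0\udc9f\udc98\udc80"

# Undo surrogateescape: each \udcXX becomes the raw byte 0xXX.
raw = s.encode("utf-8", "surrogateescape")
print(raw)

# These four bytes are ordinary UTF-8 for one non-BMP character.
decoded = raw.decode("utf-8")
print(ascii(decoded))

# chr(128516) is a neighbouring but distinct emoji.
print(ascii(chr(128516)))
```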

    @terryjreedy
    Member

    Yash, this is specifically a macOS issue. Printing astral chars in tkinter/IDLE on Windows and Linux has 'worked' (details not important) for over a year.

    @serhiy-storchaka
    Member

    New changeset a26215d by Serhiy Storchaka in branch 'master':
    bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281)
    a26215d

    @serhiy-storchaka
    Member

    Oh, the fix is not backported yet.

    Automatic backporting does not work because of renames in the supporting test library.

    @serhiy-storchaka
    Member

    New changeset 28bf6ab by Serhiy Storchaka in branch '3.9':
    [3.9] bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281). (GH-23784)
    28bf6ab

    @serhiy-storchaka
    Member

    New changeset 4d840e4 by Miss Islington (bot) in branch '3.8':
    [3.8] bpo-42318: Fix support of non-BMP characters in Tkinter on macOS (GH-23281). (GH-23784) (GH-23787)
    4d840e4

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022