Classification
Title: Copying emoji to Windows clipboard corrupts string in Python 3.3 and up
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.3

Process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Cees.Timmerman, amaury.forgeotdarc, ezio.melotti, vstinner
Priority: normal
Keywords:

Created on 2014-12-05 09:58 by Cees.Timmerman, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
test_clipboard_win.py Cees.Timmerman, 2014-12-05 10:11
Messages (4)
msg232188 - Author: Cees Timmerman (Cees.Timmerman) Date: 2014-12-05 09:58
# http://stackoverflow.com/a/25678113/819417
def copy(data):
    if not isinstance(data, unicode):
        data = data.decode('mbcs')
    OpenClipboard(None)
    EmptyClipboard()
    hCd = GlobalAlloc(GMEM_DDESHARE, 2 * (len(data) + 1))
    pchData = GlobalLock(hCd)
    wcscpy(ctypes.c_wchar_p(pchData), data)
    GlobalUnlock(hCd)
    SetClipboardData(CF_UNICODETEXT, hCd)
    CloseClipboard()

Emoji "📋" (\U0001f400) is copied as "🐀" (\U0001f4cb), or "📋." turns to "📋" (note the period).

It works fine in Python 3.2.5.
msg232189 - Author: Cees Timmerman (Cees.Timmerman) Date: 2014-12-05 10:11
A copy of my test program is at https://gist.github.com/CTimmerman/133cb80100357dde92d8
msg232190 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2014-12-05 10:32
(You swapped the Unicode values: \U0001f4cb is copied as \U0001f400.)

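On Windows, strings changed in 3.3: a str is now indexed by code point rather than by UTF-16 code unit, so a non-BMP character such as an emoji counts as one item instead of two. See "len() now always returns 1 for non-BMP characters" in https://docs.python.org/3/whatsnew/3.3.html.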
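A small sketch of the difference, assuming Python 3.3 or later. The variable names are illustrative only and do not come from the attached test program:

```python
# Clipboard emoji (a non-BMP character) followed by a period.
text = "\U0001f4cb."

# Python 3.3+ counts code points: one for the emoji, one for the period.
code_points = len(text)

# wchar_t on Windows is a 16-bit UTF-16 code unit; the emoji needs a
# surrogate pair, so it occupies two units.
utf16_units = len(text.encode("utf-16-le")) // 2

print(code_points, utf16_units)
```

A buffer sized from len(text) is therefore one wchar_t short for every non-BMP character in the string.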

GlobalAlloc takes a size in bytes, so the call should be based on the number of wchar_t units rather than on len(data), something like len(data.encode('utf-16')) + 2.
msg232191 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2014-12-05 10:36
Better to use the utf-16-le encoding:
  len(data.encode('utf-16-le')) + 2
Otherwise the encoded bytes start with the BOM (the bytes FF FE on a little-endian machine), which adds two bytes to the length.
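Putting the suggestion together, a possible corrected copy() might look like the sketch below. It is not the code from the attached test program: the helper name utf16_buffer_size, the use of GMEM_MOVEABLE instead of GMEM_DDESHARE, the explicit argtypes/restype declarations, and the use of memmove in place of wcscpy are all assumptions made for illustration, and the clipboard part only runs on Windows.

```python
import ctypes

CF_UNICODETEXT = 13      # standard clipboard format for UTF-16 text
GMEM_MOVEABLE = 0x0002   # assumption: used here instead of GMEM_DDESHARE


def utf16_buffer_size(text):
    """Bytes needed for text as NUL-terminated UTF-16-LE (no BOM)."""
    return len(text.encode("utf-16-le")) + 2


def copy(text):
    """Windows only: place text on the clipboard as CF_UNICODETEXT."""
    kernel32 = ctypes.windll.kernel32
    user32 = ctypes.windll.user32
    # Declare pointer-sized types so handles survive on 64-bit Python.
    kernel32.GlobalAlloc.argtypes = [ctypes.c_uint, ctypes.c_size_t]
    kernel32.GlobalAlloc.restype = ctypes.c_void_p
    kernel32.GlobalLock.argtypes = [ctypes.c_void_p]
    kernel32.GlobalLock.restype = ctypes.c_void_p
    kernel32.GlobalUnlock.argtypes = [ctypes.c_void_p]
    user32.OpenClipboard.argtypes = [ctypes.c_void_p]
    user32.SetClipboardData.argtypes = [ctypes.c_uint, ctypes.c_void_p]
    user32.SetClipboardData.restype = ctypes.c_void_p

    # Encode once; the two zero bytes are the wchar_t NUL terminator.
    raw = text.encode("utf-16-le") + b"\x00\x00"
    if not user32.OpenClipboard(None):
        raise ctypes.WinError()
    try:
        user32.EmptyClipboard()
        handle = kernel32.GlobalAlloc(GMEM_MOVEABLE, len(raw))
        locked = kernel32.GlobalLock(handle)
        ctypes.memmove(locked, raw, len(raw))  # copies exactly len(raw) bytes
        kernel32.GlobalUnlock(handle)
        user32.SetClipboardData(CF_UNICODETEXT, handle)
    finally:
        user32.CloseClipboard()
```

Copying the already-encoded bytes keeps the allocation size and the amount written in step, which is the mismatch that caused the corruption in the original report.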
History
Date                 User                Action  Args
2022-04-11 14:58:10  admin               set     github: 67188
2014-12-05 10:36:01  amaury.forgeotdarc  set     messages: + msg232191
2014-12-05 10:32:39  amaury.forgeotdarc  set     status: open -> closed; nosy: + amaury.forgeotdarc; messages: + msg232190; resolution: not a bug
2014-12-05 10:11:03  Cees.Timmerman      set     files: + test_clipboard_win.py; messages: + msg232189
2014-12-05 09:58:05  Cees.Timmerman      create