Issue 22999: Copying emoji to Windows clipboard corrupts string in Python 3.3 and up


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67188

classification

Title:	Copying emoji to Windows clipboard corrupts string in Python 3.3 and up
Type:	behavior	Stage:
Components:	Unicode	Versions:	Python 3.3

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	Cees.Timmerman, amaury.forgeotdarc, ezio.melotti, vstinner
Priority:	normal	Keywords:

Created on 2014-12-05 09:58 by Cees.Timmerman, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
test_clipboard_win.py	Cees.Timmerman, 2014-12-05 10:11

Messages (4)
msg232188	Author: Cees Timmerman (Cees.Timmerman)	Date: 2014-12-05 09:58
# http://stackoverflow.com/a/25678113/819417
def copy(data):
    if not isinstance(data, unicode):
        data = data.decode('mbcs')
    OpenClipboard(None)
    EmptyClipboard()
    hCd = GlobalAlloc(GMEM_DDESHARE, 2 * (len(data) + 1))
    pchData = GlobalLock(hCd)
    wcscpy(ctypes.c_wchar_p(pchData), data)
    GlobalUnlock(hCd)
    SetClipboardData(CF_UNICODETEXT, hCd)
    CloseClipboard()

Emoji "📋" (\U0001f400) is copied as "🐀" (\U0001f4cb), or "📋." turns into "📋" (note the missing period). It works fine in Python 3.2.5.
msg232189	Author: Cees Timmerman (Cees.Timmerman)	Date: 2014-12-05 10:11
A copy of my test program is at https://gist.github.com/CTimmerman/133cb80100357dde92d8
msg232190	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2014-12-05 10:32
(You swapped the Unicode values: \U0001f4cb is copied as \U0001f400.)

On Windows, strings changed in 3.3. See https://docs.python.org/3/whatsnew/3.3.html: "len() now always returns 1 for non-BMP characters".

The call to GlobalAlloc should use the number of wchar_t units, something like:
    len(data.encode('utf-16')) + 2
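To illustrate the size mismatch described above, here is a minimal sketch for Python 3.3 or later. The helper name `utf16_units` is hypothetical and not part of the reported program; it simply counts two 16-bit units for every character outside the BMP, which is how such characters are stored in a Windows wchar_t buffer.

```python
def utf16_units(text):
    """Count the 16-bit code units needed to store text as UTF-16.

    Characters above U+FFFF need a surrogate pair, i.e. two units.
    (Hypothetical helper, for illustration only.)
    """
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


clipboard_emoji = '\U0001f4cb'  # a non-BMP character

# What the original copy() allocates: based on len(), i.e. code points.
old_size = 2 * (len(clipboard_emoji) + 1)

# What the buffer has to hold: UTF-16 units plus one terminating NUL unit.
needed_size = 2 * (utf16_units(clipboard_emoji) + 1)

print(len(clipboard_emoji), utf16_units(clipboard_emoji))
print(old_size, needed_size)
```

On Python 3.3 and later the first size is smaller than the second whenever the string contains a non-BMP character, so the buffer allocated by the original copy() is too short for such strings.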
msg232191	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2014-12-05 10:36
Better to use the utf-16-le encoding:
    len(data.encode('utf-16-le')) + 2
Otherwise the encoded bytes start with the BOM (b'\xff\xfe').
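The difference between the two codecs can be checked without touching the clipboard. The following is only a sketch; the variable names are illustrative, and `alloc_size` stands for the byte count that would be passed to GlobalAlloc in the program from the first message.

```python
import codecs

text = '\U0001f4cb.'  # the emoji followed by a period, as in the report

with_bom = text.encode('utf-16')        # generic codec: BOM + data
without_bom = text.encode('utf-16-le')  # explicit byte order: data only

# The generic codec prepends a 2-byte byte order mark in native order.
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, len(with_bom), len(without_bom))

# Byte count for the allocation: the encoded text plus a 2-byte NUL terminator.
alloc_size = len(without_bom) + 2
print(alloc_size)
```

With 'utf-16' the extra two BOM bytes would only make the allocation larger than necessary, so the size would still be sufficient; 'utf-16-le' gives the exact payload length.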

History
Date	User	Action	Args
2022-04-11 14:58:10	admin	set	github: 67188
2014-12-05 10:36:01	amaury.forgeotdarc	set	messages: + msg232191
2014-12-05 10:32:39	amaury.forgeotdarc	set	status: open -> closed nosy: + amaury.forgeotdarc messages: + msg232190 resolution: not a bug
2014-12-05 10:11:03	Cees.Timmerman	set	files: + test_clipboard_win.py messages: + msg232189
2014-12-05 09:58:05	Cees.Timmerman	create