classification
Title: Tkinter clipboard_get() decodes characters incorrectly
Type: behavior
Stage: resolved
Components: Tkinter
Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: asvetlov, loewis, ned.deily, python-dev, serhiy.storchaka, takluyver, terry.reedy
Priority: normal
Keywords: patch

Created on 2012-05-10 22:47 by takluyver, last changed 2012-05-16 01:17 by ned.deily. This issue is now closed.

Files
File name | Uploaded | Description
x11-clipboard-utf8.patch | takluyver, 2012-05-12 17:21 | clipboard_get and selection_get default to UTF8_STRING on X11
x11-clipboard-try-utf8.patch | takluyver, 2012-05-13 20:21 | 2nd revision of patch
x11-clipboard-try-utf8-3.patch | takluyver, 2012-05-13 21:29 | 3rd revision of patch
x11-clipboard-try-utf8-4.patch | ned.deily, 2012-05-14 00:40 |
x11-clipboard-try-utf8-4_27.patch | ned.deily, 2012-05-14 00:40 |
Messages (36)
msg160378 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-10 22:47
With the text 'abc€' copied to the clipboard, on Linux, where UTF-8 is the default encoding:

Python 3.2.3 (default, Apr 12 2012, 21:55:50) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tkinter
>>> root = tkinter.Tk()
>>> root.clipboard_get()
'abcâ\x82¬'
>>> 'abc€'.encode('utf-8').decode('latin-1')
'abcâ\x82¬'

I see the same behaviour in 2.7.3 as well (it returns a unicode string u'abc\xe2\x82\xac').

If the clipboard is only accessible at a bytes level, I think clipboard_get should return a bytes object. But I can reliably copy and paste non-ascii characters between programs, so it looks like it's possible to return unicode.
msg160379 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-10 23:09
Still worse. I get 'abc?'. Linux, Python 3.1, 3.2, and 3.3, UTF-8 locale.
msg160419 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-05-11 16:39
3.3, Win 7, Idle
>>> root.clipboard_get()
'abc€'
after cut from here
msg160438 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-11 18:41
This issue can be reproduced by pure Tcl/Tk:

$ wish
% clipboard get
abc?
% clipboard get -type STRING
abc?
% clipboard get -type UTF8_STRING
abc€

Use `root.clipboard_get(type='UTF8_STRING')` in Python.

I don't know whether it should just be documented (UTF8_STRING is not even mentioned in the clipboard_get docstring), or do we need to change the default behavior.
msg160440 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-11 19:09
On this computer, I see this from Tcl:

$ wish
% clipboard get
abc\u20ac

But here Python's following suit:

>>> root.clipboard_get()
'abc\\u20ac'

Which is odd, because as far as I know, my two computers run the same OS (Ubuntu 12.04) in the same configuration. I briefly thought the presence of xsel might be affecting it, but uninstalling it doesn't seem to make any difference.
msg160441 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-11 19:24
As is often the case with Tcl/Tk issues, there are platform differences.  On OS X, with the two native Tcl/Tk implementations (Aqua Cocoa and Aqua Carbon), the examples appear to work as is *and* type "UTF8_STRING" does not exist.  The less commonly used X11 Tcl/Tk on OS X does support and require "UTF8_STRING" for the example given.  So any doc change needs to be carefully worded.
msg160444 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-11 19:31
OK, after a quick bit of reading, I see why I'm confused: the clipboard actually works by requesting the text from the source program, so where you copy it from makes a difference. In my case, copying from firefox gives 'abc\\u20ac', and copying from Geany gives u'abc\xe2\x82\xac'.

However, I still think there's something that can be improved in Python. As Serhiy points out, specifying type='UTF8_STRING' makes it work properly from both programs. The Tcl documentation recommends this as the best option for "modern X11 systems"[1].

From what Ned says, we can't make UTF8_STRING the default everywhere, but is there a way to detect if we're inside X11, and use UTF8_STRING by default there?

[1] http://www.tcl.tk/man/tcl/TkCmd/clipboard.htm
msg160450 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-05-11 21:02
There are definitely platform differences. As I noted, the original example works fine on Windows. However

>>> root.clipboard_get(type='STRING')
'abc€'
>>> root.clipboard_get(type='UTF8_STRING')
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    root.clipboard_get(type='UTF8_STRING')
  File "C:\Programs\Python33\lib\tkinter\__init__.py", line 549, in clipboard_get
    return self.tk.call(('clipboard', 'get') + self._options(kw))
_tkinter.TclError: CLIPBOARD selection doesn't exist or form "UTF8_STRING" not defined

Of course, on Windows I suspect that the unicode string is not copied to the clipboard as utf8 bytes, so if clipboard contents are tagged, there would not be such a thing. Perhaps clipboards work differently on different OSes.

>>> help(root.clipboard_get)
...
    The type keyword specifies the form in which the data is
    to be returned and should be an atom name such as STRING
    or FILE_NAME.  Type defaults to STRING.

(Actually, FILE_NAME gives the same exception as UTF8_STRING.)
msg160451 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-11 21:19
Most likely the best way to determine the windowing system is to use the "tk windowingsystem" command (http://www.tcl.tk/man/tcl8.5/TkCmd/tk.htm#M10), so something like this:

    root = tkinter.Tk()
    root.call(('tk', 'windowingsystem'))

As documented, the call returns 'x11' for X11-based systems, 'win32' for Windows, and 'aqua' for the native OS X implementations.
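A minimal sketch of how that query could drive the choice of a default selection type. The helper names `windowing_system` and `default_selection_type` are made up for illustration, and the rule "UTF8_STRING only on x11" is an assumption drawn from the reports in this thread, not a statement about what tkinter does. The only Tcl command involved is `tk windowingsystem`; the caller passes in something like `root.tk.call`.

```python
def windowing_system(tk_call):
    """Return the Tk windowing system name: 'x11', 'win32' or 'aqua'.

    tk_call is any callable that forwards its arguments to Tcl,
    for example root.tk.call on a tkinter.Tk instance.
    """
    return str(tk_call('tk', 'windowingsystem'))


def default_selection_type(system):
    """Pick the selection type to request when the caller gave none.

    Assumption for this sketch: only X11 benefits from UTF8_STRING.
    On the other systems the type may not exist, so STRING is kept.
    """
    return 'UTF8_STRING' if system == 'x11' else 'STRING'
```

With a display available, `default_selection_type(windowing_system(tkinter.Tk().tk.call))` would give the type to pass as `type=`.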
msg160452 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-11 21:25
Thanks, Ned.

Does it seem like a good idea to test the windowing system like that, and default to UTF8_STRING if it's x11? So far, I've not found any case on X where STRING works but UTF8_STRING doesn't. If it seems reasonable, I'm happy to have a go at making a patch.
msg160456 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-11 22:02
A patch would be great.  I don't have a strong opinion about the issue one way or another.  I suppose it would simplify things for Python 3 users if the clipboard results were returned properly in the default case when no 'type' argument is passed to clipboard_get().  For Python 2, changing things seems a little more questionable but, as long as it was already returning a unicode object in that case, it sounds like a bug fix rather than a feature.  Martin, Andrew: any opinions on this?
msg160486 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-12 17:21
Here's a patch that makes UTF8_STRING the default type for clipboard_get and selection_get when running in X11.
msg160545 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2012-05-13 18:50
Patch looks good to me, works fine.
I think it can be applied to 2.7 as well.
There is only one problem: I don't know how to write a test for it without using external tools like xclip or ctypes bindings for the X shared library.
msg160548 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-13 18:55
Indeed, and there don't seem to be any other tests for the clipboard functionality.
msg160551 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2012-05-13 19:04
You are right: there are no tests for this, as is the case for most of tkinter.
Why not write one, if possible?
msg160552 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 19:13
I'm skeptical about the patch. In both 2.7 and 3.x, clipboard_get returns a Unicode string, yet it fails to decode it properly. So I think this is the bug that ought to be fixed (using the proper encoding).

Defaulting to UTF8_STRING is a new feature, IMO, and shouldn't be done for 2.7 (or 3.2).
msg160555 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-13 19:23
Martin, is there a way for _tkinter to know whether the result returned from Tcl/Tk is an encoded string or not in this case?

With regard to the patch, it would be better to cache the results of the first-time call to get the windowingsystem value so that we don't have to make two calls down into Tcl for each clipboard_get.
msg160556 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 19:29
On Fri, 2012-05-11 at 21:25 +0000, Thomas Kluyver wrote:
> So far, I've not found any case on X where STRING works but UTF8_STRING doesn't.

Perhaps there will be problems with the old (very old) closed source
software.

A few years ago (in Debian Sarge) even xsel did not work with the
non-ascii strings.
msg160557 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-13 19:33
But the encoding used seemingly depends on the source application - Geany (GTK 2, I think) seemingly sends UTF8 text anyway, whereas Firefox escapes the unicode character. So I don't think we can correctly decode the STRING value in all cases.

The Tk documentation describes UTF8_STRING as being the "most useful" type on modern X11.
msg160559 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 19:38
> But the encoding used seemingly depends on the source application - Geany (GTK 2, I think) seemingly sends UTF8 text anyway, whereas Firefox escapes the unicode character. So I don't think we can correctly decode the STRING value in all cases.

Agree. Opera sends 'abc?' literally.
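To make the point concrete, here is a small sketch of the three STRING results reported in this thread and the repair each one would need. The helper names `undo_latin1` and `undo_escapes` are hypothetical, and the sample values are taken from the messages above rather than from a live clipboard.

```python
ORIGINAL = 'abc€'

# What each source application handed back for a STRING request,
# as reported in this thread.
SAMPLES = {
    'Geany': ORIGINAL.encode('utf-8').decode('latin-1'),  # UTF-8 bytes read as Latin-1
    'Firefox': 'abc\\u20ac',                              # literal backslash escape
    'Opera': 'abc?',                                      # character already lost
}


def undo_latin1(text):
    """Reinterpret Latin-1-decoded text as the UTF-8 bytes it came from."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def undo_escapes(text):
    """Expand literal \\uXXXX escapes in otherwise ASCII text."""
    try:
        return text.encode('ascii').decode('unicode_escape')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# For each source, list the repairs that recover the original text.
recovered = {
    source: [fix.__name__ for fix in (undo_latin1, undo_escapes)
             if fix(text) == ORIGINAL]
    for source, text in SAMPLES.items()
}
```

Each repair works for exactly one source and nothing recovers the Opera result, so the receiver cannot pick one decoding for STRING without knowing which application owns the selection.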
msg160560 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 19:40
> Martin, is there a way for _tkinter to know whether the result
> returned from Tcl/Tk is an encoded string or not in this case?

Off-hand, I don't know. I suppose there is a way to do this correctly,
but one might have to dig through many layers of software to find out
what that way is.

> With regard to the patch, it would be better to cache the results of
> the first-time call to get the windowingsystem value so that we don't
> have to make two calls down into Tcl for each clipboard_get.

That also.
msg160561 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-13 19:43
> But the encoding used seemingly depends on the source application -
> Geany (GTK 2, I think) seemingly sends UTF8 text anyway, whereas
> Firefox escapes the unicode character. So I don't think we can
> correctly decode the STRING value in all cases.

Ah, ok. IIUC, support for UTF8_STRING would also be in the realm of
the source application, right? If so, I think we should use something
more involved where we try UTF8_STRING first, and fall back to STRING
if the application doesn't support that.

This I could also accept for 2.7, since it "shouldn't" have a potential
for breakage.
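The try-then-fall-back idea can be sketched as follows. This is an illustration of the proposal, not the committed patch: `get_selection_text` is a made-up helper, and `LegacyOwner` and `ModernOwner` are hypothetical stand-ins for source applications so that the logic can be exercised without a display.

```python
def get_selection_text(getter, error_type):
    """Fetch selection text, preferring UTF8_STRING over the default type.

    getter is a callable such as root.clipboard_get or root.selection_get;
    error_type is the exception it raises when a type is unavailable
    (tkinter.TclError for the real methods).
    """
    try:
        return getter(type='UTF8_STRING')
    except error_type:
        # The owning application does not offer UTF8_STRING:
        # repeat the request with no type, which means STRING.
        return getter()


class LegacyOwner:
    """Hypothetical stand-in for an application that only offers STRING."""

    def get(self, type='STRING'):
        if type != 'STRING':
            raise LookupError('form "%s" not defined' % type)
        return 'abc?'


class ModernOwner:
    """Hypothetical stand-in for an application that offers both types."""

    def get(self, type='STRING'):
        return 'abc€' if type == 'UTF8_STRING' else 'abc?'
```

With a real widget the call would look like `get_selection_text(root.clipboard_get, tkinter.TclError)`. An explicit `type=` from the caller is deliberately not handled here; a caller who names a type should get exactly that type or the error.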
msg160562 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-13 19:58
+1 to Martin's proposal
msg160563 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-13 19:59
OK, I'll produce an updated patch.
msg160569 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-13 20:21
As requested, the second version of the patch (x11-clipboard-try-utf8):

- Caches the windowing system per object. The tk call to find the windowing system is made the first time clipboard_get or selection_get is called without specifying `type=`.
- If using UTF8_STRING throws an error, it falls back to the default call with no type specified (i.e. STRING).
msg160571 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-13 20:34
Not to bikeshed here, but I think it would be better to cache the windowingsystem value at the module level, since I assume an application could be calling clipboard_get on different tkinter objects and I don't think there is any possibility that the windowingsystem value could vary within one interpreter invocation.
msg160573 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-13 20:40
I'm happy to put the cache at the module level, but I'll give other people a chance to express their views before I dive into the code again.

I imagine most applications would only call clipboard_get() on one item, so it wouldn't matter. However, my own interest in this is from IPython, where we create a Tk object just to call clipboard_get() once, so a module level cache would be quicker, albeit only a tiny bit.
msg160576 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-13 20:49
> Not to bikeshed here, but I think it would be better to cache the windowingsystem value at the module level, since I assume an application could be calling clipboard_get on different tkinter objects and I don't think there is any possibility that the windowingsystem value could vary within one interpreter invocation.

Why is Misc.tk not a module-level variable?
msg160580 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-13 21:29
The 3rd revision of the patch has the cache at the module level. It's a bit awkward, because there's no module level function to call to retrieve it (as far as I know), so it's exposed by objects which can call Tk.

Also, serhiy pointed out a mistake in the 2nd revision, which is fixed ('selection' instead of 'clipboard').
msg160588 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-14 00:40
Serhiy, I don't know why Misc.Tk is not module level, but it isn't, so caching global attributes there isn't effective.  However, upon further consideration, I take back my original suggestion of caching at the module level, primarily because I can think of future scenarios where different windowing systems might be supported in the same Python instance.  I now think the best solution is to cache at the Tk root object level; that appears to be a simple change to Thomas's 2nd revision.  Sorry about that!  Here is a fourth version (one for 3.x and one for 2.7) based on the second, which includes the fix from the 3rd.

I started to write a simple test for the clipboard functions but then realized that there doesn't seem to be a practical way to effectively test in a machine-independent way without destroying the contents of the Tk clipboard and hence the user's desktop clipboard, not a friendly thing to do.  For example, the clipboard might contain a data type not supported by the platform's Tk, like pict data on OS X.  So I'm not including the test here but it did verify that the attribute was being properly cached across multiple tkinter objects.

Thanks to Thomas for the patch and to Serhiy for reviewing.  By the way, Thomas, for your patch to be included, you should submit a PSF contributor agreement as described here:  http://www.python.org/psf/contrib/.  Once that is in place and if the patch looks good to everyone, I'll apply it.
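A rough sketch of caching at the root object level, so that each root asks Tcl only once while separate roots stay independent. Everything here is illustrative: `cached_windowing_system` and the attribute name `_sketch_windowing_system` are invented for this example, and `FakeInterp` and `FakeRoot` are hypothetical stand-ins that count calls instead of talking to Tk.

```python
class FakeInterp:
    """Hypothetical stand-in for the Tcl interpreter; counts calls."""

    def __init__(self, system):
        self.system = system
        self.calls = 0

    def call(self, *args):
        self.calls += 1
        return self.system


class FakeRoot:
    """Hypothetical stand-in for a Tk root object with a .tk attribute."""

    def __init__(self, system):
        self.tk = FakeInterp(system)


def cached_windowing_system(root):
    """Return the windowing system for root, querying Tcl at most once."""
    try:
        return root._sketch_windowing_system
    except AttributeError:
        # First request on this root: ask Tcl and remember the answer.
        root._sketch_windowing_system = root.tk.call('tk', 'windowingsystem')
        return root._sketch_windowing_system


x11_root, aqua_root = FakeRoot('x11'), FakeRoot('aqua')
answers = [cached_windowing_system(r) for r in (x11_root, x11_root, aqua_root)]
```

Storing the value on the root rather than in a module global is what allows two roots with different windowing systems to coexist, which is the scenario raised above.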
msg160714 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-15 11:31
I've submitted the contributor agreement, though I've not yet heard anything back about it.
msg160716 - (view) Author: Thomas Kluyver (takluyver) * Date: 2012-05-15 11:43
...And mere minutes after I said I hadn't heard anything, I've got the confirmation email. :-)
msg160718 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-15 11:56
> ...And mere minutes after I said I hadn't heard anything, I've got the confirmation email. :-)

Congratulations!
msg160722 - (view) Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2012-05-15 12:38
I'm OK with the latest patch version.
msg160789 - (view) Author: Roundup Robot (python-dev) Date: 2012-05-16 01:14
New changeset f70fa654f70e by Ned Deily in branch '2.7':
Issue #14777: In an X11 windowing environment, tkinter may return
http://hg.python.org/cpython/rev/f70fa654f70e

New changeset 41382250e5e1 by Ned Deily in branch '3.2':
Issue #14777: In an X11 windowing environment, tkinter may return
http://hg.python.org/cpython/rev/41382250e5e1

New changeset 97601cbf169f by Ned Deily in branch 'default':
Issue #14777: merge
http://hg.python.org/cpython/rev/97601cbf169f
msg160790 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-05-16 01:17
Applied for release in 2.7.4, 3.2.4 and 3.3.0.  Thanks all!
History
Date | User | Action | Args
2012-05-16 01:17:32 | ned.deily | set | status: open -> closed; resolution: fixed; messages: + msg160790; stage: patch review -> resolved
2012-05-16 01:14:05 | python-dev | set | nosy: + python-dev; messages: + msg160789
2012-05-15 12:38:32 | asvetlov | set | messages: + msg160722
2012-05-15 11:56:36 | serhiy.storchaka | set | messages: + msg160718
2012-05-15 11:43:57 | takluyver | set | messages: + msg160716
2012-05-15 11:31:06 | takluyver | set | messages: + msg160714
2012-05-14 00:40:46 | ned.deily | set | files: + x11-clipboard-try-utf8-4_27.patch
2012-05-14 00:40:22 | ned.deily | set | files: + x11-clipboard-try-utf8-4.patch; messages: + msg160588; stage: patch review
2012-05-13 21:29:47 | takluyver | set | files: + x11-clipboard-try-utf8-3.patch; messages: + msg160580
2012-05-13 20:49:51 | serhiy.storchaka | set | messages: + msg160576
2012-05-13 20:40:41 | takluyver | set | messages: + msg160573
2012-05-13 20:34:32 | ned.deily | set | messages: + msg160571
2012-05-13 20:21:00 | takluyver | set | files: + x11-clipboard-try-utf8.patch; messages: + msg160569
2012-05-13 19:59:43 | takluyver | set | messages: + msg160563
2012-05-13 19:58:21 | ned.deily | set | messages: + msg160562
2012-05-13 19:43:02 | loewis | set | messages: + msg160561
2012-05-13 19:40:24 | loewis | set | messages: + msg160560
2012-05-13 19:38:10 | serhiy.storchaka | set | messages: + msg160559
2012-05-13 19:33:39 | takluyver | set | messages: + msg160557
2012-05-13 19:29:56 | serhiy.storchaka | set | messages: + msg160556
2012-05-13 19:23:05 | ned.deily | set | messages: + msg160555
2012-05-13 19:22:51 | ned.deily | set | messages: - msg160554
2012-05-13 19:22:24 | ned.deily | set | messages: + msg160554
2012-05-13 19:13:29 | loewis | set | messages: + msg160552
2012-05-13 19:04:32 | asvetlov | set | messages: + msg160551
2012-05-13 18:55:49 | takluyver | set | messages: + msg160548
2012-05-13 18:50:55 | asvetlov | set | messages: + msg160545
2012-05-12 17:21:21 | takluyver | set | files: + x11-clipboard-utf8.patch; keywords: + patch; messages: + msg160486
2012-05-11 22:02:24 | ned.deily | set | nosy: + loewis, asvetlov; messages: + msg160456
2012-05-11 21:25:42 | takluyver | set | messages: + msg160452
2012-05-11 21:19:01 | ned.deily | set | messages: + msg160451
2012-05-11 21:02:01 | terry.reedy | set | messages: + msg160450
2012-05-11 19:31:32 | takluyver | set | messages: + msg160444
2012-05-11 19:24:14 | ned.deily | set | nosy: + ned.deily; messages: + msg160441
2012-05-11 19:09:53 | takluyver | set | messages: + msg160440
2012-05-11 18:41:49 | serhiy.storchaka | set | messages: + msg160438
2012-05-11 16:39:00 | terry.reedy | set | nosy: + terry.reedy; messages: + msg160419
2012-05-10 23:09:13 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg160379; versions: + Python 3.3
2012-05-10 22:47:56 | takluyver | create |