
classification
Title: Unicode - encoding seems to be lost for inputs of unicode chars in IDLE
Type: behavior Stage: resolved
Components: IDLE Versions: Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: 2.7 IDLE console uses incorrect encoding.
View: 15809
Assigned To: Nosy List: THRlWiTi, Tomoki.Imai, ezio.melotti, ned.deily, pradyunsg, r.david.murray, roger.serwy, terry.reedy
Priority: normal Keywords: patch

Created on 2013-03-04 10:49 by pradyunsg, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
PyShell.py.20130422.diff Tomoki.Imai, 2013-04-21 16:34 patch for Lib/idlelib/PyShell.py
Messages (11)
msg183431 - (view) Author: Pradyun Gedam (pradyunsg) * Date: 2013-03-04 10:49
In IDLE, I have spotted a peculiar problem.

I have attached a .png file which is a screen capture of a 'session' in IDLE. It seems that the Unicode character that has been input loses its encoding.

My 'session'
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> c = u'€'
>>> ord(c)
128
>>> c.encode('utf-8')
'\xc2\x80'
>>> c
u'\x80'
>>> print c
€
>>> c = u'\u20ac'
>>> ord(c)
8364
>>> c.encode('utf-8')
'\xe2\x82\xac'
>>> c
u'\u20ac'
>>> print c
€
>>>
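
The first half of that session is consistent with the euro sign reaching the interpreter as the single byte 0x80, and that byte then being read as the code point U+0080. A minimal Python 3 sketch of that reading follows. It assumes the Windows machine was using the cp1252 code page, which the report does not state, and all variable names are illustrative.

```python
# Assumption: the shell encoded the typed text with cp1252 (not stated in the report).
euro = "\u20ac"                        # the character the user typed

as_typed = euro.encode("cp1252")       # in cp1252 the euro sign is the single byte 0x80
misread = as_typed.decode("latin-1")   # reading that byte as Latin-1 yields U+0080

print(repr(as_typed))                  # the byte that was sent
print(ord(misread))                    # the code point the session reported
print(repr(misread.encode("utf-8")))   # UTF-8 form of U+0080, not of the euro sign
```

Under that assumption the sketch gives the same values as the session: a code point of 128 and a two-byte UTF-8 encoding.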
msg183783 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-03-09 02:25
I do not see any bug. Unicode chars do not have an encoding (except internally). The .encode() method encodes the unicode string to a byte string. It does *not* mutate the string. Since you do not bind the byte string to anything, it disappears. Compare:

>>> c = u'\u20ac'
>>> b = c.encode()
>>> c
'€'
>>> b
b'\xe2\x82\xac'

Now you have both the unicode string and the utf-8 encoded byte string that represents the char.

>>> b.decode()
'€'

If you have any more questions, please reread the tutorial or ask on python-list or even the tutor list. Also post there about any 'problems' you find.
msg187513 - (view) Author: Tomoki Imai (Tomoki.Imai) Date: 2013-04-21 16:34
No, this thread should not be closed!
This is an IDLE bug. I found that IDLE has a problem with unicode literals.

In the normal interpreter in a console:
>>> u"こんにちは"
u'\u3053\u3093\u306b\u3061\u306f'

In IDLE.
>>> u"こんにちは"
u'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'

I took a look at the IDLE code and found the bug.
It is in idlelib/PyShell.py:

    def runsource(self, source):
        "Extend base class method: Stuff the source in the line cache first"
        filename = self.stuffsource(source)
        self.more = 0
        self.save_warnings_filters = warnings.filters[:]
        warnings.filterwarnings(action="error", category=SyntaxWarning)
        print(source,len(source))

        if isinstance(source, types.UnicodeType):
            from idlelib import IOBinding
            try:
                source = source.encode(IOBinding.encoding)
            except UnicodeError:
                self.tkconsole.resetoutput()
                self.write("Unsupported characters in input\n")
                return
        try:
            print(source,len(source))
            # InteractiveInterpreter.runsource() calls its runcode() method,
            # which is overridden (see below)
            return InteractiveInterpreter.runsource(self, source, filename)
        finally:
            if self.save_warnings_filters is not None:
                warnings.filters[:] = self.save_warnings_filters
                self.save_warnings_filters = None


This code changes u"こんにちは" to u'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'.
I commented out the following lines.

        if isinstance(source, types.UnicodeType):
            from idlelib import IOBinding
            try:
                source = source.encode(IOBinding.encoding)
            except UnicodeError:
                self.tkconsole.resetoutput()
                self.write("Unsupported characters in input\n")
                return

And now it works.
This is not well tested; I'll write unit tests in GSoC (if I can).
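
The mangled IDLE output can be reproduced outside IDLE. The Python 3 sketch below assumes that the UTF-8 bytes produced by the encode step are later read back one byte per character, which is what a Latin-1 decode does. The thread does not spell out that second step, so treat it as an assumption; the variable names are illustrative.

```python
text = "こんにちは"                     # what was typed: 5 code points

encoded = text.encode("utf-8")         # the step the quoted code performs (here with UTF-8)
mangled = encoded.decode("latin-1")    # assumed: each byte becomes one character

print(len(text), len(encoded), len(mangled))
print(ascii(mangled))                  # same escapes as the IDLE output above

# The damage is reversible as long as the bytes themselves are intact.
recovered = mangled.encode("latin-1").decode("utf-8")
print(recovered == text)
```

The 5 characters become 15, one per UTF-8 byte, which matches the escape sequence IDLE echoed.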
msg187519 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-21 20:02
I believe you have indeed understood what the original poster was reporting.

However, those lines date back a long time (2002 or earlier).  They exist in Python2 only, and there they have a purpose, so they can't just be deleted.

My guess is the problem is a conflict between the locale setting and the encoding used when the character string is input into IDLE.

For me, if I cut and paste that string into the idle shell in python2, it shows up as the unicode escape characters (meaning IDLE is doing the correct conversion at input time on my system).  In Python3 it looks the same, except that the echoed output shows the expected glyphs instead of the unicode escapes as it does in Python2, which is as expected.

My only locale setting, by the way, is LC_CTYPE=en_US.UTF-8.  What is your setting?

I don't know if there is a better way for idle to behave in the modern era or not.  Perhaps it should be using utf-8 by default instead of the locale?  Do you know how (and in what charset) your system is generating the characters you type into idle?
msg187537 - (view) Author: Tomoki Imai (Tomoki.Imai) Date: 2013-04-21 23:19
Thanks.

I noticed Terry used python3 to confirm this problem...

I am Japanese, but I use an English environment.
Here are my locale settings. I'm using Linux.
konomi:tomoki% locale                                    
LANG=en_US.utf8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

All strings used internally should be of the unicode type.
In Japan, many charsets are in use (cp932, euc-jp, ...).
They cause problems in Python 2 unless they are converted to the unicode type.
Remember, the unicode type and "utf-8" are not the same thing.

When I type into a Tkinter Entry and get the Entry's value, it returns unicode.
The deleted code converts that unicode to the str type.
The types were reorganized in Python 3 (unicode became str, and str became bytes).
So these lines are not in the Python 3 code.
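
To make the distinction between text and its encodings concrete, here is a small Python 3 sketch (where str is the text type and bytes is the byte type). The three codecs are examples of charsets common for Japanese; the variable names are illustrative.

```python
text = "こんにちは"        # text: a sequence of 5 code points, no encoding attached

# The same text as bytes under three different encodings.
encodings = ("utf-8", "cp932", "euc_jp")
as_bytes = {name: text.encode(name) for name in encodings}

print(type(text).__name__, len(text))
for name, data in as_bytes.items():
    print(name, type(data).__name__, len(data), data)

# Every byte form decodes back to the same text, but only with its own codec.
for name, data in as_bytes.items():
    assert data.decode(name) == text
```

The byte strings differ in both length and content, while the text they represent is the same, which is why UTF-8 is only one possible byte form of a unicode string.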

I typed these strings using an "Input Method" (I am using uim).
https://code.google.com/p/uim/
But I don't know how uim generates these characters.
msg187541 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-22 00:43
Well, it does seem to me that there is something wrong here.  Your fix may even be correct, but I'd hesitate to apply it without someone understanding why those lines were added in the first place.  (I *think* they were added by Martin von Loewis, but I'm not 100% sure since the commit was part of a block of changes by different authors and I'm just guessing based on the comment on the commit.)

I'm reopening the issue.  I'll have to leave it to the idle team (or you) to figure out what the correct fix is that doesn't break anything :)
msg187542 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-04-22 01:26
When discussing problematical behavior, one should specify OS and exact Python version, including bugfix number. If at all possible, one should use the latest bugfix release with all released bugfixes. 2.7.3 came out 10+ months before the original report. I do not presume without evidence that it has the same behavior as the 2.7.2. The recently released 2.7.4 has another year of bugfixes, so it might also behave differently.

Looking again at the original report, I see that the false issue of lost encoding obscured to me a real problem: ord(u'€') is 8364, not 128. Does 2.7.4 make the same error for that input? What does it do with u"こんにちは"?

(Note, on the Windows console, both keying and viewing unicode chars is problematical, apparently more so than with the *nix consoles. If I could not paste u"こんにちは", I would most likely just key u'\u3053\u3093\u306b\u3061\u306f'.)

I believe the underlying problem is that a Python 2 program is a stream of bytes while a Python 3 program is a stream of unicode codepoints. So in Python 2, a unicode literal has to be encoded to bytes before being decoded back to unicode codepoints in a unicode string object.

David, I presume this is why you say we cannot just toss out the encoding to bytes. I presume that you are also suggesting that the encoding and subsequent decoding are done with different codecs because of locale issues. Might IOBinding.encoding be miscalculated?

For ascii codepoints, the encoding and decoding is typically a null operation. This means that \u#### escapes, as opposed to non-ascii codepoints, should not get mangled before being interpreted during the creation of the unicode object. Using such escapes is one solution to the problem.
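
A Python 3 sketch of why the escape form is robust: source text that contains only ASCII characters has the same bytes under every ASCII-compatible encoding, so a mismatch between the encoding and decoding codecs cannot change it. The list of codecs and the use of `ast.literal_eval` to stand in for the interpreter evaluating the literal are choices made for this illustration.

```python
import ast

# A string literal written with an escape: the source text itself is pure ASCII.
escaped_source = r"'\u20ac'"
assert all(ord(ch) < 128 for ch in escaped_source)

ascii_bytes = escaped_source.encode("ascii")
results = []
for encoding in ("utf-8", "latin-1", "cp1252"):
    as_bytes = escaped_source.encode(encoding)
    # Encoding ASCII-only text is a no-op whichever codec is used...
    assert as_bytes == ascii_bytes
    # ...so decoding with a different codec still yields the original source.
    decoded = as_bytes.decode("latin-1")
    results.append(ast.literal_eval(decoded))

print([ascii(value) for value in results])
```

The escape is only interpreted when the literal is evaluated, after any encoding round trip has already happened.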

Another is to use Python 3. That *is* the generic answer to many Python 2.x unicode problems. In 3.3.1:
>>> u"こんにちは"
'こんにちは'
problem solved ;-).

In other words, fixing 2.7-only unicode bugs has fairly low priority in general. However, if there is an easy fix here that Roger thinks is safe, it can be applied.
msg187546 - (view) Author: Tomoki Imai (Tomoki.Imai) Date: 2013-04-22 03:17
Sorry, I forgot to note my environment.

I'm using Arch Linux.
$ uname -a
Linux manaka 3.8.7-1-ARCH #1 SMP PREEMPT Sat Apr 13 09:01:47 CEST 2013 x86_64 GNU/Linux

And here is the Python version.
$ python --version
Python 2.7.4

IDLE's version is the same, 2.7.4, downloaded from the following link.
http://www.python.org/download/releases/2.7.4/

In IDLE, I repeated the original author's attempts.

Python 2.7.4 (default, Apr  6 2013, 19:20:36)
[GCC 4.8.0] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> c = u'€'
>>> ord(c)

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    ord(c)
TypeError: ord() expected a character, but string of length 3 found
>>> c.encode('utf-8')
'\xc3\xa2\xc2\x82\xc2\xac'
>>> c
u'\xe2\x82\xac'
>>> print c
€
>>> c = u'\u20ac'
>>> ord(c)
8364
>>> c.encode('utf-8')
'\xe2\x82\xac'
>>> c
u'\u20ac'
>>> print c
€
>>>
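
The values in the first half of this session fit a double-encoding pattern. The Python 3 sketch below assumes the euro sign was encoded to UTF-8 and that the three resulting bytes were then each treated as a separate character; the thread does not state the exact codec path, so this is one plausible reading rather than a confirmed diagnosis.

```python
euro = "€"

# Assumed path: encode to UTF-8, then read each byte as its own character.
mangled = euro.encode("utf-8").decode("latin-1")

print(len(mangled))                 # one character became three
print(ascii(mangled))

try:
    ord(mangled)                    # ord() only accepts a single character
except TypeError as exc:
    print("TypeError:", exc)

# Encoding the already-mangled string again doubles the damage.
double_encoded = mangled.encode("utf-8")
print(double_encoded)
```

With a UTF-8 locale the euro sign occupies three bytes, which would explain why this session fails differently from the original Windows report, where the character apparently arrived as a single byte.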

I have a problem, but it is different from the original one.
After my fix:

Python 2.7.4 (default, Apr  6 2013, 19:20:36)
[GCC 4.8.0] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> c = u'€'
>>> ord(c)
8364
>>> c.encode('utf-8')
'\xe2\x82\xac'
>>> c
u'\u20ac'
>>> print c
€
>>>

It works.

Using unicode escapes is one solution.
But we Japanese can type u'こんにちは' in just 5 or 10 keystrokes.
And other people who use unicode literals for their own languages are in the same situation.
Why should IDLE users (probably beginners) have to use such a workaround?

Of course, using Python 3 is the best way.
All beginners should start with Python 3 now.
But there are people, including me, who have to use Python 2 because of libraries.
msg187548 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2013-04-22 04:46
Also see Issue15809 in which Martin proposed the same patch but then explained why it isn't totally correct.
msg187549 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-22 05:32
For those of us without fancy input methods, I was able to see the same problem using a simple non-ascii letter (an accented a: á).  You will note that my stdin encoding ought to be utf-8, so I'm not sure why it fails (but I didn't check that).  Removing the lines in question makes it work, but as Martin says in the referenced issue, I won't be surprised if that breaks working with a file that has a non-utf8 coding cookie.
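
The coding-cookie concern can be illustrated in general terms. The Python 3 sketch below is an analogue, not the IDLE 2.7 code path discussed in the referenced issue: it relies on `compile()` honouring a PEP 263 encoding declaration when it is given bytes, and `run_source` is a hypothetical helper written for this example.

```python
def run_source(source_bytes):
    """Compile and run byte source, returning the value it binds to x."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["x"]

program = "# -*- coding: latin-1 -*-\nx = 'á'\n"

# Bytes that really are Latin-1, as the cookie declares: the literal survives.
matching = run_source(program.encode("latin-1"))

# Bytes that are actually UTF-8 while the cookie still says Latin-1:
# the two UTF-8 bytes of 'á' are read as two separate characters.
mismatched = run_source(program.encode("utf-8"))

print(ascii(matching), ascii(mismatched))
```

Whenever the declared encoding and the real byte encoding disagree, non-ASCII literals are silently altered, so any change to how the shell encodes its input has to stay consistent with whatever encoding the source claims.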

If Martin sees no way to make it work in Python2 it makes it pretty unlikely we'll find a fix :(.
msg187559 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-22 12:45
Serhiy has proposed a patch on the older issue, closing this one since it is a duplicate.
History
Date User Action Args
2022-04-11 14:57:42 admin set github: 61550
2013-08-20 15:04:05 THRlWiTi set nosy: + THRlWiTi
2013-04-22 12:45:51 r.david.murray set status: open -> closed
    messages: + msg187559
    stage: resolved
2013-04-22 10:01:27 serhiy.storchaka set superseder: 2.7 IDLE console uses incorrect encoding.
    resolution: duplicate
2013-04-22 05:32:39 r.david.murray set resolution: not a bug -> (no value)
    messages: + msg187549
    stage: resolved -> (no value)
2013-04-22 04:50:19 ned.deily set messages: - msg187547
2013-04-22 04:46:23 ned.deily set messages: + msg187548
2013-04-22 04:41:51 ned.deily set nosy: + ned.deily
    messages: + msg187547
2013-04-22 03:17:42 Tomoki.Imai set messages: + msg187546
2013-04-22 01:26:52 terry.reedy set resolution: not a bug
    messages: + msg187542
    stage: resolved
2013-04-22 00:43:32 r.david.murray set status: closed -> open
    nosy: + roger.serwy
    messages: + msg187541
    resolution: not a bug -> (no value)
    stage: resolved -> (no value)
2013-04-21 23:19:42 Tomoki.Imai set messages: + msg187537
2013-04-21 20:03:10 r.david.murray set title: Unicode - encoding seems to be lost for inputs of unicode chars -> Unicode - encoding seems to be lost for inputs of unicode chars in IDLE
2013-04-21 20:02:01 r.david.murray set nosy: + r.david.murray
    messages: + msg187519
2013-04-21 16:34:56 Tomoki.Imai set files: + PyShell.py.20130422.diff
    type: behavior
    components: + IDLE, - Unicode
    keywords: + patch
    nosy: + Tomoki.Imai
    messages: + msg187513
2013-03-09 02:25:07 terry.reedy set status: open -> closed
    components: + Unicode, - IDLE
    nosy: + ezio.melotti, terry.reedy
    messages: + msg183783
    resolution: not a bug
    stage: resolved
2013-03-04 10:49:37 pradyunsg create