Issue 17348: Unicode - encoding seems to be lost for inputs of unicode chars in IDLE

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61550

classification

Title:	Unicode - encoding seems to be lost for inputs of unicode chars in IDLE
Type:	behavior	Stage:	resolved
Components:	IDLE	Versions:	Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	2.7 IDLE console uses incorrect encoding. View: 15809
Assigned To:		Nosy List:	THRlWiTi, Tomoki.Imai, ezio.melotti, ned.deily, pradyunsg, r.david.murray, roger.serwy, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2013-03-04 10:49 by pradyunsg, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
PyShell.py.20130422.diff	Tomoki.Imai, 2013-04-21 16:34	patch for Lib/idlelib/PyShell.py

Messages (11)
msg183431 - (view)	Author: Pradyun Gedam (pradyunsg) *	Date: 2013-03-04 10:49
In IDLE, I have spotted a peculiar problem. I have attached a .png file, which is a screen capture of a 'session' in IDLE. It seems that the Unicode character that has been input loses its encoding.

My 'session':

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> c = u'€'
>>> ord(c)
128
>>> c.encode('utf-8')
'\xc2\x80'
>>> c
u'\x80'
>>> print c

>>> c = u'\u20ac'
>>> ord(c)
8364
>>> c.encode('utf-8')
'\xe2\x82\xac'
>>> c
u'\u20ac'
>>> print c
€
>>>
msg183783 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-03-09 02:25
I do not see any bug. Unicode chars do not have an encoding (except internally). The .encode() method encodes the unicode string to a byte string. It does not mutate the string. Since you do not bind the byte string to anything, it disappears. Compare

>>> c = u'\u20ac'
>>> b = c.encode()
>>> c
'€'
>>> b
b'\xe2\x82\xac'

Now you have both the unicode string and the utf-8 encoded byte string that represents the char.

>>> b.decode()
'€'

If you have any more questions, please reread the tutorial or ask on python-list or even the tutor list. Also post there about any 'problems' you find.
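The point about .encode() returning a new object can be checked with a minimal sketch. It assumes Python 3 (as the session above appears to), where the default codec for .encode() and .decode() is UTF-8; the codec is spelled out here only for clarity.

```python
# Python 3 sketch: encoding returns a new bytes object and leaves the str untouched.
c = "\u20ac"                  # EURO SIGN, a single code point
b = c.encode("utf-8")         # new bytes object; c itself is not modified

print(ascii(c), ord(c))       # the string still holds one code point
print(b)                      # its UTF-8 representation, three bytes
print(b.decode("utf-8") == c) # decoding the bytes gives the original string back
```

Binding the result of .encode() to a name, as with b here, is what keeps the byte string around for later use.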
msg187513 - (view)	Author: Tomoki Imai (Tomoki.Imai)	Date: 2013-04-21 16:34
No, this thread should not be closed! This is an IDLE bug. I found that IDLE has an issue with unicode literals.

In the normal interpreter in a console:

>>> u"こんにちは"
u'\u3053\u3093\u306b\u3061\u306f'

In IDLE:

>>> u"こんにちは"
u'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'

I took a look at the IDLE code and found the bug in idlelib/PyShell.py:

    def runsource(self, source):
        "Extend base class method: Stuff the source in the line cache first"
        filename = self.stuffsource(source)
        self.more = 0
        self.save_warnings_filters = warnings.filters[:]
        warnings.filterwarnings(action="error", category=SyntaxWarning)
        print(source,len(source))
        if isinstance(source, types.UnicodeType):
            from idlelib import IOBinding
            try:
                source = source.encode(IOBinding.encoding)
            except UnicodeError:
                self.tkconsole.resetoutput()
                self.write("Unsupported characters in input\n")
                return
        try:
            print(source,len(source))
            # InteractiveInterpreter.runsource() calls its runcode() method,
            # which is overridden (see below)
            return InteractiveInterpreter.runsource(self, source, filename)
        finally:
            if self.save_warnings_filters is not None:
                warnings.filters[:] = self.save_warnings_filters
                self.save_warnings_filters = None

This code changes u"こんにちは" to u'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'.

I commented out the following lines:

        if isinstance(source, types.UnicodeType):
            from idlelib import IOBinding
            try:
                source = source.encode(IOBinding.encoding)
            except UnicodeError:
                self.tkconsole.resetoutput()
                self.write("Unsupported characters in input\n")
                return

And now it works. It is not well tested; I'll write unit tests in GSoC (if I can).
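The mangled value reported above can be reproduced outside IDLE with a short Python 3 sketch. Both codec choices are assumptions rather than something the thread confirms: UTF-8 stands in for IOBinding.encoding on this system, and the Latin-1 step simply models "each byte becomes one code point", which happens to match the reported output.

```python
# Python 3 sketch of the suspected round trip; both codec names are assumptions.
text = "こんにちは"

# Step 1: the typed source is encoded to bytes (UTF-8 assumed for this system).
encoded = text.encode("utf-8")

# Step 2: each byte is then read back as its own code point, which is what
# decoding as Latin-1 does, so every character turns into three.
mangled = encoded.decode("latin-1")

print(ascii(text))     # '\u3053\u3093\u306b\u3061\u306f'
print(ascii(mangled))  # '\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'
print(len(text), len(mangled))
```

Skipping step 1, which is what commenting out the lines does, leaves the five original code points intact.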
msg187519 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-04-21 20:02
I believe you have indeed understood what the original poster was reporting. However, those lines date back a long time (2002 or earlier). They exist in Python2 only, and there they have a purpose, so they can't just be deleted. My guess is the problem is a conflict between the locale setting and the encoding used when the character string is input into IDLE. For me, if I cut and paste that string into the idle shell in python2, it shows up as the unicode escape characters (meaning IDLE is doing the correct conversion at input time on my system). In Python3 it looks the same, except that the echoed output shows the expected glyphs instead of the unicode escapes as it does in Python2, which is as expected. My only locale setting, by the way, is LC_CTYPE=en_US.UTF-8. What is your setting? I don't know if there is a better way for idle to behave in the modern era or not. Perhaps it should be using utf-8 by default instead of the locale? Do you know how (and in what charset) your system is generating the characters you type into idle?
msg187537 - (view)	Author: Tomoki Imai (Tomoki.Imai)	Date: 2013-04-21 23:19
Thanks. I noticed Terry used python3 to confirm this problem...

I am Japanese, but I am using an English environment. Here are my locale settings. I'm using Linux.

konomi:tomoki% locale
LANG=en_US.utf8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

All strings used internally should be of unicode type. In Japan, there are many charsets (cp932, euc-jp, ...), and they cause problems in Python2 unless they are converted to the unicode type. Remember, the unicode type and "utf-8" are not the same thing.

When I type into Tkinter's Entry and get the Entry's value, it returns unicode to me. And the deleted code converts unicode to the str type. They are unified in Python3 (unicode became str, and str became bytes). So these lines are not in the Python3 code.

I typed these strings using an "Input Method" (I am using uim).
https://code.google.com/p/uim/
But I don't know how uim generates these characters.
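To illustrate why a unicode string and any one byte encoding are different things, here is a small Python 3 sketch. The sample text and the three codec names are chosen for illustration; the codecs are the standard-library ones for the charsets mentioned above.

```python
# Python 3 sketch: one unicode string, three different byte representations.
text = "こんにちは"
encodings = ["utf-8", "cp932", "euc_jp"]

encoded = {name: text.encode(name) for name in encodings}
for name, data in encoded.items():
    # Same five code points, but different bytes (and lengths) per charset.
    print(name, len(data), data)

# Bytes are only meaningful together with the codec that produced them.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

Decoding each byte string with its own codec recovers the same text, while decoding with a different one generally raises an error or yields garbage.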
msg187541 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-04-22 00:43
Well, it does seem to me that there is something wrong here. Your fix may even be correct, but I'd hesitate to apply it without someone understanding why those lines were added in the first place. (I think they were added by Martin von Loewis, but I'm not 100% sure since the commit was part of a block of changes by different authors and I'm just guessing based on the comment on the commit.) I'm reopening the issue. I'll have to leave it to the idle team (or you) to figure out what the correct fix is that doesn't break anything :)
msg187542 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-04-22 01:26
When discussing problematical behavior, one should specify OS and exact Python version, including bugfix number. If at all possible, one should use the latest bugfix release with all released bugfixes. 2.7.3 came out 10+ months before the original report. I do not presume without evidence that it has the same behavior as 2.7.2. The recently released 2.7.4 has another year of bugfixes, so it might also behave differently.

Looking again at the original report, I see that the false issue of lost encoding obscured to me a real problem: ord(u'€') is 8364, not 128. Does 2.7.4 make the same error for that input? What does it do with u"こんにちは"? (Note, on the Windows console, both keying and viewing unicode chars is problematical, apparently more so than with the *nix consoles. If I could not paste u"こんにちは", I would most likely just key u'\u3053\u3093\u306b\u3061\u306f'.)

I believe the underlying problem is that a Python 2 program is a stream of bytes while a Python 3 program is a stream of unicode codepoints. So in Python 2, a unicode literal has to be encoded to bytes before being decoded back to unicode codepoints in a unicode string object. David, I presume this is why you say we cannot just toss out the encoding to bytes. I presume that you are also suggesting that the encoding and subsequent decoding are done with different codecs because of locale issues. Might IOBinding.encoding be miscalculated?

For ascii codepoints, the encoding and decoding is typically a null operation. This means that \u#### escapes, as opposed to non-ascii codepoints, should not get mangled before being interpreted during the creation of the unicode object. Using such escapes is one solution to the problem.

Another is to use Python 3. That is the generic answer to many Python 2.x unicode problems. In 3.3.1:

>>> u"こんにちは"
'こんにちは'

problem solved ;-). In other words, fixing 2.7-only unicode bugs has fairly low priority in general. However, if there is an easy fix here that Roger thinks is safe, it can be applied.
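The "encoded with one codec, decoded with another" hypothesis can be sketched in Python 3 as follows. The helper name simulate_shell is hypothetical, cp1252 is only a guess at the original reporter's Windows locale encoding, and Latin-1 is assumed for the decoding side because it reproduces the numbers in the original report.

```python
def simulate_shell(text, shell_encoding):
    """Hypothetical helper: encode as the shell might, then misread the bytes as Latin-1."""
    return text.encode(shell_encoding).decode("latin-1")

# A non-ascii literal: with cp1252 (assumed), the euro sign becomes the single byte 0x80.
win = simulate_shell("€", "cp1252")
print(ord(win), win.encode("utf-8"))   # 128 b'\xc2\x80'

# Source text written with a \u escape is pure ascii, so the mismatch cannot touch it.
escaped_source = r"u'\u20ac'"
print(simulate_shell(escaped_source, "cp1252") == escaped_source)
```

This is consistent with the observation that the escaped form u'\u20ac' gave the right answer in the original session while the literal u'€' did not.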
msg187546 - (view)	Author: Tomoki Imai (Tomoki.Imai)	Date: 2013-04-22 03:17
Sorry. I forgot to note my environment. I'm using Arch Linux.

$ uname -a
Linux manaka 3.8.7-1-ARCH #1 SMP PREEMPT Sat Apr 13 09:01:47 CEST 2013 x86_64 GNU/Linux

And the python version is here.

$ python --version
Python 2.7.4

IDLE's version is the same, 2.7.4, downloaded from the following link.
http://www.python.org/download/releases/2.7.4/

In IDLE, I repeated the original author's attempts.

Python 2.7.4 (default, Apr 6 2013, 19:20:36) [GCC 4.8.0] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> c = u'€'
>>> ord(c)
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    ord(c)
TypeError: ord() expected a character, but string of length 3 found
>>> c.encode('utf-8')
'\xc3\xa2\xc2\x82\xc2\xac'
>>> c
u'\xe2\x82\xac'
>>> print c
â¬
>>> c = u'\u20ac'
>>> ord(c)
8364
>>> c.encode('utf-8')
'\xe2\x82\xac'
>>> c
u'\u20ac'
>>> print c
€
>>>

I have a problem, but it is different from the original one.

After my fix:

Python 2.7.4 (default, Apr 6 2013, 19:20:36) [GCC 4.8.0] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> c = u'€'
>>> ord(c)
8364
>>> c.encode('utf-8')
'\xe2\x82\xac'
>>> c
u'\u20ac'
>>> print c
€
>>>

It works.

Using unicode escapes is one solution. But we Japanese can type u'こんにちは' in just 5 or 10 keystrokes, and other people who use unicode literals for their language are in the same situation. Why should IDLE users (probably beginners) use such a workaround?

Of course, using Python3 is the best way. All beginners should start from Python3 now. But there are people, including me, who have to use python2 because of libraries.
msg187548 - (view)	Author: Ned Deily (ned.deily) *	Date: 2013-04-22 04:46
Also see Issue15809 in which Martin proposed the same patch but then explained why it isn't totally correct.
msg187549 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-04-22 05:32
For those of us without fancy input methods, I was able to see the same problem using a simple non-ascii letter (an accented a: á). You will note that my stdin encoding ought to be utf-8, so I'm not sure why it fails (but I didn't check that). Removing the lines in question makes it work, but as Martin says in the referenced issue, I won't be surprised if that breaks working with a file that has a non-utf8 coding cookie. If Martin sees no way to make it work in Python2, it makes it pretty unlikely we'll find a fix :(.
msg187559 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-04-22 12:45
Serhiy has proposed a patch on the older issue, closing this one since it is a duplicate.

History
Date	User	Action	Args
2022-04-11 14:57:42	admin	set	github: 61550
2013-08-20 15:04:05	THRlWiTi	set	nosy: + THRlWiTi
2013-04-22 12:45:51	r.david.murray	set	status: open -> closed messages: + msg187559 stage: resolved
2013-04-22 10:01:27	serhiy.storchaka	set	superseder: 2.7 IDLE console uses incorrect encoding. resolution: duplicate
2013-04-22 05:32:39	r.david.murray	set	resolution: not a bug -> (no value) messages: + msg187549 stage: resolved -> (no value)
2013-04-22 04:50:19	ned.deily	set	messages: - msg187547
2013-04-22 04:46:23	ned.deily	set	messages: + msg187548
2013-04-22 04:41:51	ned.deily	set	nosy: + ned.deily messages: + msg187547
2013-04-22 03:17:42	Tomoki.Imai	set	messages: + msg187546
2013-04-22 01:26:52	terry.reedy	set	resolution: not a bug messages: + msg187542 stage: resolved
2013-04-22 00:43:32	r.david.murray	set	status: closed -> open nosy: + roger.serwy messages: + msg187541 resolution: not a bug -> (no value) stage: resolved -> (no value)
2013-04-21 23:19:42	Tomoki.Imai	set	messages: + msg187537
2013-04-21 20:03:10	r.david.murray	set	title: Unicode - encoding seems to be lost for inputs of unicode chars -> Unicode - encoding seems to be lost for inputs of unicode chars in IDLE
2013-04-21 20:02:01	r.david.murray	set	nosy: + r.david.murray messages: + msg187519
2013-04-21 16:34:56	Tomoki.Imai	set	files: + PyShell.py.20130422.diff type: behavior components: + IDLE, - Unicode keywords: + patch nosy: + Tomoki.Imai messages: + msg187513
2013-03-09 02:25:07	terry.reedy	set	status: open -> closed components: + Unicode, - IDLE nosy: + ezio.melotti, terry.reedy messages: + msg183783 resolution: not a bug stage: resolved
2013-03-04 10:49:37	pradyunsg	create