classification
Title: IDLE: checksyntax() doesn't support Unicode?
Type: crash
Stage:
Components: IDLE
Versions: Python 3.0
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: geon, haypo, loewis, terry.reedy
Priority: release blocker
Keywords: needs review, patch

Created on 2008-10-01 15:37 by haypo, last changed 2009-01-18 20:18 by loewis. This issue is now closed.

Files
File name Uploaded Description
idle-3.0rc1-quits-when-run.py haypo, 2008-10-01 15:37
idle_encoding-3.patch haypo, 2008-10-02 21:49 Use tokenize.detect_encoding() to detect Python script encoding
iso.py haypo, 2008-10-03 22:37 Example of non-utf8 file (coding: ISO-8859-1)
idle_encoding_4.patch loewis, 2008-12-29 19:48
Messages (16)
msg74131 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-01 15:37
IDLE's checksyntax() function doesn't support Unicode. Example with
idle-3.0rc1-quits-when-run.py in an ASCII terminal:

$ ./python Tools/scripts/idle
Exception in Tkinter callback
Traceback (most recent call last):
  File "/home/haypo/prog/py3k/Lib/tkinter/__init__.py", line 1405, in __call__
    return self.func(*args)
  File "/home/haypo/prog/py3k/Lib/idlelib/ScriptBinding.py", line 124, in run_module_event
    code = self.checksyntax(filename)
  File "/home/haypo/prog/py3k/Lib/idlelib/ScriptBinding.py", line 86, in checksyntax
    source = f.read()
  File "/home/haypo/prog/py3k/Lib/io.py", line 1719, in read
    decoder.decode(self.buffer.read(), final=True))
  File "/home/haypo/prog/py3k/Lib/io.py", line 1294, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/haypo/prog/py3k/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 87: ordinal not in range(128)

To open an ASCII terminal on Linux, you can for example use xterm with 
an empty environment (except DISPLAY and HOME variables): "env -i 
DISPLAY=$DISPLAY HOME=$HOME xterm".
msg74134 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-01 16:13
Hmm, the problem is that IDLE lets io.open() pick the charset, but
open() doesn't know about the #coding: header. So if your locale is
ASCII, CP1252 or anything other than UTF-8, reading the file will
fail.

I wrote a patch to detect the encoding. The Python code (a
detect_encoding() function) is based on PyTokenizer_FindEncoding() and
get_coding_spec() (from Parser/tokenizer.c). Is there no existing
Python function to detect the encoding of a Python script (a public
function available to Python code)?
msg74160 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-02 14:11
Ah! tokenize already has a function, detect_encoding(). My new patch
uses it to avoid code duplication.
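
For illustration, here is a minimal sketch of reading a script through tokenize.detect_encoding(). The helper name read_source() and the throwaway demo file are made up for this example; this is not the code from the attached patch.

```python
import os
import tempfile
import tokenize


def read_source(filename):
    """Return (encoding, text) for a Python source file.

    Hypothetical helper, not part of IDLE or of the attached patch.
    """
    with open(filename, 'rb') as f:
        # detect_encoding() looks for a UTF-8 BOM and a coding cookie
        # in the first two lines, and defaults to UTF-8 otherwise.
        encoding, _ = tokenize.detect_encoding(f.readline)
        f.seek(0)
        data = f.read()
    return encoding, data.decode(encoding)


# Demo with a throwaway ISO-8859-1 file.
fd, demo_path = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'wb') as f:
    f.write("# -*- coding: iso-8859-1 -*-\nname = 'é'\n".encode('iso-8859-1'))
demo_encoding, demo_text = read_source(demo_path)
os.remove(demo_path)
print(demo_encoding)
```

Opening the file in binary mode keeps the locale out of the picture: the only encoding used is the one the file declares.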
msg74161 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-02 14:29
Notice that there is also IOBinding.coding_spec. Not sure whether this
or the one in tokenize is more correct.
msg74197 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-02 21:49
loewis wrote:
> Notice that there is also IOBinding.coding_spec.
> Not sure whether this or the one in tokenize is more correct.

Oh! IOBinding reimplements many features now available in Python, like
universal newlines or a function to write Unicode strings to a file.
But I don't want to rewrite IDLE, I just want to fix the initial
problem: IDLE is unable to open a non-ASCII file using a "#coding:"
header.

So IDLE implemented coding detection twice: once in IOBinding and
once in ScriptBinding. I wrote a new version of my patch that removes
all of that code and reuses tokenize.detect_encoding().

I changed IDLE's behaviour: IOBinding._decode() used the locale
encoding when it could not detect the encoding from a UTF-8 BOM or a
#coding: header. Since I also read the comment "Finally, try the
locale's encoding. This is deprecated", I prefer to remove that
fallback. If you want to keep the current behaviour, use:
-------------------------
def detect_encoding(filename, default=None):
    with open(filename, 'rb') as f:
        encoding, line = tokenize.detect_encoding(f.readline)
    if (not line) and default:
        return default
    return encoding
...
            encoding = detect_encoding(filename, locale_encoding)
-------------------------

Please review and test my patch (which becomes longer and longer) :-)
msg74202 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-02 22:33
> Oh! IOBinding reimplement many features now available in Python like 
> universal new line or function to write unicode strings to a file.

It did not *re*implement. The implementation in IOBinding predates all
other implementations out there.
msg74207 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-02 23:05
@loewis: OK, I didn't know. I think that it's better to reuse existing
code.

I also compared the implementations of encoding detection. The code
looks the same in IDLE and tokenize, but I prefer tokenize:
tokenize.detect_encoding() has longer documentation, returns the line
(decoded as Unicode) matching the encoding cookie, and looks more
robust.

I saw an interesting test in the IDLE code: it checks the charset. So I
wrote a patch for tokenize that raises a SyntaxError: issue4021.
msg74210 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-02 23:19
I can't reproduce the problem. It works fine for me, displaying the box
drawing character. In case it matters, locale.getpreferredencoding
returns 'ANSI_X3.4-1968'; this is on Linux, idle started from an xterm,
r66761.
msg74280 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-03 22:37
@loewis: I guess that your locale is still UTF-8.

On Linux (Ubuntu Gutsy) using "env -i DISPLAY=$DISPLAY HOME=$HOME 
xterm" to get a new empty environment, I get:

$ locale
LANG=
LC_ALL=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
...
$ python3.0
>>> from idlelib.IOBinding import encoding
>>> encoding 
'ansi_x3.4-1968'
>>> import locale
>>> locale.getdefaultlocale()
(None, None)
>>> locale.nl_langinfo(locale.CODESET)
'ANSI_X3.4-1968'

In this environment, IDLE is unable to detect 
idle-3.0rc1-quits-when-run.py encoding.

IDLE uses open(filename, 'r'): it doesn't specify the charset. In this 
case, TextIOWrapper uses locale.getpreferredencoding() as encoding (or 
ASCII on failure).
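
A small sketch of that failure mode, for illustration only. Passing encoding='ascii' explicitly stands in for an ASCII locale, and the temporary file is an assumption of this example, not the attached script.

```python
import locale
import os
import tempfile

# The encoding open(filename, 'r') falls back to when none is given.
# The value depends on the environment.
print(locale.getpreferredencoding())

# The box drawing character U+2500 is encoded in UTF-8 as e2 94 80.
fd, path = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'wb') as f:
    f.write('# -*- coding: utf-8 -*-\nprint("─")\n'.encode('utf-8'))

# Simulate an ASCII locale by passing the encoding explicitly.
try:
    with open(path, 'r', encoding='ascii') as f:
        f.read()
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# With the encoding the file actually declares, reading succeeds.
with open(path, 'r', encoding='utf-8') as f:
    utf8_text = f.read()
os.remove(path)
```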

To sum up IDLE's behaviour: if your locale is UTF-8, you will be able
to open a UTF-8 file, but you won't be able to open, for example, an
ISO-8859-1 file. Let's try iso.py: IDLE displays the error "Failed to
decode" and quits even though I specified the encoding :-/
msg74303 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-04 08:00
> @loewis: I guess that your locale is still UTF-8.

To refute this claim, I reported that locale.getpreferredencoding
reports 'ANSI_X3.4-1968'. I was following your instructions exactly
(on Debian 4.0), and still, it opens successfully (when loaded through
File/Open). Should I do something else with it to trigger the error,
other than opening it?

When opening iso.py, I get a pop-up window titled "Decoding error",
with the message "Failed to Decode". This seems to be correct also.

So I still can't reproduce the problem.

I don't understand why you say that IDLE uses open(filename, 'r').
In IOBinding.IOBinding.loadfile, I see

            # open the file in binary mode so that we can handle
            # end-of-line convention ourselves.
            f = open(filename,'rb')
msg74312 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-04 11:24
IDLE opens the script more than once. There are two cases:
 (1) the first open, when IDLE reads the file content to display it
 (2) the second open, on pressing the F5 key (Run Module), to check
     the syntax

(1) uses IOBinding and fails to open an ISO-8859-1 file with a UTF-8
locale.

(2) uses ScriptBinding and fails to open a UTF-8 file with an ASCII
locale.

About the initial problem (idle-3.0rc1-quits-when-run.py): yes, I
forgot to say that you have to run the module, sorry :-/
msg76052 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-19 15:38
This patch has two problems:
1. Saving files fails, since there is still a call left to the function
coding_spec, but that function is removed.
2. Even if saving worked, it wouldn't preserve the line endings of the
original file when writing it back. If you open files with DOS line
endings on Unix, upon saving, they should still have DOS line endings.
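
A rough sketch of the second requirement, assuming the file is read and written in binary mode. The helpers detect_eol(), load() and save() are hypothetical names for this example, not IOBinding's actual functions, and files with mixed line endings are not handled.

```python
import os
import tempfile


def detect_eol(data):
    """Guess the line-ending convention of raw file bytes."""
    if b'\r\n' in data:
        return '\r\n'
    if b'\r' in data:
        return '\r'
    return '\n'


def load(filename, encoding='utf-8'):
    """Return (text with '\\n' newlines, original line ending)."""
    with open(filename, 'rb') as f:
        data = f.read()
    eol = detect_eol(data)
    text = data.decode(encoding)
    # Normalise to '\n' while the text is being edited.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    return text, eol


def save(filename, text, eol, encoding='utf-8'):
    """Write text back using the line ending remembered at load time."""
    data = text.replace('\n', eol).encode(encoding)
    with open(filename, 'wb') as f:
        f.write(data)


# Round trip: a file with DOS line endings keeps them after an edit.
fd, eol_path = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'wb') as f:
    f.write(b'a = 1\r\nb = 2\r\n')
text, eol = load(eol_path)
save(eol_path, text + 'c = 3\n', eol)
with open(eol_path, 'rb') as f:
    saved = f.read()
os.remove(eol_path)
```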
msg76579 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-11-29 02:15
This is still a problem on my WinXP 3.0rc3 with
# -*- coding: utf-8 -*-
in a file but not with the same pasted directly into the shell Window.
msg78479 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-12-29 19:48
Here is a new patch that fixes this issue, and the duplicate issues
(#4410, and #4623).

It doesn't try to eliminate code duplication, but fixes coding_spec by
always decoding to Latin-1 first until the coding is known. It fixes
check_syntax by opening the source file in binary. It should have fixed
tabnanny the same way, except that tabnanny cannot properly process byte
tokens.
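
The two ideas can be sketched roughly as follows. This is not the committed patch: the regular expression only approximates the PEP 263 declaration, and both function bodies are assumptions made for illustration.

```python
import re

# Approximation of the PEP 263 coding declaration.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')


def coding_spec(data):
    """Return the encoding declared in the first two lines, or None.

    data is the raw bytes of the file. Latin-1 maps every byte to one
    code point, so this decoding step can never fail.
    """
    for raw_line in data.splitlines()[:2]:
        match = CODING_RE.match(raw_line.decode('latin-1'))
        if match:
            return match.group(1)
    return None


def check_syntax(data, filename='<demo>'):
    """Compile raw bytes; compile() honours the coding declaration."""
    return compile(data, filename, 'exec')


source = "# -*- coding: iso-8859-1 -*-\nname = 'é'\n".encode('iso-8859-1')
declared = coding_spec(source)
namespace = {}
exec(check_syntax(source), namespace)
```

Handing bytes to compile() means the locale encoding is not involved in the syntax check at all.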
msg78933 - (view) Author: Pavel Kosina (geon) Date: 2009-01-03 05:00
I vote for fixing this too. This might be a simplified version, or
another example, of the issues mentioned above:

# -*- coding: utf-8 -*-
print ("ěščřžýáíé")

in IDLE prints this:
>>> 
ěščřžýáíé

When running this script with the python command line from another
editor, I get readable output, as expected.
msg80119 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-01-18 20:18
Committed as r68730 and r68731.
History
Date User Action Args
2009-01-18 20:18:13 loewis set status: open -> closed
resolution: fixed
messages: + msg80119
2009-01-03 05:00:24 geon set nosy: + geon
messages: + msg78933
2008-12-29 19:48:27 loewis set priority: release blocker
keywords: + needs review
messages: + msg78479
files: + idle_encoding_4.patch
2008-12-29 19:42:25 loewis link issue4410 superseder
2008-12-29 19:41:55 loewis link issue4623 superseder
2008-12-04 23:14:22 amaury.forgeotdarc link issue4530 superseder
2008-11-29 02:15:41 terry.reedy set nosy: + terry.reedy
type: crash
messages: + msg76579
2008-11-29 01:37:16 amaury.forgeotdarc link issue4454 superseder
2008-11-19 15:38:47 loewis set messages: + msg76052
2008-10-06 21:58:37 haypo set files: - idle_encoding-2.patch
2008-10-04 11:24:20 haypo set messages: + msg74312
2008-10-04 08:00:53 loewis set messages: + msg74303
2008-10-03 22:37:12 haypo set files: + iso.py
messages: + msg74280
2008-10-02 23:19:45 loewis set messages: + msg74210
2008-10-02 23:05:39 haypo set messages: + msg74207
2008-10-02 22:33:35 loewis set messages: + msg74202
2008-10-02 21:49:13 haypo set files: + idle_encoding-3.patch
messages: + msg74197
2008-10-02 14:29:28 loewis set nosy: + loewis
messages: + msg74161
2008-10-02 14:11:18 haypo set files: - idle_encoding.patch
2008-10-02 14:11:13 haypo set files: + idle_encoding-2.patch
messages: + msg74160
2008-10-01 16:13:59 haypo set files: + idle_encoding.patch
keywords: + patch
messages: + msg74134
2008-10-01 15:37:54 haypo create