classification
Title: compile() cannot decode Latin-1 source encodings
Type: behavior Stage:
Components: Interpreter Core Versions: Python 3.0
process
Status: closed Resolution: accepted
Dependencies: 2384 3594 Superseder:
Assigned To: brett.cannon Nosy List: barry, benjamin.peterson, brett.cannon, loewis, vstinner
Priority: release blocker Keywords: patch

Created on 2008-08-17 00:39 by brett.cannon, last changed 2008-10-17 03:39 by brett.cannon. This issue is now closed.

Files
File name Uploaded Description Edit
fix_latin.diff brett.cannon, 2008-09-05 20:36
test_pep3120.py brett.cannon, 2008-09-05 20:37
iso.py vstinner, 2008-10-04 00:19 Script encoding in ISO-8859-1 and has the #coding header
tokenizer_iso-8859-1-patch2.patch vstinner, 2008-10-04 01:42
tokenizer_iso-8859-1-patch3.patch vstinner, 2008-10-06 21:49
alt_latin_1.diff brett.cannon, 2008-10-07 00:26 Tweak to Victor's patch
Messages (31)
msg71251 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-17 00:39
The following leads to a SyntaxError in 3.0:

  compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec')

That is not the case in Python 2.6.
msg71252 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-17 00:39
Looks like Parser/tokenizer.c:check_coding_spec() considered Latin-1 a
raw encoding just like UTF-8. Patch is in the works.
msg71257 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-17 01:56
Here is a potential fix. It broke test_imp because it assumed that
Latin-1 source files would be encoded at Latin-1 instead of UTF-8 when
returned by imp.new_module(). Doesn't seem like a critical change as the
file is still properly decoded.
msg71258 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-17 01:57
Attached is a test for test_pep3120 (since that is what most likely
introduced the breakage). It's a separate patch since the source file is
marked as binary and thus can't be diffed by ``svn diff``.
msg71401 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-19 02:34
Can someone double-check this patch for me? I don't have much experience
with the parser so I want to make sure I am not doing anything wrong.
msg71402 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-19 02:37
There is a potential dependency on issue3594 as it would change how
imp.find_module() acts and thus make test_imp no longer fail in the way
it has.
msg71403 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-08-19 02:54
That line dates back to the PEP 263 implementation. Martin?
msg71846 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-08-24 17:57
Since this is marked "release blocker", I'll provide a shallow comment:

I don't think it should be a release blocker. It's a bug in the compile
function, and there are various work-arounds (such as saving the bytes
to a temporary file and executing that one, or decoding the byte string
to a Unicode string, and then compiling the Unicode string). It is
sufficient to fix it in 3.0.1.

I don't think the patch is right: as the test had to be changed, it
means that somewhere, the detection of the encoding declaration now
fails. This is clearly a new bug, but I don't have the time to analyse
the cause further.

In principle, there is nothing wrong with the tokenizer treating latin-1
as "raw" - that only means we don't go through a codec.
msg71850 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-24 18:12
Actually, the tests don't have to change; if issue 3594 gets applied
then that change cascades into this issue and negates the need to change
the tests themselves.

As for treating Latin-1 as a raw encoding, how can that be theoretically
okay if the parser assumes UTF-8 and Latin-1 is not a superset of Latin-1?
msg71852 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-08-24 19:19
> As for treating Latin-1 as a raw encoding, how can that be theoretically
> okay if the parser assumes UTF-8 and Latin-1 is not a superset of Latin-1?

The parser doesn't assume UTF-8, but "ascii+", i.e. it passes all
non-ASCII bytes on to the AST, which then needs to deal with them;
it then could (but apparently doesn't) take into account whether the
internal representation was UTF-8 or Latin-1: see ast.c:decode_unicode
for some remains of that.

The other case (besides string literals) where bytes > 127 matter is
tokenizer.c:verify_identifier; this indeed assumes UTF-8 only (but
could be easily extended to support Latin-1 as well).

The third case where non-ASCII bytes are allowed is comments; there
they are entirely ignored (i.e. it is not even verified that the
comment is well-formed UTF-8).

Removal of the special case should simplify the code; I would agree
that any speedup gained by not going through a codec is irrelevant.
I'm still puzzled why test_imp if the special case is removed.
msg71855 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-24 19:53
The test_imp stuff has to do with PyTokenizer_FindEncoding().
imp.find_module() only opens the file, passes the file descriptor to
PyTokenizer_FindEncoding() and then returns a file object with the found
encoding.

Problem is that (as issue 3594 points out), PyTokenizer_FindEncoding()
always fails. That means it assumes only the raw encodings are okay.
With Latin-1 being one of them, it returns the file opened as Latin-1 as
is correct. Removing that case here means PyTokenizer_FindEncoding()
fails, and thus assumes only UTF-8 as a legitimate encoding and opens
the files with the UTF-8 encoding. It took a while to find these two
bugs obviously. =)
msg72624 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-09-05 20:36
I have attached a new version of the patch with the changes to test_imp
removed as issue 3594 fixed the need for the change. I have also
directly uploaded test_pep3120.py since it is flagged as binary and thus
cannot be diffed by svn.
msg74287 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-04 00:19
Using py3k trunk + fix_latin.diff:
 - compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec') 
doesn't fail
 - test_pep3120.py is ok
 - but execute a ISO-8859-1 script fails: see attached iso.py

Original Python3:
$ python iso.py
'Bonjour ma ch\xe8re amie'

Patched Python3:
$ python iso.py
'Bonjour ma ch\xc3\xa8re amie'
msg74288 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-04 00:23
See also Lib/test/test_shlex.py: trunk is ok, but with fix_latin.diff 
the test fails.
msg74291 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-04 00:40
It looks like the problem of fix_latin.diff is the decoding_state: 
it's set to STATE_NORMAL whereas current behaviour is to stay in state 
STATE_RAW.

I wrote another patch which is a mix of case 1 (utf-8: just set 
tok->encoding) and case 2 (another charset: set tok->enc, 
tok->encoding and tok>decoding_state): a new case 3 which set enc, 
encoding but stay a the state STATE_RAW. I don't understand my patch, 
so review it (twice or more :-D). Using my patch:
 - compile(...) works
 - test_shlex.py works
 - test_pep3120.py
 - iso.py works
msg74294 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-04 01:24
But why does iso-8859-1 need to be treated as a special case? UTF-8 is
special because it is the default encoding for source. But iso-8859-1
really shouldn't be special, IMO.

Your patch does exactly what happens lower when set_readline() succeeds.
So I want to know why set_readline() is not returning something for
iso-8859-1 (which your patch seems to be suggesting).
msg74295 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-04 01:28
Sorry, I mis-spoke: your patch, Victor, doesn't change the state to
NORMAL. But my worry still stands; why does iso-8859-1 need to be
special-cased? It suggests to me that some more fundamental needs to be
dealt with instead of just patching around iso-8859-1.
msg74296 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-04 01:42
@brett.cannon: I found it: ast.c used a hack for iso-8859-1! Since 
this hack introduces a bug (your compile(...) example), I prefer to 
remove it to simplify to code. The new patch just removes the hack in 
tokenizer.c and ast.c. It does also document encoding and enc 
attributes of tok_state in tokenizer.h.

Using tokenizer_iso-8859-1-patch2.patch, all tests (test_pep3120.py, 
iso.py, test_shlex.py, etc.) are OK.
msg74299 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-04 02:23
Thanks for finding that, Victor! I will do a patch review when I have a
chance (it won't be until after the weekend).
msg74300 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-04 02:36
After reading tokenizer.c 1000 times, I finally used grep:

$ grep -l -i 'iso.8859.1' $(find -name "*.c")
./Python/ast.c <~~~ WTF?
./Objects/unicodeobject.c
./Parser/tokenizer.c
./Modules/cjkcodecs/_codecs_iso2022.c
./Modules/expat/xmltok.c
msg74393 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-06 21:48
My patch version 2 included an "unrelated" fix for the issue2384.
msg74415 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-07 00:26
So I tried the patch (I attached my version with different comments in
the header file and removed some redundant change in whitespacing), and
test_sys consistently fails for me:

 test_current_frames (__main__.SysModuleTest)
 AssertionError: '' != 'g456()'

And this is after a ``make distclean`` and a new compile. This failure
doesn't occur on a clean checkout made after a 'distclean'.
msg74428 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-07 08:21
test_sys failure is fixed by the issue #2384.
msg74620 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-10-10 08:33
Amaury applied my both patches for issues #2384 and #3975. So all 
tests now pass with python trunk + alt_latin_1.diff.
msg74642 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-10 20:24
Yep, it passes for me now.

Martin, have any objection to this patch?
msg74793 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-15 06:05
The patch looks fine to me, please apply.

I notice that the diff file reports changes to test_pep3120.py. No such
changes should be necessary, so please exclude them from committing.
msg74807 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-15 17:15
On Tue, Oct 14, 2008 at 11:05 PM, Martin v. Löwis
<report@bugs.python.org> wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> The patch looks fine to me, please apply.
>

Great!

> I notice that the diff file reports changes to test_pep3120.py. No such
> changes should be necessary, so please exclude them from committing.
>

I added a test for compile() in there, which is why the patch is
claiming that. There is an uploaded version of test_pep3120.py on the
issue.
msg74811 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-15 19:33
> I added a test for compile() in there, which is why the patch is
> claiming that. There is an uploaded version of test_pep3120.py on the
> issue.

Ah, ok. I missed that - that change is also fine.
msg74892 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-10-17 01:55
Brett, please apply and close the issue.
msg74894 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-17 03:34
Sorry about that; been one of those days.

Doing a svn up and making sure it still compiles fine.
msg74895 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-10-17 03:39
Applied in r66951.
History
Date User Action Args
2008-10-17 03:39:11brett.cannonsetstatus: open -> closed
messages: + msg74895
2008-10-17 03:34:07brett.cannonsetmessages: + msg74894
2008-10-17 01:55:11barrysetnosy: + barry
messages: + msg74892
2008-10-15 19:33:06loewissetmessages: + msg74811
2008-10-15 17:15:34brett.cannonsetmessages: + msg74807
2008-10-15 06:05:29loewissetkeywords: - needs review
assignee: loewis -> brett.cannon
resolution: accepted
messages: + msg74793
2008-10-10 20:24:41brett.cannonsetassignee: loewis
messages: + msg74642
2008-10-10 08:33:03vstinnersetmessages: + msg74620
2008-10-07 08:21:16vstinnersetdependencies: + [Py3k] line number is wrong after encoding declaration
messages: + msg74428
2008-10-07 00:26:56brett.cannonsetfiles: + alt_latin_1.diff
2008-10-07 00:26:18brett.cannonsetassignee: brett.cannon -> (no value)
messages: + msg74415
2008-10-06 21:49:17vstinnersetfiles: + tokenizer_iso-8859-1-patch3.patch
2008-10-06 21:49:07vstinnersetfiles: - python3_bytes_filename-3.patch
2008-10-06 21:48:35vstinnersetfiles: + python3_bytes_filename-3.patch
messages: + msg74393
2008-10-06 21:32:38vstinnersetfiles: - tokenizer_iso-8859-1.patch
2008-10-04 02:36:38vstinnersetmessages: + msg74300
2008-10-04 02:23:42brett.cannonsetassignee: brett.cannon
messages: + msg74299
2008-10-04 01:42:22vstinnersetfiles: + tokenizer_iso-8859-1-patch2.patch
messages: + msg74296
2008-10-04 01:28:14brett.cannonsetmessages: + msg74295
2008-10-04 01:24:43brett.cannonsetmessages: + msg74294
2008-10-04 00:40:19vstinnersetfiles: + tokenizer_iso-8859-1.patch
messages: + msg74291
2008-10-04 00:23:23vstinnersetmessages: + msg74288
2008-10-04 00:19:39vstinnersetfiles: + iso.py
nosy: + vstinner
messages: + msg74287
2008-10-02 12:54:23barrysetpriority: deferred blocker -> release blocker
2008-09-26 22:18:34barrysetpriority: release blocker -> deferred blocker
2008-09-18 05:43:20barrysetpriority: deferred blocker -> release blocker
2008-09-09 13:12:01barrysetpriority: release blocker -> deferred blocker
2008-09-05 20:37:13brett.cannonsetfiles: - pep3120_test.diff
2008-09-05 20:37:07brett.cannonsetfiles: + test_pep3120.py
2008-09-05 20:36:41brett.cannonsetfiles: - fix_latin.diff
2008-09-05 20:36:29brett.cannonsetfiles: + fix_latin.diff
messages: + msg72624
2008-08-24 19:53:27brett.cannonsetmessages: + msg71855
2008-08-24 19:19:03loewissetmessages: + msg71852
2008-08-24 18:12:06brett.cannonsetmessages: + msg71850
2008-08-24 17:57:03loewissetmessages: + msg71846
2008-08-21 20:33:57brett.cannonsetkeywords: + needs review
2008-08-21 18:35:23brett.cannonsetpriority: critical -> release blocker
2008-08-19 02:54:03benjamin.petersonsetnosy: + loewis, benjamin.peterson
messages: + msg71403
2008-08-19 02:37:05brett.cannonsetdependencies: + PyTokenizer_FindEncoding() never succeeds
messages: + msg71402
2008-08-19 02:34:23brett.cannonsetmessages: + msg71401
2008-08-17 01:58:08brett.cannonsettype: behavior
2008-08-17 01:57:59brett.cannonsetpriority: critical
files: + pep3120_test.diff
messages: + msg71258
components: + Interpreter Core
versions: + Python 3.0
2008-08-17 01:56:36brett.cannonsetfiles: + fix_latin.diff
keywords: + patch
messages: + msg71257
2008-08-17 00:39:59brett.cannonsetmessages: + msg71252
2008-08-17 00:39:39brett.cannoncreate