Issue 3574: compile() cannot decode Latin-1 source encodings

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47824

classification

Title:	compile() cannot decode Latin-1 source encodings
Type:	behavior	Stage:
Components:	Interpreter Core	Versions:	Python 3.0

process

Status:	closed	Resolution:	accepted
Dependencies:	2384 3594	Superseder:
Assigned To:	brett.cannon	Nosy List:	barry, benjamin.peterson, brett.cannon, loewis, vstinner
Priority:	release blocker	Keywords:	patch

Created on 2008-08-17 00:39 by brett.cannon, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
fix_latin.diff	brett.cannon, 2008-09-05 20:36
test_pep3120.py	brett.cannon, 2008-09-05 20:37
iso.py	vstinner, 2008-10-04 00:19	Script encoding in ISO-8859-1 and has the #coding header
tokenizer_iso-8859-1-patch2.patch	vstinner, 2008-10-04 01:42
tokenizer_iso-8859-1-patch3.patch	vstinner, 2008-10-06 21:49
alt_latin_1.diff	brett.cannon, 2008-10-07 00:26	Tweak to Victor's patch

Messages (31)
msg71251 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-17 00:39
The following leads to a SyntaxError in 3.0: compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec') That is not the case in Python 2.6.
msg71252 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-17 00:39
Looks like Parser/tokenizer.c:check_coding_spec() considered Latin-1 a raw encoding just like UTF-8. Patch is in the works.
msg71257 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-17 01:56
Here is a potential fix. It broke test_imp because it assumed that Latin-1 source files would be encoded at Latin-1 instead of UTF-8 when returned by imp.new_module(). Doesn't seem like a critical change as the file is still properly decoded.
msg71258 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-17 01:57
Attached is a test for test_pep3120 (since that is what most likely introduced the breakage). It's a separate patch since the source file is marked as binary and thus can't be diffed by ``svn diff``.
msg71401 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-19 02:34
Can someone double-check this patch for me? I don't have much experience with the parser so I want to make sure I am not doing anything wrong.
msg71402 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-19 02:37
There is a potential dependency on issue3594 as it would change how imp.find_module() acts and thus make test_imp no longer fail in the way it has.
msg71403 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2008-08-19 02:54
That line dates back to the PEP 263 implementation. Martin?
msg71846 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-08-24 17:57
Since this is marked "release blocker", I'll provide a shallow comment: I don't think it should be a release blocker. It's a bug in the compile function, and there are various work-arounds (such as saving the bytes to a temporary file and executing that one, or decoding the byte string to a Unicode string, and then compiling the Unicode string). It is sufficient to fix it in 3.0.1. I don't think the patch is right: as the test had to be changed, it means that somewhere, the detection of the encoding declaration now fails. This is clearly a new bug, but I don't have the time to analyse the cause further. In principle, there is nothing wrong with the tokenizer treating latin-1 as "raw" - that only means we don't go through a codec.
msg71850 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-24 18:12
Actually, the tests don't have to change; if issue 3594 gets applied then that change cascades into this issue and negates the need to change the tests themselves. As for treating Latin-1 as a raw encoding, how can that be theoretically okay if the parser assumes UTF-8 and Latin-1 is not a superset of Latin-1?
msg71852 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-08-24 19:19
> As for treating Latin-1 as a raw encoding, how can that be theoretically > okay if the parser assumes UTF-8 and Latin-1 is not a superset of Latin-1? The parser doesn't assume UTF-8, but "ascii+", i.e. it passes all non-ASCII bytes on to the AST, which then needs to deal with them; it then could (but apparently doesn't) take into account whether the internal representation was UTF-8 or Latin-1: see ast.c:decode_unicode for some remains of that. The other case (besides string literals) where bytes > 127 matter is tokenizer.c:verify_identifier; this indeed assumes UTF-8 only (but could be easily extended to support Latin-1 as well). The third case where non-ASCII bytes are allowed is comments; there they are entirely ignored (i.e. it is not even verified that the comment is well-formed UTF-8). Removal of the special case should simplify the code; I would agree that any speedup gained by not going through a codec is irrelevant. I'm still puzzled why test_imp if the special case is removed.
msg71855 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-08-24 19:53
The test_imp stuff has to do with PyTokenizer_FindEncoding(). imp.find_module() only opens the file, passes the file descriptor to PyTokenizer_FindEncoding() and then returns a file object with the found encoding. Problem is that (as issue 3594 points out), PyTokenizer_FindEncoding() always fails. That means it assumes only the raw encodings are okay. With Latin-1 being one of them, it returns the file opened as Latin-1 as is correct. Removing that case here means PyTokenizer_FindEncoding() fails, and thus assumes only UTF-8 as a legitimate encoding and opens the files with the UTF-8 encoding. It took a while to find these two bugs obviously. =)
msg72624 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-09-05 20:36
I have attached a new version of the patch with the changes to test_imp removed as issue 3594 fixed the need for the change. I have also directly uploaded test_pep3120.py since it is flagged as binary and thus cannot be diffed by svn.
msg74287 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-04 00:19
Using py3k trunk + fix_latin.diff: - compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec') doesn't fail - test_pep3120.py is ok - but execute a ISO-8859-1 script fails: see attached iso.py Original Python3: $ python iso.py 'Bonjour ma ch\xe8re amie' Patched Python3: $ python iso.py 'Bonjour ma ch\xc3\xa8re amie'
msg74288 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-04 00:23
See also Lib/test/test_shlex.py: trunk is ok, but with fix_latin.diff the test fails.
msg74291 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-04 00:40
It looks like the problem of fix_latin.diff is the decoding_state: it's set to STATE_NORMAL whereas current behaviour is to stay in state STATE_RAW. I wrote another patch which is a mix of case 1 (utf-8: just set tok->encoding) and case 2 (another charset: set tok->enc, tok->encoding and tok>decoding_state): a new case 3 which set enc, encoding but stay a the state STATE_RAW. I don't understand my patch, so review it (twice or more :-D). Using my patch: - compile(...) works - test_shlex.py works - test_pep3120.py - iso.py works
msg74294 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-04 01:24
But why does iso-8859-1 need to be treated as a special case? UTF-8 is special because it is the default encoding for source. But iso-8859-1 really shouldn't be special, IMO. Your patch does exactly what happens lower when set_readline() succeeds. So I want to know why set_readline() is not returning something for iso-8859-1 (which your patch seems to be suggesting).
msg74295 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-04 01:28
Sorry, I mis-spoke: your patch, Victor, doesn't change the state to NORMAL. But my worry still stands; why does iso-8859-1 need to be special-cased? It suggests to me that some more fundamental needs to be dealt with instead of just patching around iso-8859-1.
msg74296 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-04 01:42
@brett.cannon: I found it: ast.c used a hack for iso-8859-1! Since this hack introduces a bug (your compile(...) example), I prefer to remove it to simplify to code. The new patch just removes the hack in tokenizer.c and ast.c. It does also document encoding and enc attributes of tok_state in tokenizer.h. Using tokenizer_iso-8859-1-patch2.patch, all tests (test_pep3120.py, iso.py, test_shlex.py, etc.) are OK.
msg74299 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-04 02:23
Thanks for finding that, Victor! I will do a patch review when I have a chance (it won't be until after the weekend).
msg74300 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-04 02:36
After reading tokenizer.c 1000 times, I finally used grep: $ grep -l -i 'iso.8859.1' $(find -name "*.c") ./Python/ast.c <~~~ WTF? ./Objects/unicodeobject.c ./Parser/tokenizer.c ./Modules/cjkcodecs/_codecs_iso2022.c ./Modules/expat/xmltok.c
msg74393 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-06 21:48
My patch version 2 included an "unrelated" fix for the issue2384.
msg74415 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-07 00:26
So I tried the patch (I attached my version with different comments in the header file and removed some redundant change in whitespacing), and test_sys consistently fails for me: test_current_frames (__main__.SysModuleTest) AssertionError: '' != 'g456()' And this is after a ``make distclean`` and a new compile. This failure doesn't occur on a clean checkout made after a 'distclean'.
msg74428 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-07 08:21
test_sys failure is fixed by the issue #2384.
msg74620 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-10 08:33
Amaury applied my both patches for issues #2384 and #3975. So all tests now pass with python trunk + alt_latin_1.diff.
msg74642 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-10 20:24
Yep, it passes for me now. Martin, have any objection to this patch?
msg74793 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-10-15 06:05
The patch looks fine to me, please apply. I notice that the diff file reports changes to test_pep3120.py. No such changes should be necessary, so please exclude them from committing.
msg74807 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-15 17:15
On Tue, Oct 14, 2008 at 11:05 PM, Martin v. Löwis <report@bugs.python.org> wrote: > > Martin v. Löwis <martin@v.loewis.de> added the comment: > > The patch looks fine to me, please apply. > Great! > I notice that the diff file reports changes to test_pep3120.py. No such > changes should be necessary, so please exclude them from committing. > I added a test for compile() in there, which is why the patch is claiming that. There is an uploaded version of test_pep3120.py on the issue.
msg74811 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-10-15 19:33
> I added a test for compile() in there, which is why the patch is > claiming that. There is an uploaded version of test_pep3120.py on the > issue. Ah, ok. I missed that - that change is also fine.
msg74892 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2008-10-17 01:55
Brett, please apply and close the issue.
msg74894 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-17 03:34
Sorry about that; been one of those days. Doing a svn up and making sure it still compiles fine.
msg74895 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2008-10-17 03:39
Applied in r66951.

History
Date	User	Action	Args
2022-04-11 14:56:37	admin	set	github: 47824
2008-10-17 03:39:11	brett.cannon	set	status: open -> closed messages: + msg74895
2008-10-17 03:34:07	brett.cannon	set	messages: + msg74894
2008-10-17 01:55:11	barry	set	nosy: + barry messages: + msg74892
2008-10-15 19:33:06	loewis	set	messages: + msg74811
2008-10-15 17:15:34	brett.cannon	set	messages: + msg74807
2008-10-15 06:05:29	loewis	set	keywords: - needs review assignee: loewis -> brett.cannon resolution: accepted messages: + msg74793
2008-10-10 20:24:41	brett.cannon	set	assignee: loewis messages: + msg74642
2008-10-10 08:33:03	vstinner	set	messages: + msg74620
2008-10-07 08:21:16	vstinner	set	dependencies: + [Py3k] line number is wrong after encoding declaration messages: + msg74428
2008-10-07 00:26:56	brett.cannon	set	files: + alt_latin_1.diff
2008-10-07 00:26:18	brett.cannon	set	assignee: brett.cannon -> (no value) messages: + msg74415
2008-10-06 21:49:17	vstinner	set	files: + tokenizer_iso-8859-1-patch3.patch
2008-10-06 21:49:07	vstinner	set	files: - python3_bytes_filename-3.patch
2008-10-06 21:48:35	vstinner	set	files: + python3_bytes_filename-3.patch messages: + msg74393
2008-10-06 21:32:38	vstinner	set	files: - tokenizer_iso-8859-1.patch
2008-10-04 02:36:38	vstinner	set	messages: + msg74300
2008-10-04 02:23:42	brett.cannon	set	assignee: brett.cannon messages: + msg74299
2008-10-04 01:42:22	vstinner	set	files: + tokenizer_iso-8859-1-patch2.patch messages: + msg74296
2008-10-04 01:28:14	brett.cannon	set	messages: + msg74295
2008-10-04 01:24:43	brett.cannon	set	messages: + msg74294
2008-10-04 00:40:19	vstinner	set	files: + tokenizer_iso-8859-1.patch messages: + msg74291
2008-10-04 00:23:23	vstinner	set	messages: + msg74288
2008-10-04 00:19:39	vstinner	set	files: + iso.py nosy: + vstinner messages: + msg74287
2008-10-02 12:54:23	barry	set	priority: deferred blocker -> release blocker
2008-09-26 22:18:34	barry	set	priority: release blocker -> deferred blocker
2008-09-18 05:43:20	barry	set	priority: deferred blocker -> release blocker
2008-09-09 13:12:01	barry	set	priority: release blocker -> deferred blocker
2008-09-05 20:37:13	brett.cannon	set	files: - pep3120_test.diff
2008-09-05 20:37:07	brett.cannon	set	files: + test_pep3120.py
2008-09-05 20:36:41	brett.cannon	set	files: - fix_latin.diff
2008-09-05 20:36:29	brett.cannon	set	files: + fix_latin.diff messages: + msg72624
2008-08-24 19:53:27	brett.cannon	set	messages: + msg71855
2008-08-24 19:19:03	loewis	set	messages: + msg71852
2008-08-24 18:12:06	brett.cannon	set	messages: + msg71850
2008-08-24 17:57:03	loewis	set	messages: + msg71846
2008-08-21 20:33:57	brett.cannon	set	keywords: + needs review
2008-08-21 18:35:23	brett.cannon	set	priority: critical -> release blocker
2008-08-19 02:54:03	benjamin.peterson	set	nosy: + loewis, benjamin.peterson messages: + msg71403
2008-08-19 02:37:05	brett.cannon	set	dependencies: + PyTokenizer_FindEncoding() never succeeds messages: + msg71402
2008-08-19 02:34:23	brett.cannon	set	messages: + msg71401
2008-08-17 01:58:08	brett.cannon	set	type: behavior
2008-08-17 01:57:59	brett.cannon	set	priority: critical files: + pep3120_test.diff messages: + msg71258 components: + Interpreter Core versions: + Python 3.0
2008-08-17 01:56:36	brett.cannon	set	files: + fix_latin.diff keywords: + patch messages: + msg71257
2008-08-17 00:39:59	brett.cannon	set	messages: + msg71252
2008-08-17 00:39:39	brett.cannon	create