Title: detect_encoding should fail with SyntaxError on invalid encoding
Type: behavior
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 3.2, Python 3.3
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, flox, haypo, python-dev
Priority: normal
Keywords: patch

Created on 2012-06-03 10:29 by flox, last changed 2012-07-07 10:29 by flox. This issue is now closed.

Files
issue14990_detect_encoding.diff (uploaded by flox, 2012-06-03 10:31)
Messages (7)
msg162205 - Author: Florent Xicluna (flox) * (Python committer) Date: 2012-06-03 10:29
I've hit this issue while playing with tokenize for the module.

tokenize.detect_encoding() should raise SyntaxError when the encoding is improperly declared.

However, it raises LookupError in some cases.

$ ./python -m tokenize Lib/test/ 
unexpected error: unknown encoding: utf8-sig
Traceback (most recent call last):
  File "./Lib/", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "./Lib/", line 75, in _run_code
    exec(code, run_globals)
  File "./Lib/", line 686, in <module>
  File "./Lib/", line 656, in main
    tokens = list(tokenize(f.readline))
  File "./Lib/", line 489, in _tokenize
    line = line.decode(encoding)
LookupError: unknown encoding: utf8-sig
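
For reference, the expected contract can be sketched with a small helper. This is only an illustration: the helper name `sniff_encoding` and the sample inputs are made up here and are not taken from the attached patch. The point is that a caller of `tokenize.detect_encoding()` should only need to handle SyntaxError for a bad declaration, so the helper lets any other exception type (such as LookupError) propagate.

```python
import io
import tokenize


def sniff_encoding(source: bytes):
    """Return the detected encoding, or the SyntaxError raised for a bad declaration.

    Hypothetical helper for illustration; any other exception type
    (such as LookupError) is deliberately left to propagate.
    """
    readline = io.BytesIO(source).readline
    try:
        encoding, _lines = tokenize.detect_encoding(readline)
    except SyntaxError as exc:
        return exc
    return encoding


# A well-formed declaration is normalized and returned.
print(sniff_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))

# An unknown codec name is reported as SyntaxError, not LookupError.
print(repr(sniff_encoding(b"# -*- coding: no-such-codec -*-\nx = 1\n")))
```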
msg162206 - Author: Florent Xicluna (flox) * (Python committer) Date: 2012-06-03 10:31
This patch seems to fix the issue.
msg162303 - Author: STINNER Victor (haypo) * (Python committer) Date: 2012-06-04 23:06
The patch is correct according to PEP 263:

    If a source file uses both the UTF-8 BOM mark signature and a
    magic encoding comment, the only allowed encoding for the comment
    is 'utf-8'.  Any other encoding will cause an error.

The fix should also be applied to 3.2.

(Note: Python 3.1 doesn't accept bugfixes anymore.)
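
The PEP 263 rule quoted above can be checked against `tokenize.detect_encoding()` with a short script. This is a minimal sketch, assuming an interpreter that includes the fix; the helper name `detect` and the in-memory sources are hypothetical and exist only for this illustration.

```python
import codecs
import io
import tokenize


def detect(source: bytes) -> str:
    """Run tokenize.detect_encoding() over an in-memory source (illustrative helper)."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


bom = codecs.BOM_UTF8  # the UTF-8 BOM signature, b'\xef\xbb\xbf'

# BOM plus a matching 'utf-8' cookie is accepted.
print(detect(bom + b"# -*- coding: utf-8 -*-\n"))

# BOM plus any other declared encoding must be rejected.
try:
    detect(bom + b"# -*- coding: latin-1 -*-\n")
except SyntaxError as exc:
    print("SyntaxError:", exc)
```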
msg162428 - Author: Florent Xicluna (flox) * (Python committer) Date: 2012-06-06 23:11
It should raise a SyntaxError if the coding is 'utf8'.
I don't agree with the last patch proposed.

If the import reports a SyntaxError, 'tokenize' should do the same.

$ ./python Lib/test/
  File "Lib/test/", line 1
SyntaxError: encoding problem: utf-8

This complies strictly with PEP 263.
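
The parity argument can be illustrated by feeding the same bytes to the compiler and to tokenize. This is a rough sketch under stated assumptions: the built-in `compile()` is used here as a stand-in for what import does with the source, the sample bytes are invented for the example, and the helper names `compiler_outcome` and `tokenize_outcome` are hypothetical.

```python
import codecs
import io
import tokenize


def compiler_outcome(source: bytes):
    """Return the exception type raised by compile(), or None on success."""
    try:
        compile(source, "<demo>", "exec")
    except Exception as exc:
        return type(exc)
    return None


def tokenize_outcome(source: bytes):
    """Return the exception type raised by detect_encoding(), or None on success."""
    try:
        tokenize.detect_encoding(io.BytesIO(source).readline)
    except Exception as exc:
        return type(exc)
    return None


# Hypothetical sample: UTF-8 BOM followed by a non-UTF-8 coding cookie.
sample = codecs.BOM_UTF8 + b"# -*- coding: latin-1 -*-\nx = 1\n"
print(compiler_outcome(sample), tokenize_outcome(sample))
```

With the fix applied, both helpers are expected to report the same exception type for such a file.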
msg162429 - Author: STINNER Victor (haypo) * (Python committer) Date: 2012-06-06 23:13
Oops, I didn't want to attach my patch to the issue. Mine is wrong, whereas yours is the right fix :-)
msg164811 - Author: Roundup Robot (python-dev) Date: 2012-07-07 10:27
New changeset 5020afc0b7c9 by Florent Xicluna in branch '3.2':
Issue #14990: tokenize: correctly fail with SyntaxError on invalid encoding declaration.
msg164812 - Author: Florent Xicluna (flox) * (Python committer) Date: 2012-07-07 10:29
Thanks. Fixed in trunk too, changeset b4322ad1fec4
History
Date                 User        Action  Args
2012-07-07 10:29:50  flox        set     status: open -> closed; resolution: fixed; messages: + msg164812; stage: patch review -> resolved
2012-07-07 10:27:14  python-dev  set     nosy: + python-dev; messages: + msg164811
2012-06-06 23:13:55  haypo       set     messages: + msg162429
2012-06-06 23:13:32  haypo       set     files: - detect_encoding.patch
2012-06-06 23:11:30  flox        set     messages: + msg162428
2012-06-04 23:06:01  haypo       set     files: + detect_encoding.patch; versions: - Python 3.1; nosy: + ezio.melotti, haypo; messages: + msg162303; components: + Unicode
2012-06-03 10:31:05  flox        set     files: + issue14990_detect_encoding.diff; keywords: + patch; messages: + msg162206; stage: needs patch -> patch review
2012-06-03 10:29:02  flox        create