Title: PyTokenizer_FindEncoding() never succeeds
Type: behavior Stage:
Components: Extension Modules Versions: Python 3.0
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, brett.cannon, christian.heimes, hyeshik.chang, pitrou
Priority: release blocker Keywords: needs review, patch

Created on 2008-08-19 01:19 by brett.cannon, last changed 2008-09-04 05:04 by brett.cannon. This issue is now closed.

File name Uploaded Description Edit
fix_findencoding.diff brett.cannon, 2008-08-19 05:25
Messages (8)
msg71397 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-19 01:19
Turns out that PyTokenizer_FindEncoding() never properly succeeds
because the tok_state used by it does not have tok->filename set, which
is an error condition in the tokenizer. This error has been masked by
the one place the function is used, imp.find_module() because a NULL
return is never checked for an error, but instead just assumes the
default source encoding suffices.
msg71398 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-19 01:20
I have not bothered to check if this exists in 2.6, but I don't see why
it would be any different.
msg71399 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-19 01:44
Turns out that the NULL return value can signal an error that manifests
itself as SyntaxError("encoding problem: with BOM") thanks to the lack
of tok->filename being set in Parser/tokenizer.c:fp_setreadl() which is
called by check_coding_spec() and assumes that since tok->encoding was
never set (because fp_setreadl() returned an error value) that it had
something to do with the BOM.

The only reason this was found is because my bootstrapping of importlib
into Py3K, at some point, triggers a PyErr_Occurred() which finally
notices the error.
msg71407 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-08-19 05:25
Attached is a patch that fixes where the error occurs. By opening the
file by either file name or file descriptor, the problem goes away. Once
this patch is accepted then PyErr_Occurred() should be added to all uses
of PyTokenizer_FindEncoding().
msg72392 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-03 16:41
I don't understand the whole decoding machinery in the tokenizer, but
the patch looks ok to me. (tested in debug mode under Linux and Windows)
msg72420 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-03 21:26
The patch also looks pretty harmless to me. :)
msg72477 - (view) Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2008-09-04 03:35
pitrou, that's because Python source code can't be correctly tokenized 
when it's encoded in few odd encodings like iso-2022 or shift-jis which 
utilizes \, (, ) and " as second byte of two-byte character sequence.

For example, '\x81\\' is HORIZONTAL BAR in shift-jis,

exec('print "\x81\\"')

fails. because of " is ignored by second byte of '\x81\\'.
msg72480 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-09-04 05:04
Committed in r66209.
Date User Action Args
2008-09-04 05:04:57brett.cannonsetstatus: open -> closed
resolution: accepted
messages: + msg72480
2008-09-04 03:35:43hyeshik.changsetnosy: + hyeshik.chang
messages: + msg72477
2008-09-03 21:26:19benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg72420
2008-09-03 16:41:40pitrousetnosy: + pitrou
messages: + msg72392
2008-08-21 20:33:50brett.cannonsetkeywords: + needs review
2008-08-21 18:35:14brett.cannonsetpriority: critical -> release blocker
2008-08-19 05:25:15brett.cannonsetfiles: + fix_findencoding.diff
keywords: + patch
messages: + msg71407
2008-08-19 02:37:05brett.cannonlinkissue3574 dependencies
2008-08-19 01:44:30brett.cannonsetmessages: + msg71399
2008-08-19 01:20:05brett.cannonsettype: behavior
messages: + msg71398
2008-08-19 01:19:38brett.cannoncreate