classification
Title: imp.find_module() ignores -*- coding: Latin-1 -*-
Type: behavior Stage:
Components: Interpreter Core Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: alexandre.vassalotti, brett.cannon, christian.heimes, gvanrossum
Priority: normal Keywords:

Created on 2007-10-15 01:34 by christian.heimes, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg56431 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 01:34
imp.find_module() returns an io.TextIOWrapper instance as its first value.
The encoding of the TextIOWrapper isn't set from a
-*- coding: Latin-1 -*- line, so the file is decoded as UTF-8 instead.

>>> import imp
>>> imp.find_module("heapq")
(<io.TextIOWrapper object at 0xb7c8f50c>,
'/home/heimes/dev/python/py3k/Lib/heapq.py', ('.py', 'U', 1))
>>> imp.find_module("heapq")[0].read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/heimes/dev/python/py3k/Lib/io.py", line 1224, in read
    res += decoder.decode(self.buffer.read(), True)
  File "/home/heimes/dev/python/py3k/Lib/codecs.py", line 291, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position
1428-1430: invalid data
>>> imp.find_module("heapq")[0].encoding
'UTF-8'
>>> imp.find_module("heapq")[0].readline()
'# -*- coding: Latin-1 -*-\n'
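
The same failure can be reproduced without heapq. The following is a minimal sketch, not part of the original report: the temporary file and the helper names `write_latin1_module` and `read_as` are hypothetical stand-ins. It writes a Latin-1 source file with a coding line, then reads it back first as UTF-8 (which fails the way the traceback above does) and then as Latin-1.

```python
import os
import tempfile

# Hypothetical stand-in for heapq.py: Latin-1 source with a coding line.
source = "# -*- coding: Latin-1 -*-\nname = 'Fran\u00e7ois'\n"


def write_latin1_module(text):
    """Write text to a temporary .py file encoded as Latin-1; return its path."""
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "wb") as f:
        f.write(text.encode("latin-1"))
    return path


def read_as(path, encoding):
    """Read the whole file, decoding it with the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


path = write_latin1_module(source)
try:
    try:
        read_as(path, "utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        # The byte 0xE7 followed by an ASCII letter is not valid UTF-8.
        utf8_failed = True
    latin1_text = read_as(path, "latin-1")
finally:
    os.remove(path)

print(utf8_failed)
print(latin1_text == source)
```

The coding line itself is pure ASCII, which is why readline() on the first line succeeds while read() on the whole file does not.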
msg56451 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 17:29
Can you suggest a patch?

Adding Brett Cannon to the list, possibly his import-in-python would
supersede this?
msg56453 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 17:47
> Can you suggest a patch?
> 
> Adding Brett Cannon to the list, possibly his import-in-python would
> supersede this?

No, I can't suggest a patch. I don't know how we could get the encoding
from the tokenizer or AST.

Brett is obviously the best man to fix the problem. :)

Christian
msg56457 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 18:02
> No, I can't suggest a patch. I don't know how we could get the encoding
> from the tokenizer or AST.

Try harder. :-) Look at the code that accomplishes this feat in the
regular parser...
msg56459 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 18:30
> Try harder. :-) Look at the code that accomplishes this feat in the
> regular parser...

I've already found the methods that find the encoding in
Parser/tokenizer.c: check_coding_spec() and friends.

But it seems like a waste of time to use PyTokenizer_FromFile() just to
find the encoding. *reading* Mmh ... It's not a waste of time if I can
stop the tokenizer. I think it may be possible to use the tokenizer to
get the encoding efficiently. I could read until
tok_state->read_coding_spec or tok_state->indent != 0.

Do you know a better way to stop the tokenizer when the line isn't a
special comment line "# -*-"?

Christian
msg56461 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-15 19:30
Call PyTokenizer_Get until the line number is > 2?
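
The "stop after line 2" idea can be illustrated in Python. This is only a rough analogue of the C approach discussed here, not the code that went into the tokenizer: the function name `find_coding_cookie` is hypothetical, the pattern is the one given in PEP 263, and refinements such as BOM handling are left out.

```python
import io
import re

# Coding-line pattern as given in PEP 263, applied to raw bytes.
COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_coding_cookie(stream, default="utf-8"):
    """Return the encoding declared in the first two lines of a binary stream.

    Falls back to `default` when neither line carries a coding line.
    Nothing past line 2 is ever read.
    """
    for _ in range(2):
        line = stream.readline()
        match = COOKIE_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


first_line = io.BytesIO(b"# -*- coding: Latin-1 -*-\nx = 1\n")
second_line = io.BytesIO(b"#!/usr/bin/env python\n# -*- coding: koi8-r -*-\n")
too_late = io.BytesIO(b"x = 1\ny = 2\n# -*- coding: Latin-1 -*-\n")

print(find_coding_cookie(first_line))
print(find_coding_cookie(second_line))
print(find_coding_cookie(too_late))
```

Because the scan is bounded at two lines, the cost stays constant regardless of the size of the module.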

msg56462 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-10-15 19:34
No, my work has the exact same problem.  Actually, this bug report has
confirmed for me why heapq could not be imported when I accidentally
forced all open text files to use UTF-8.  I just have not gotten around
to trying to solve this issue yet.  But since importlib just uses open()
directly it has the same problems.

Since it looks like TextIOWrapper does not let one change the encoding
after it has been set, some subclass might need to be written that
looks for the coding stanza, or else immediately stops and uses the
expected encoding (UTF-8 in the case of Py3K or ASCII for 2.6).  That, or
expose some C function that takes a file path or an open file and
returns a code object.
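
One way to sidestep the fixed encoding, sketched here under assumptions rather than taken from importlib or from the eventual fix, is to open the file in binary mode, detect the encoding first, and only then wrap it in a TextIOWrapper. The helper name `open_source` and the temporary file are hypothetical; `tokenize.detect_encoding()` is the standard library function available in later Python 3 releases, where `tokenize.open()` offers the same behaviour ready-made.

```python
import io
import os
import tempfile
import tokenize


def open_source(path):
    """Open a Python source file as text, honouring its coding line."""
    buffer = open(path, "rb")
    try:
        # detect_encoding() reads at most two lines and defaults to UTF-8.
        encoding, _ = tokenize.detect_encoding(buffer.readline)
        buffer.seek(0)  # rewind so the text wrapper sees the whole file
        return io.TextIOWrapper(buffer, encoding=encoding)
    except BaseException:
        buffer.close()
        raise


# Hypothetical Latin-1 module used only for the demonstration.
expected = "# -*- coding: Latin-1 -*-\nname = 'Fran\u00e7ois'\n"
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(expected.encode("latin-1"))

try:
    with open_source(path) as src:
        declared = src.encoding
        text = src.read()
finally:
    os.remove(path)

print(declared)
print(text == expected)
```

The encoding is decided before the wrapper exists, so nothing has to be changed after the fact.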

But I have bigger fish to fry as my attempt to get around open() being
defined in site.py is actually failing once I clobbered my .pyc files as
codecs requires importing modules, even for ASCII encoding.
msg56463 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-15 19:36
> Call PyTokenizer_Get until the line number is > 2?

That's too easy :]
I'm going to implement the fix tonight.

Christian
msg56575 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-10-19 23:22
The bug was fixed in r58553 together with
http://bugs.python.org/issue1267. Please close this bug.
History
Date                 User                 Action  Args
2022-04-11 14:56:27  admin                set     github: 45619
2007-10-19 23:36:58  gvanrossum           set     status: open -> closed
                                                  resolution: fixed
2007-10-19 23:22:26  christian.heimes     set     messages: + msg56575
2007-10-16 01:15:09  alexandre.vassalotti set     nosy: + alexandre.vassalotti
2007-10-15 19:36:01  christian.heimes     set     messages: + msg56463
2007-10-15 19:34:55  brett.cannon         set     messages: + msg56462
2007-10-15 19:30:23  gvanrossum           set     messages: + msg56461
2007-10-15 18:30:56  christian.heimes     set     messages: + msg56459
2007-10-15 18:02:59  gvanrossum           set     messages: + msg56457
2007-10-15 17:47:13  christian.heimes     set     messages: + msg56453
2007-10-15 17:29:19  gvanrossum           set     nosy: + brett.cannon, gvanrossum
                                                  messages: + msg56451
2007-10-15 01:34:28  christian.heimes     create