This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tabnanny improperly handles non-ascii source files
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, eric.araujo, meatballhat, tim.peters, vstinner, ysj.ray
Priority: normal Keywords: patch

Created on 2010-05-20 03:17 by meatballhat, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
tabnanny_detect_encoding.patch vstinner, 2010-05-20 22:33
Messages (11)
msg106129 - (view) Author: Dan Buch (meatballhat) Date: 2010-05-20 03:17
I noticed while running ``python3 -m tabnanny -v Lib/*.py`` that the process died at heapq.py.  The 0xe7 char in "François Pinard" (in the ``__about__`` attr) was the culprit.  The attached patch replaces it with '\xe7'.  Changing the encoding cookie was not necessary to make it work, but seemed like a good idea at the time (I forget if it even matters... I haven't worked much in py3k yet.)
msg106134 - (view) Author: ysj.ray (ysj.ray) Date: 2010-05-20 09:08
This is a problem with the tabnanny module: it always tries to read the Python source file as platform-dependent-encoded text, that is, it opens the file with the builtin function "open()" and no encoding parameter. It doesn't parse the encoding cookie at the beginning of the source file! So if a Python source file contains a character that cannot be decoded with that platform-dependent encoding, the tabnanny module will fail on checking that source file. Not only heapq.py, but also several other standard modules.

That platform-dependent encoding is determined in the following order:
1. os.device_encoding(fd)
2. locale.getpreferredencoding()
3. ascii
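A quick way to see that fallback in practice (a sketch; with no encoding= argument, open() delegates to the locale rather than the source file's coding cookie):

```python
import locale

# open() with no encoding= argument falls back to the locale's
# preferred encoding (e.g. gbk/cp936 on a Chinese Windows system),
# never to the coding cookie inside the file being opened.
fallback = locale.getpreferredencoding(False)
print(fallback)
```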

I wonder why tabnanny works in this way. Is this the intended behaviour?  On my platform, if I use tabnanny to check a source file which contains some Chinese characters and is encoded in 'gbk', a UnicodeDecodeError is raised.

If this is not the intended behaviour, I guess we have to change the way tabnanny reads the source file, just like the Python compiler does: first, open the file in "rb" mode, then detect the encoding using the tokenize.detect_encoding() function, then use the detected encoding to open the source file again in text mode.
msg106135 - (view) Author: ysj.ray (ysj.ray) Date: 2010-05-20 09:16
I added "tim_one" to the nosy list since I found this name in Misc/maintainers:tabnanny. Sorry if I did something improper.

If this is really a problem, I'm glad to provide a patch for it.

Thanks!
msg106137 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-05-20 11:04
PEP 8, section “encodings”, says that stdlib source code in 3.x should always use ASCII or UTF-8, without an encoding magic comment (since UTF-8 is now the default and ASCII is a subset of UTF-8); it explicitly mentions author names in comments or docstrings as the use case for UTF-8 bytes instead of escapes.

tl;dr: Don’t mangle people’s names, fix tabnanny.
msg106151 - (view) Author: Dan Buch (meatballhat) Date: 2010-05-20 13:38
Removed the patch because the fix should be made to tabnanny itself.
msg106201 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-05-20 22:29
The correct fix is to use tokenize.detect_encoding, if anyone wants to provide a patch.
msg106202 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-20 22:33
> The correct fix is to use tokenize.detect_encoding, 
> if anyone wants to provide a patch.

done :-) The attached patch opens the file in binary mode to call tokenize.detect_encoding() and then uses the detected encoding to open the file a second time (in text (unicode) mode).
msg106203 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-05-20 22:42
You should handle the case of encoding being None.
msg106204 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-20 23:02
> You should handle the case of encoding being None.

detect_encoding() never returns None for the encoding. If there is no cookie, utf-8 is returned by default.
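That default is easy to verify with an in-memory stream (a sketch):

```python
import io
import tokenize

# No BOM and no coding cookie: detect_encoding() falls back to
# "utf-8" rather than returning None.
encoding, lines = tokenize.detect_encoding(io.BytesIO(b"print('hi')\n").readline)
print(encoding)
```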
msg106205 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-05-20 23:12
2010/5/20 STINNER Victor <report@bugs.python.org>:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> You should handle the case of encoding being None.
>
> detect_encoding() never returns None for the encoding. If there is no cookie, utf8 is returned by default.

Ah, right. Looks ok then.
msg106225 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-21 10:53
Committed: r81393 (py3k), r81394 (3.1).
History
Date User Action Args
2022-04-11 14:57:01 admin set github: 53020
2010-05-21 10:53:51 vstinner set status: open -> closed
resolution: fixed
messages: + msg106225
2010-05-20 23:12:53 benjamin.peterson set messages: + msg106205
2010-05-20 23:02:32 vstinner set messages: + msg106204
2010-05-20 22:42:12 benjamin.peterson set messages: + msg106203
2010-05-20 22:33:16 vstinner set files: + tabnanny_detect_encoding.patch
nosy: + vstinner
messages: + msg106202

2010-05-20 22:29:06 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg106201
2010-05-20 13:38:33 meatballhat set messages: + msg106151
2010-05-20 13:38:21 meatballhat set files: - françois-pinard-killed-my-tabnanny.patch
2010-05-20 13:37:48 meatballhat set title: 0xe7 in ``heapq.__about__`` causes badness -> tabnanny improperly handles non-ascii source files
2010-05-20 11:04:20 eric.araujo set nosy: + eric.araujo
messages: + msg106137
2010-05-20 09:16:40 ysj.ray set nosy: + tim.peters
messages: + msg106135
2010-05-20 09:08:52 ysj.ray set nosy: + ysj.ray
messages: + msg106134
2010-05-20 03:17:48 meatballhat set type: behavior
2010-05-20 03:17:31 meatballhat create