This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tabnanny improperly handles non-ascii source files
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, eric.araujo, meatballhat, tim.peters, vstinner, ysj.ray
Priority: normal Keywords: patch

Created on 2010-05-20 03:17 by meatballhat, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
tabnanny_detect_encoding.patch vstinner, 2010-05-20 22:33
Messages (11)
msg106129 - (view) Author: Dan Buch (meatballhat) Date: 2010-05-20 03:17
I noticed while running ``python3 -m tabnanny -v Lib/*.py`` that the process died at heapq.py.  The 0xe7 char in "François Pinard" (in the ``__about__`` attr) was the culprit.  The attached patch replaces it with '\xe7'.  Changing the encoding cookie was not necessary to make it work, but seemed like a good idea at the time (I forget if it even matters... I haven't worked much in py3k yet.)
msg106134 - (view) Author: ysj.ray (ysj.ray) Date: 2010-05-20 09:08
This is a problem with the tabnanny module: it always tries to read the Python source file as platform-dependent-encoded text, that is, it opens the file with the builtin function "open()" and no encoding parameter. It doesn't parse the encoding cookie at the beginning of the source file! So if a Python source file contains a character that cannot be decoded with that platform-dependent encoding, the tabnanny module will fail on checking that source file. Not only heapq.py, but also several other standard modules.

That platform-dependent encoding is determined in the following order:
1. os.device_encoding(fd)
2. locale.getpreferredencoding()
3. ascii
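A quick way to see that fallback in practice (a sketch; with no encoding= argument, open() delegates to the locale rather than the source file's coding cookie):

```python
import locale

# open() with no encoding= argument falls back to the locale's
# preferred encoding (e.g. gbk/cp936 on a Chinese Windows system),
# never to the coding cookie inside the file being opened.
fallback = locale.getpreferredencoding(False)
print(fallback)
```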

I wonder why tabnanny works in this way. Is this the intended behaviour?  On my platform, if I use tabnanny to check a source file which contains some Chinese characters and is encoded in 'gbk', a UnicodeDecodeError is raised.

If this is not the intended behaviour, I guess we have to change the way tabnanny reads the source file, just like the Python compiler does: first, open the file in "rb" mode, then detect the encoding using the tokenize.detect_encoding() function, then use the detected encoding to open the source file again in text mode.
msg106135 - (view) Author: ysj.ray (ysj.ray) Date: 2010-05-20 09:16
I added "tim_one" to the nosy list since I found this name in Misc/maintainers:tabnanny. Sorry if I did something improper.

If this is really a problem, I'm glad to provide a patch for it.

Thanks!
msg106137 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-05-20 11:04
PEP 8, section “encodings”, says that stdlib source code in 3.x should always use ASCII or UTF-8, without an encoding magic comment (since UTF-8 is now the default and ASCII is a subset of UTF-8); it explicitly mentions author names in comments or docstrings as the use case for UTF-8 bytes instead of escapes.

tl;dr: Don’t mangle people’s names, fix tabnanny.
msg106151 - (view) Author: Dan Buch (meatballhat) Date: 2010-05-20 13:38
Removed the patch because the fix should be made to tabnanny itself.
msg106201 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-05-20 22:29
The correct fix is to use tokenize.detect_encoding, if anyone wants to provide a patch.
msg106202 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-20 22:33
> The correct fix is to use tokenize.detect_encoding, 
> if anyone wants to provide a patch.

done :-) The attached patch opens the file in binary mode to call tokenize.detect_encoding() and then uses the detected encoding to open the file a second time (in text (unicode) mode).
msg106203 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-05-20 22:42
You should handle the case of encoding being None.
msg106204 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-20 23:02
> You should handle the case of encoding being None.

detect_encoding() never returns None for the encoding. If there is no cookie, utf-8 is returned by default.
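That default is easy to verify with an in-memory stream (a sketch):

```python
import io
import tokenize

# No BOM and no coding cookie: detect_encoding() falls back to
# "utf-8" rather than returning None.
encoding, lines = tokenize.detect_encoding(io.BytesIO(b"print('hi')\n").readline)
print(encoding)
```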
msg106205 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-05-20 23:12
2010/5/20 STINNER Victor <report@bugs.python.org>:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> You should handle the case of encoding being None.
>
> detect_encoding() never returns None for the encoding. If there is no cookie, utf8 is returned by default.

Ah, right. Looks ok then.
msg106225 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-21 10:53
Committed: r81393 (py3k), r81394 (3.1).
History
Date User Action Args
2022-04-11 14:57:01 admin set github: 53020
2010-05-21 10:53:51 vstinner set status: open -> closed
resolution: fixed
messages: + msg106225
2010-05-20 23:12:53 benjamin.peterson set messages: + msg106205
2010-05-20 23:02:32 vstinner set messages: + msg106204
2010-05-20 22:42:12 benjamin.peterson set messages: + msg106203
2010-05-20 22:33:16 vstinner set files: + tabnanny_detect_encoding.patch
nosy: + vstinner
messages: + msg106202

2010-05-20 22:29:06 benjamin.peterson set nosy: + benjamin.peterson
messages: + msg106201
2010-05-20 13:38:33 meatballhat set messages: + msg106151
2010-05-20 13:38:21 meatballhat set files: - françois-pinard-killed-my-tabnanny.patch
2010-05-20 13:37:48 meatballhat set title: 0xe7 in ``heapq.__about__`` causes badness -> tabnanny improperly handles non-ascii source files
2010-05-20 11:04:20 eric.araujo set nosy: + eric.araujo
messages: + msg106137
2010-05-20 09:16:40 ysj.ray set nosy: + tim.peters
messages: + msg106135
2010-05-20 09:08:52 ysj.ray set nosy: + ysj.ray
messages: + msg106134
2010-05-20 03:17:48 meatballhat set type: behavior
2010-05-20 03:17:31 meatballhat create