
classification
Title: tokenizer.tokenize passes a bytes object to str.startswith
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.3
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: tokenize module should have a unicode API (issue 12486)
Assigned To:
Nosy List: Tyler.Crompton, r.david.murray
Priority: normal
Keywords:

Created on 2013-02-04 16:51 by Tyler.Crompton, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (2)
msg181349 - (view) Author: Tyler Crompton (Tyler.Crompton) Date: 2013-02-04 16:51
Line 402 in lib/python3.3/tokenize.py contains the following line:

    if first.startswith(BOM_UTF8):

BOM_UTF8 is a bytes object, and str.startswith does not accept bytes objects (see the short demonstration after the changes below). I was able to use tokenize.tokenize only after making the following changes:

Change line 402 to the following:

    if first.startswith(BOM_UTF8.decode()):

Add these two lines at line 374:

        except AttributeError:
            line_string = line

Change line 485 to the following:

            try:
                line = line.decode(encoding)
            except AttributeError:
                pass
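
For context, a short interpreter session demonstrating the mismatch the changes above work around (the sample line is illustrative, and the exact error wording may vary slightly across versions):

    >>> from codecs import BOM_UTF8
    >>> BOM_UTF8
    b'\xef\xbb\xbf'
    >>> 'import tokenize\n'.startswith(BOM_UTF8)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: startswith first arg must be str or a tuple of str, not bytes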

I do not know if these changes are correct, as I have not fully tested the module with them, but it started working for me. This is the meat of my invocation of tokenize.tokenize:

    import tokenize

    with open('example.py') as file:  # opening a file encoded as UTF-8
        for token in tokenize.tokenize(file.readline):
            print(token)

I am not suggesting that these changes are the correct fix, but I do believe that the current implementation is incorrect. I am also unsure which other versions of Python are affected by this.
msg181352 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-04 17:06
The docs could certainly be more explicit: currently they state that tokenize is *detecting* the encoding of the file, which *implies*, but does not make explicit, that the input must be binary, not text.

The doc problem will get fixed as part of the fix to issue 12486, so I'm closing this as a duplicate.  If you want to help out with a patch review and doc patch suggestions on that issue, that would be great.
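
In other words, tokenize.tokenize expects a readline callable that returns bytes, so the file in the example above needs to be opened in binary mode. A minimal sketch of the working invocation (same illustrative file name):

    import tokenize

    # Binary mode: tokenize.tokenize reads the coding declaration
    # (or UTF-8 BOM) from the raw bytes itself.
    with open('example.py', 'rb') as file:
        for token in tokenize.tokenize(file.readline):
            print(token)

For callers that already have text, tokenize.generate_tokens accepts a str-returning readline; at the time of this report it existed but was undocumented, which is part of what issue 12486 tracks.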
History
Date                 User            Action  Args
2022-04-11 14:57:41  admin           set     github: 61327
2013-02-04 17:06:59  r.david.murray  set     status: open -> closed
                                             superseder: tokenize module should have a unicode API
                                             nosy: + r.david.murray
                                             messages: + msg181352
                                             resolution: duplicate
                                             stage: resolved
2013-02-04 16:51:28  Tyler.Crompton  create