This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Clarify error when ‘tokenize.detect_encoding’ receives text
Type: behavior
Stage: needs patch
Components: Documentation, Library (Lib)
Versions: Python 3.8, Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Pod, berker.peksag, bignose, docs@python, r.david.murray, vstinner
Priority: normal
Keywords:

Created on 2015-01-22 04:40 by bignose, last changed 2022-04-11 14:58 by admin.

Messages (9)
msg234471 - (view) Author: Ben Finney (bignose) Date: 2015-01-22 04:40
`tokenize.detect_encoding` contains the following code::

    first = read_or_stop()
    if first.startswith(BOM_UTF8):
        # …

The `read_or_stop` function is defined as::

    def read_or_stop():
        try:
            return readline()
        except StopIteration:
            return b''

So, on catching ``StopIteration``, the return value will be a byte string. The `detect_encoding` code then immediately calls `startswith`, which fails::

    File "/usr/lib/python3.4/tokenize.py", line 409, in detect_encoding
      if first.startswith(BOM_UTF8):
  TypeError: startswith first arg must be str or a tuple of str, not bytes

One or both of those locations in the code are wrong. Either `read_or_stop` should never return a byte string, or `detect_encoding` should not assume it can call `startswith` on the result.
msg234472 - (view) Author: Ben Finney (bignose) Date: 2015-01-22 04:41
Possibly related to issue9969.
msg234474 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-01-22 05:29
bytes does support startswith:

>>> b'abc'.startswith(b'a')
True
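
The TypeError quoted in the report points the other way, though: it is raised when `first` is str and the prefix is bytes, i.e. when readline() returns text (a quick check, with an arbitrary str standing in for `first`):

>>> 'abc'.startswith(b'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not bytes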
msg234481 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-22 08:13
I don't understand why you consider this issue to be a bug. Can you show an example where detect_encoding() raises an exception?
msg235831 - (view) Author: Pod (Pod) Date: 2015-02-12 14:47
Not the OP, but I'd call this a bug because the message is confusing from the perspective of a user of the tokenize() function. If you give tokenize a readline that returns str, you get an error message confusingly stating that something inside tokenize must be str and NOT bytes, even though the user passed in a str, not bytes. It looks like an internal bug.

Turns out it's because the contract changed from Python 2 to 3.

Personally, I'd accidentally been reading the Python 2 page for the tokenize library instead of the Python 3 one, and had been using tokenize.generate_tokens in my Python 3 code, which accepts an io.StringIO just fine. When I realised my mistake and switched to the Python 3 version of the page, I noticed generate_tokens is no longer supported, even though the code I had was working, and I noticed that the definition of tokenize had changed to match the old generate_tokens (along with a subtle change in the definition of the acceptable readline function).

So when I switched from tokenize.generate_tokens to tokenize.tokenize to try and use the library as intended, I got the same error as the OP. Perhaps the OP made a similar mistake?



To actually hit the error in question:

        $ cat -n temp.py
             1  import tokenize
             2  import io
             3
             4
             5  byte_reader = io.BytesIO(b"test bytes generate_tokens")
             6  tokens = tokenize.generate_tokens(byte_reader.readline)
             7
             8  byte_reader = io.BytesIO(b"test bytes tokenize")
             9  tokens = tokenize.tokenize(byte_reader.readline)
            10
            11  byte_reader = io.StringIO("test string generate")
            12  tokens = tokenize.generate_tokens(byte_reader.readline)
            13
            14  str_reader = io.StringIO("test string tokenize")
            15  tokens = tokenize.tokenize(str_reader.readline)
            16
            17
        
        $ python3 temp.py
        Traceback (most recent call last):
          File "temp.py", line 15, in <module>
            tokens = tokenize.tokenize(str_reader.readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 467, in tokenize
            encoding, consumed = detect_encoding(readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 409, in detect_encoding
            if first.startswith(BOM_UTF8):
        TypeError: startswith first arg must be str or a tuple of str, not bytes
msg236338 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-02-20 22:32
The error message could indeed be made clearer by turning it into a message saying that tokenize itself requires bytes input.  Or, more likely, the additional error handling needs to be in detect_encoding.
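
A minimal sketch of what such a guard inside detect_encoding could look like (the helper name and message wording are illustrative, not an actual patch):

    def read_or_stop_checked(readline):
        # Illustrative variant of read_or_stop(): reject text input
        # up front instead of letting first.startswith(BOM_UTF8)
        # raise a confusing TypeError later.
        try:
            first = readline()
        except StopIteration:
            return b''
        if not isinstance(first, bytes):
            raise TypeError(
                "readline() should return bytes, not {}; tokenize() "
                "requires a binary stream such as open(path, 'rb')"
                .format(type(first).__name__))
        return first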
msg278536 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-10-12 18:34
It looks like this can also be fixed by issue 12486.
msg341028 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-04-28 13:07
The original problem has already been solved by making tokenize.generate_tokens() public in issue 12486.

However, the same exception can be raised when tokenize.open() is used with tokenize.tokenize(), because tokenize.open() returns a text stream:

    https://github.com/python/cpython/blob/da63b321f63b697f75e7ab2f88f55d907f56c187/Lib/tokenize.py#L396

hello.py
--------

def say_hello():
    print("Hello, World!")

say_hello()


text.py
-------

import tokenize

with tokenize.open('hello.py') as f:
    token_gen = tokenize.tokenize(f.readline)
    for token in token_gen:
        print(token)

When we pass f.readline to tokenize.tokenize(), the second call to detect_encoding() (the first happens inside tokenize.open() itself) fails, because f.readline() returns str.

In Lib/test/test_tokenize.py, it seems like tokenize.open() is only tested to open a file. Its output isn't passed to tokenize.tokenize(). Most of the tests either pass the readline() method of open(..., 'rb') or io.BytesIO() to tokenize.tokenize().
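
A hypothetical regression test along those lines (sketch only; the class and method names are made up):

    import tokenize
    import unittest

    class OpenThenTokenizeTest(unittest.TestCase):
        def test_tokenize_rejects_text_readline(self):
            # tokenize.open() returns a text stream, so feeding its
            # readline to tokenize.tokenize(), which expects bytes,
            # should raise TypeError.
            with tokenize.open(__file__) as f:
                with self.assertRaises(TypeError):
                    list(tokenize.tokenize(f.readline))

    if __name__ == '__main__':
        unittest.main()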

I will submit a documentation PR that suggests using tokenize.generate_tokens() with tokenize.open().
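
For reference, that combination works because generate_tokens() accepts a str-returning readline (a sketch reusing the hello.py above):

    import tokenize

    with tokenize.open('hello.py') as f:
        # f is a text stream, so use generate_tokens(), which
        # expects str lines, rather than tokenize(), which
        # expects bytes.
        for token in tokenize.generate_tokens(f.readline):
            print(token)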
msg341038 - (view) Author: Ben Finney (bignose) Date: 2019-04-28 22:48
On 28-Apr-2019, Berker Peksag wrote:

> The original problem has already been solved by making
> tokenize.generate_tokens() public in issue 12486.

I don't understand how that would affect the resolution of this issue.

Isn't the correct resolution here going to entail correct
implementation in ‘file.readline’?
History
Date User Action Args
2022-04-11 14:58:12  admin  set  github: 67486
2019-04-28 22:48:30  bignose  set  messages: + msg341038
2019-04-28 13:07:56  berker.peksag  set  versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6; nosy: + docs@python; messages: + msg341028; assignee: docs@python; components: + Documentation
2016-10-12 18:34:18  berker.peksag  set  nosy: + berker.peksag; messages: + msg278536
2016-06-21 03:59:33  martin.panter  set  title: ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string -> Clarify error when ‘tokenize.detect_encoding’ receives text; stage: needs patch; type: crash -> behavior; versions: + Python 3.5, Python 3.6, - Python 3.4
2015-02-20 22:32:24  r.david.murray  set  messages: + msg236338
2015-02-12 14:47:47  Pod  set  nosy: + Pod; messages: + msg235831
2015-01-22 08:13:01  vstinner  set  nosy: + vstinner; messages: + msg234481
2015-01-22 05:29:07  r.david.murray  set  nosy: + r.david.murray; messages: + msg234474
2015-01-22 04:58:02  benjamin.peterson  link  issue23296 superseder
2015-01-22 04:41:05  bignose  set  messages: + msg234472
2015-01-22 04:40:26  bignose  create