Issue 23296: ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67485

classification

Title:	‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Clarify error when ‘tokenize.detect_encoding’ receives text View: 23297
Assigned To:		Nosy List:	bignose
Priority:	normal	Keywords:

Created on 2015-01-22 04:40 by bignose, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (1)
msg234470 - (view)

Author: Ben Finney (bignose)

Date: 2015-01-22 04:40

In `tokenize.detect_encoding` is the following code::

    first = read_or_stop()
    if first.startswith(BOM_UTF8):
        # …

The `read_or_stop` function is defined as::

    def read_or_stop():
        try:
            return readline()
        except StopIteration:
            return b''

So, on catching ``StopIteration``, the return value will be a byte string. The `detect_encoding` code then immediately calls `startswith`, which fails::

      File "/usr/lib/python3.4/tokenize.py", line 409, in detect_encoding
        if first.startswith(BOM_UTF8):
    TypeError: startswith first arg must be str or a tuple of str, not bytes

One or both of those locations in the code is wrong. Either `read_or_stop` should never return a byte string; or `detect_encoding` should not assume it can call `startswith` on the result.
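
A minimal sketch that reproduces the traceback above, assuming the `readline` callable came from a text stream; the original report does not show how `detect_encoding` was called, so that is an assumption. The helper `try_detect` is hypothetical and exists only for this illustration. The sketch compares a `readline` that yields bytes, one that yields text, and one that raises ``StopIteration`` immediately:

```python
import io
import tokenize


def try_detect(readline):
    """Return detect_encoding's result, or the exception name if it raises TypeError."""
    try:
        return tokenize.detect_encoding(readline)
    except TypeError as exc:
        return type(exc).__name__


# readline yielding bytes: the case the b'' fallback in read_or_stop matches.
bytes_result = try_detect(io.BytesIO(b"x = 1\n").readline)

# readline yielding text (assumed trigger): startswith(BOM_UTF8) is then
# called on a str with a bytes argument.
text_result = try_detect(io.StringIO("x = 1\n").readline)

# readline raising StopIteration at once: read_or_stop returns b''.
empty_result = try_detect(iter([]).__next__)

print(bytes_result)
print(text_result)
print(empty_result)
```

If the text case is the only one that fails, the ``TypeError`` most likely comes from `first` being a text string while `BOM_UTF8` is a byte string, rather than from the `b''` fallback in `read_or_stop`.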

History
Date	User	Action	Args
2022-04-11 14:58:12	admin	set	github: 67485
2015-01-22 04:58:02	benjamin.peterson	set	status: open -> closed superseder: Clarify error when ‘tokenize.detect_encoding’ receives text resolution: duplicate
2015-01-22 04:40:12	bignose	create