This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Clarify error when ‘tokenize.detect_encoding’ receives text
Type: behavior
Stage: needs patch
Components: Documentation, Library (Lib)
Versions: Python 3.8, Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Pod, berker.peksag, bignose, docs@python, r.david.murray, vstinner
Priority: normal
Keywords:

Created on 2015-01-22 04:40 by bignose, last changed 2022-04-11 14:58 by admin.

Messages (9)
msg234471 - (view) Author: Ben Finney (bignose) Date: 2015-01-22 04:40
`tokenize.detect_encoding` contains the following code::

    first = read_or_stop()
    if first.startswith(BOM_UTF8):
        # …

The `read_or_stop` function is defined as::

    def read_or_stop():
        try:
            return readline()
        except StopIteration:
            return b''

So, on catching ``StopIteration``, the return value will be a byte string. The `detect_encoding` code then immediately calls `startswith`, which fails::

    File "/usr/lib/python3.4/tokenize.py", line 409, in detect_encoding
      if first.startswith(BOM_UTF8):
  TypeError: startswith first arg must be str or a tuple of str, not bytes

One or both of those locations in the code are wrong. Either `read_or_stop` should never return a byte string, or `detect_encoding` should not assume it can call `startswith` on the result.
msg234472 - (view) Author: Ben Finney (bignose) Date: 2015-01-22 04:41
Possibly related to issue9969.
msg234474 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-01-22 05:29
bytes does support startswith:

>>> b'abc'.startswith(b'a')
True
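
The TypeError quoted in the report points the other way, though: it is raised when `first` is str and the prefix is bytes, i.e. when readline() returns text (a quick check, with an arbitrary str standing in for `first`):

>>> 'abc'.startswith(b'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not bytes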
msg234481 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-22 08:13
I don't understand why you consider this issue to be a bug. Can you show an example where detect_encoding() raises an exception?
msg235831 - (view) Author: Pod (Pod) Date: 2015-02-12 14:47
Not the OP, but I'd call this a bug because the message is confusing from the perspective of a user of the tokenize() function. If you give tokenize a readline that returns str, you get an error message confusingly stating that something inside tokenize must be str and NOT bytes, even though the user passed in a str, not bytes. It looks like an internal bug.

Turns out it's because the contract changed from Python 2 to 3.

Personally, I'd accidentally been reading the Python 2 page for the tokenize library instead of the Python 3 one, and had been using tokenize.generate_tokens in my Python 3 code, which accepts an io.StringIO just fine. When I realised my mistake and switched to the Python 3 version of the page, I noticed generate_tokens is no longer supported, even though the code I had was working, and I noticed that the definition of tokenize had changed to match the old generate_tokens (along with a subtle change in the definition of the acceptable readline function).

So when I switched from tokenize.generate_tokens to tokenize.tokenize to try and use the library as intended, I got the same error as the OP. Perhaps the OP made a similar mistake?



To actually hit the error in question:

        $ cat -n temp.py
             1  import tokenize
             2  import io
             3
             4
             5  byte_reader = io.BytesIO(b"test bytes generate_tokens")
             6  tokens = tokenize.generate_tokens(byte_reader.readline)
             7
             8  byte_reader = io.BytesIO(b"test bytes tokenize")
             9  tokens = tokenize.tokenize(byte_reader.readline)
            10
            11  byte_reader = io.StringIO("test string generate")
            12  tokens = tokenize.generate_tokens(byte_reader.readline)
            13
            14  str_reader = io.StringIO("test string tokenize")
            15  tokens = tokenize.tokenize(str_reader.readline)
            16
            17
        
        $ python3 temp.py
        Traceback (most recent call last):
          File "temp.py", line 15, in <module>
            tokens = tokenize.tokenize(str_reader.readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 467, in tokenize
            encoding, consumed = detect_encoding(readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 409, in detect_encoding
            if first.startswith(BOM_UTF8):
        TypeError: startswith first arg must be str or a tuple of str, not bytes
msg236338 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-02-20 22:32
The error message could indeed be made clearer by turning it into a message saying that tokenize itself requires bytes input.  Or, more likely, the additional error handling needs to be in detect_encoding.
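
A minimal sketch of what such a guard inside detect_encoding could look like (the helper name and message wording are illustrative, not an actual patch):

    def read_or_stop_checked(readline):
        # Illustrative variant of read_or_stop(): reject text input
        # up front instead of letting first.startswith(BOM_UTF8)
        # raise a confusing TypeError later.
        try:
            first = readline()
        except StopIteration:
            return b''
        if not isinstance(first, bytes):
            raise TypeError(
                "readline() should return bytes, not {}; tokenize() "
                "requires a binary stream such as open(path, 'rb')"
                .format(type(first).__name__))
        return first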
msg278536 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-10-12 18:34
It looks like this can also be fixed by issue 12486.
msg341028 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-04-28 13:07
The original problem has already been solved by making tokenize.generate_tokens() public in issue 12486.

However, the same exception can be raised when tokenize.open() is used with tokenize.tokenize(), because tokenize.open() returns a text stream:

    https://github.com/python/cpython/blob/da63b321f63b697f75e7ab2f88f55d907f56c187/Lib/tokenize.py#L396

hello.py
--------

def say_hello():
    print("Hello, World!")

say_hello()


text.py
-------

import tokenize

with tokenize.open('hello.py') as f:
    token_gen = tokenize.tokenize(f.readline)
    for token in token_gen:
        print(token)

When we pass f.readline to tokenize.tokenize(), the second call to detect_encoding() (the first happens inside tokenize.open() itself) fails, because f.readline() returns str.

In Lib/test/test_tokenize.py, it seems like tokenize.open() is only tested to open a file. Its output isn't passed to tokenize.tokenize(). Most of the tests either pass the readline() method of open(..., 'rb') or io.BytesIO() to tokenize.tokenize().
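
A hypothetical regression test along those lines (sketch only; the class and method names are made up):

    import tokenize
    import unittest

    class OpenThenTokenizeTest(unittest.TestCase):
        def test_tokenize_rejects_text_readline(self):
            # tokenize.open() returns a text stream, so feeding its
            # readline to tokenize.tokenize(), which expects bytes,
            # should raise TypeError.
            with tokenize.open(__file__) as f:
                with self.assertRaises(TypeError):
                    list(tokenize.tokenize(f.readline))

    if __name__ == '__main__':
        unittest.main()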

I will submit a documentation PR that suggests using tokenize.generate_tokens() with tokenize.open().
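
For reference, that combination works because generate_tokens() accepts a str-returning readline (a sketch reusing the hello.py above):

    import tokenize

    with tokenize.open('hello.py') as f:
        # f is a text stream, so use generate_tokens(), which
        # expects str lines, rather than tokenize(), which
        # expects bytes.
        for token in tokenize.generate_tokens(f.readline):
            print(token)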
msg341038 - (view) Author: Ben Finney (bignose) Date: 2019-04-28 22:48
On 28-Apr-2019, Berker Peksag wrote:

> The original problem has already been solved by making
> tokenize.generate_tokens() public in issue 12486.

I don't understand how that would affect the resolution of this issue.

Isn't the correct resolution here going to entail correct
implementation in ‘file.readline’?
History
Date User Action Args
2022-04-11 14:58:12  admin  set  github: 67486
2019-04-28 22:48:30  bignose  set  messages: + msg341038
2019-04-28 13:07:56  berker.peksag  set  versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6; nosy: + docs@python; messages: + msg341028; assignee: docs@python; components: + Documentation
2016-10-12 18:34:18  berker.peksag  set  nosy: + berker.peksag; messages: + msg278536
2016-06-21 03:59:33  martin.panter  set  title: ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string -> Clarify error when ‘tokenize.detect_encoding’ receives text; stage: needs patch; type: crash -> behavior; versions: + Python 3.5, Python 3.6, - Python 3.4
2015-02-20 22:32:24  r.david.murray  set  messages: + msg236338
2015-02-12 14:47:47  Pod  set  nosy: + Pod; messages: + msg235831
2015-01-22 08:13:01  vstinner  set  nosy: + vstinner; messages: + msg234481
2015-01-22 05:29:07  r.david.murray  set  nosy: + r.david.murray; messages: + msg234474
2015-01-22 04:58:02  benjamin.peterson  link  issue23296 superseder
2015-01-22 04:41:05  bignose  set  messages: + msg234472
2015-01-22 04:40:26  bignose  create