
Author Pod
Recipients Pod, bignose, r.david.murray, vstinner
Date 2015-02-12.14:47:47
Message-id <1423752467.79.0.0846467890261.issue23297@psf.upfronthosting.co.za>
Content
Not the OP, but I consider this error message a bug because it's confusing from the perspective of a user of the tokenize() function. If you give tokenize() a readline() that returns a str, you get an error message confusingly stating that something inside tokenize must be a str and NOT bytes, even though the user gave it a str, not bytes. It looks like an internal bug.

Turns out it's because the contract changed from Python 2 to 3.

Personally, I'd accidentally been reading the Python 2 page for the tokenize library instead of the Python 3 one, and had been using tokenize.generate_tokens in my Python 3 code, which accepts an io.StringIO just fine. When I realised my mistake and switched to the Python 3 version of the page, I noticed that generate_tokens is no longer supported, even though the code I had was working, and that the definition of tokenize had changed to match the old generate_tokens (along with a subtle change in the definition of the acceptable readline function).

So when I switched from tokenize.generate_tokens to tokenize.tokenize to try to use the library as intended, I got the same error as the OP. Perhaps the OP made a similar mistake?



To actually hit the error in question:

        $ cat -n temp.py
             1  import tokenize
             2  import io
             3
             4
             5  byte_reader = io.BytesIO(b"test bytes generate_tokens")
             6  tokens = tokenize.generate_tokens(byte_reader.readline)
             7
             8  byte_reader = io.BytesIO(b"test bytes tokenize")
             9  tokens = tokenize.tokenize(byte_reader.readline)
            10
            11  byte_reader = io.StringIO("test string generate")
            12  tokens = tokenize.generate_tokens(byte_reader.readline)
            13
            14  str_reader = io.StringIO("test string tokenize")
            15  tokens = tokenize.tokenize(str_reader.readline)
            16
            17
        
        $ python3 temp.py
        Traceback (most recent call last):
          File "temp.py", line 15, in <module>
            tokens = tokenize.tokenize(str_reader.readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 467, in tokenize
            encoding, consumed = detect_encoding(readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 409, in detect_encoding
            if first.startswith(BOM_UTF8):
        TypeError: startswith first arg must be str or a tuple of str, not bytes
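
For comparison, here's a minimal sketch of the pairing that does work (assuming Python 3's stdlib tokenize module): tokenize.tokenize() expects a readline that returns bytes, since it runs encoding detection on them, while generate_tokens() expects one that returns str.

```python
import io
import tokenize

# bytes in -> tokenize.tokenize (it detects the source encoding itself,
# and emits an ENCODING token first)
byte_reader = io.BytesIO(b"x = 1")
for tok in tokenize.tokenize(byte_reader.readline):
    print(tok)

# str in -> tokenize.generate_tokens (no encoding detection needed)
str_reader = io.StringIO("x = 1")
for tok in tokenize.generate_tokens(str_reader.readline):
    print(tok)
```

Mixing them the other way round, as in temp.py above, is what produces the startswith TypeError.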