This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tokenize
Type: crash Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: lkcl, serhiy.storchaka, terry.reedy
Priority: normal Keywords:

Created on 2018-08-18 10:29 by lkcl, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (9)
msg323698 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 10:29
https://github.com/hhatto/autopep8/issues/414

the following two lines of code are not parseable by tokenize.py:

co = re.compile(
            "\(")

the combination of:
* being split on two lines
* having a backslash inside quotes
* having a bracket inside quotes

is an edge-case that _tokenize cannot cope with.
msg323700 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 10:47
these two lines also pass (do not throw an exception):

co = re.compile(
            r"\(")

the code that fails may be further reduced to the following:

(
"\(")
msg323704 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:06
python 2.7 and 3.5 also have the exact same issue.
msg323705 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-18 11:21
I can't reproduce.

>>> import tokenize
>>> list(tokenize.generate_tokens(iter(['(\n', r'"\(")']).__next__))
[TokenInfo(type=53 (OP), string='(', start=(1, 0), end=(1, 1), line='(\n'), TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='(\n'), TokenInfo(type=3 (STRING), string='"\\("', start=(2, 0), end=(2, 4), line='"\\(")'), TokenInfo(type=53 (OP), string=')', start=(2, 4), end=(2, 5), line='"\\(")'), TokenInfo(type=4 (NEWLINE), string='', start=(2, 5), end=(2, 6), line=''), TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')]

Could you please provide a minimal script that reproduces your issue?
msg323707 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:37
regular expressions are not something i am familiar or comfortable
with (never have been: the patterns are too dense).  however REMOVING
"Bracket" from the regular expression(s) for PseudoToken "fixes"
the problem.

some debug print statements dropped in at around line 640 of
tokenize.py show that the match on the "working" code
with r"\(") as input gives a start/end/spos/epos that is DIFFERENT
from when the same code is given just "\(":

line 'r"\\(")\n'
pos 0 7 r <_sre.SRE_Match object; span=(0, 5), match='r"\\("'>
pseudo start/end 0 5 (2, 0) (2, 5)

vs

line '"\\(")\n'
pos 0 6 " <_sre.SRE_Match object; span=(0, 4), match='"\\("'>
pseudo start/end 0 4 (5, 0) (5, 4)

there *may* be a way to "fix" this by taking out the pattern
matching on Bracket and prioritising everything else.


        while pos < max:
            pseudomatch = _compile(PseudoToken).match(line, pos)
            print ("pos", pos, max, line[pos], pseudomatch)
            if pseudomatch:                                # scan for tokens
                start, end = pseudomatch.span(1)
                spos, epos, pos = (lnum, start), (lnum, end), end
                print ("pseudo start/end", start, end, spos, epos)
                if start == end:
                    continue

 
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'\.\.\.', r'[:;.,@]')
# REMOVE Bracket
Funny = group(Operator, Special)

PlainToken = group(Number, Funny, String, Name)
Token = Ignore + PlainToken

# First (or only) line of ' or " string.
ContStr = group(StringPrefix + r"'[^\n'\\]*(?:\\.[^\n'\\]*)*" +
                group("'", r'\\\r?\n'),
                StringPrefix + r'"[^\n"\\]*(?:\\.[^\n"\\]*)*' +
                group('"', r'\\\r?\n'))
PseudoExtras = group(r'\\\r?\n|\Z', Comment, Triple)
PseudoToken = Whitespace + group(PseudoExtras, Number, Funny, ContStr, Name)
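The span difference reported above can be reproduced outside tokenize.py with a reduced, standalone sketch of just the ContStr piece quoted here. The `group` helper mirrors the one in Lib/tokenize.py; `StringPrefix` is a simplified stand-in (the real definition enumerates the valid prefix combinations), so this is an illustration of the matching behaviour, not the module's exact pattern:

```python
import re

def group(*choices):
    # mirrors the group() helper in Lib/tokenize.py
    return '(' + '|'.join(choices) + ')'

Whitespace = r'[ \f\t]*'
StringPrefix = r'(?:[bBrRuUfF]*)'  # simplified stand-in for the real prefix set

# First (or only) line of a ' or " string, as quoted above.
ContStr = group(StringPrefix + r"'[^\n'\\]*(?:\\.[^\n'\\]*)*" +
                group("'", r'\\\r?\n'),
                StringPrefix + r'"[^\n"\\]*(?:\\.[^\n"\\]*)*' +
                group('"', r'\\\r?\n'))

# Stand-in for PseudoToken, reduced to the string alternative.
pseudo = re.compile(Whitespace + group(ContStr))

print(pseudo.match('r"\\(")\n').span(1))  # (0, 5): the r prefix is part of the token
print(pseudo.match('"\\(")\n').span(1))   # (0, 4)
```

In both cases the pattern correctly stops at the closing quote, treating `\(` as an escape inside the string; the one-character difference is just the `r` prefix, which matches the debug output above.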
msg323708 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:43
wtf??? neither can i!!!!

import io
import tokenize

text = r'''\
(
r"\(")

(
"\(")
'''

string_io = io.StringIO(text)
tokens = list(
    tokenize.generate_tokens(string_io.readline)
)

print (tokens)

works perfectly.

ok ahhhh i bet you it's something to do with how
string_io.readline works, or something to do with
the format of the text.  give me a sec to triage it.
msg323709 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:52
ahh darn-it, autopep8 is passing in the source line-by-line,
to be tokenized one line at a time.... oh and of course it's losing
state information that the tokenizer critically relies on.

i *think* that's what's going on.... so it's highly unlikely
to be a python tokenize bug... can we wait to see what the
autopep8 developer says?
msg324029 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-08-25 02:08
Luke: If and when you have legal code that fails when run with purely with CPython, you can re-open or try an new issue.
msg324054 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-25 08:56
yep good call terry, not getting any response from the
autopep8 developer, and i believe it was down to a loop
where the text is being thrown line-by-line at tokenize
and it was losing critical state information.  so...
not a bug in tokenize.
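For reference, the state-loss failure mode described above can be demonstrated with a short standalone sketch: feeding the whole reduced test case through one `generate_tokens` call works, while spinning up a fresh `generate_tokens` per line raises `TokenError`, because the first line alone is an unfinished multi-line statement:

```python
import io
import tokenize

# The reduced test case from this thread: an open paren on one line,
# a string containing an escaped paren (plus the closing paren) on the next.
text = '(\n"\\(")\n'

# Feeding the whole text through one generate_tokens call works fine.
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
print([tok.string for tok in tokens if tok.string])

# Feeding each line to a *separate* generate_tokens call loses the
# "inside parentheses" state: the first line ends at EOF with an
# unbalanced '(', so tokenize raises TokenError.
for line in text.splitlines(keepends=True):
    try:
        list(tokenize.generate_tokens(io.StringIO(line).readline))
    except tokenize.TokenError as exc:
        print('TokenError:', exc)
        break
```

This matches the conclusion above: the bug is in how the input is fed to tokenize, not in tokenize itself.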
History
Date User Action Args
2022-04-11 14:59:04  admin             set     github: 78609
2018-08-25 08:56:59  lkcl              set     messages: + msg324054
2018-08-25 02:08:02  terry.reedy       set     status: open -> closed
                                               versions: + Python 3.8, - Python 3.5
                                               nosy: + terry.reedy, serhiy.storchaka
                                               messages: + msg324029
                                               resolution: third party
                                               stage: resolved
2018-08-18 11:52:18  lkcl              set     messages: + msg323709
2018-08-18 11:43:21  lkcl              set     messages: + msg323708
2018-08-18 11:37:04  lkcl              set     nosy: - serhiy.storchaka
                                               messages: + msg323707
2018-08-18 11:21:53  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg323705
2018-08-18 11:06:33  lkcl              set     messages: + msg323704
                                               versions: + Python 2.7, Python 3.5
2018-08-18 10:47:06  lkcl              set     messages: + msg323700
2018-08-18 10:29:27  lkcl              create