This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tokenize
Type: crash Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: lkcl, serhiy.storchaka, terry.reedy
Priority: normal Keywords:

Created on 2018-08-18 10:29 by lkcl, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (9)
msg323698 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 10:29
https://github.com/hhatto/autopep8/issues/414

the following two lines of code are not parseable by tokenize.py:

co = re.compile(
            "\(")

the combination of:
* being split on two lines
* having a backslash inside quotes
* having a bracket inside quotes

is an edge-case that _tokenize cannot cope with.
msg323700 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 10:47
these two lines also pass (do not throw an exception):

co = re.compile(
            r"\(")

the code that fails may be further reduced to the following:

(
"\(")
msg323704 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:06
python 2.7 and 3.5 also have the exact same issue.
msg323705 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-18 11:21
I can't reproduce.

>>> import tokenize
>>> list(tokenize.generate_tokens(iter(['(\n', r'"\(")']).__next__))
[TokenInfo(type=53 (OP), string='(', start=(1, 0), end=(1, 1), line='(\n'), TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='(\n'), TokenInfo(type=3 (STRING), string='"\\("', start=(2, 0), end=(2, 4), line='"\\(")'), TokenInfo(type=53 (OP), string=')', start=(2, 4), end=(2, 5), line='"\\(")'), TokenInfo(type=4 (NEWLINE), string='', start=(2, 5), end=(2, 6), line=''), TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')]

Could you please provide a minimal script that reproduces your issue?
msg323707 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:37
regular expressions are not something i am familiar or comfortable
with (never have been: the patterns are too dense).  however REMOVING
"Bracket" from the regular expression(s) for PseudoToken "fixes"
the problem.

some debug print statements dropped in at around line 640 of
tokenize.py show that the match on the "working" code
with r"\(") as input gives a start/end/spos/epos that is DIFFERENT
from when the same code is given just "\(":

line 'r"\\(")\n'
pos 0 7 r <_sre.SRE_Match object; span=(0, 5), match='r"\\("'>
pseudo start/end 0 5 (2, 0) (2, 5)

vs

line '"\\(")\n'
pos 0 6 " <_sre.SRE_Match object; span=(0, 4), match='"\\("'>
pseudo start/end 0 4 (5, 0) (5, 4)

there *may* be a way to "fix" this by taking out the pattern
matching on Bracket and prioritising everything else.


        while pos < max:
            pseudomatch = _compile(PseudoToken).match(line, pos)
            print ("pos", pos, max, line[pos], pseudomatch)
            if pseudomatch:                                # scan for tokens
                start, end = pseudomatch.span(1)
                spos, epos, pos = (lnum, start), (lnum, end), end
                print ("pseudo start/end", start, end, spos, epos)
                if start == end:
                    continue

 
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'\.\.\.', r'[:;.,@]')
# REMOVE Bracket
Funny = group(Operator, Special)

PlainToken = group(Number, Funny, String, Name)
Token = Ignore + PlainToken

# First (or only) line of ' or " string.
ContStr = group(StringPrefix + r"'[^\n'\\]*(?:\\.[^\n'\\]*)*" +
                group("'", r'\\\r?\n'),
                StringPrefix + r'"[^\n"\\]*(?:\\.[^\n"\\]*)*' +
                group('"', r'\\\r?\n'))
PseudoExtras = group(r'\\\r?\n|\Z', Comment, Triple)
PseudoToken = Whitespace + group(PseudoExtras, Number, Funny, ContStr, Name)
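The span difference reported above can be reproduced outside tokenize.py with a reduced, standalone sketch of just the ContStr piece quoted here. The `group` helper mirrors the one in Lib/tokenize.py; `StringPrefix` is a simplified stand-in (the real definition enumerates the valid prefix combinations), so this is an illustration of the matching behaviour, not the module's exact pattern:

```python
import re

def group(*choices):
    # mirrors the group() helper in Lib/tokenize.py
    return '(' + '|'.join(choices) + ')'

Whitespace = r'[ \f\t]*'
StringPrefix = r'(?:[bBrRuUfF]*)'  # simplified stand-in for the real prefix set

# First (or only) line of a ' or " string, as quoted above.
ContStr = group(StringPrefix + r"'[^\n'\\]*(?:\\.[^\n'\\]*)*" +
                group("'", r'\\\r?\n'),
                StringPrefix + r'"[^\n"\\]*(?:\\.[^\n"\\]*)*' +
                group('"', r'\\\r?\n'))

# Stand-in for PseudoToken, reduced to the string alternative.
pseudo = re.compile(Whitespace + group(ContStr))

print(pseudo.match('r"\\(")\n').span(1))  # (0, 5): the r prefix is part of the token
print(pseudo.match('"\\(")\n').span(1))   # (0, 4)
```

In both cases the pattern correctly stops at the closing quote, treating `\(` as an escape inside the string; the one-character difference is just the `r` prefix, which matches the debug output above.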
msg323708 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:43
wtf??? neither can i!!!!

import io
import tokenize

text = r'''\
(
r"\(")

(
"\(")
'''

string_io = io.StringIO(text)
tokens = list(
    tokenize.generate_tokens(string_io.readline)
)

print (tokens)

works perfectly.

ok ahhhh i bet you it's something to do with how
string_io.readline works, or something to do with
the format of the text.  give me a sec to triage it.
msg323709 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-18 11:52
ahh darn-it, autopep8 is passing in the source line-by-line,
to be tokenized one line at a time.... oh and of course it's losing
state information that the tokenizer critically relies on.

i *think* that's what's going on.... so it's highly unlikely
to be a python tokenize bug... can we wait to see what the
autopep8 developer says?
msg324029 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-08-25 02:08
Luke: If and when you have legal code that fails when run with purely with CPython, you can re-open or try an new issue.
msg324054 - (view) Author: Luke Kenneth Casson Leighton (lkcl) Date: 2018-08-25 08:56
yep good call terry, not getting any response from the
autopep8 developer, and i believe it was down to a loop
where the text is being thrown line-by-line at tokenize
and it was losing critical state information.  so...
not a bug in tokenize.
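For reference, the state-loss failure mode described above can be demonstrated with a short standalone sketch: feeding the whole reduced test case through one `generate_tokens` call works, while spinning up a fresh `generate_tokens` per line raises `TokenError`, because the first line alone is an unfinished multi-line statement:

```python
import io
import tokenize

# The reduced test case from this thread: an open paren on one line,
# a string containing an escaped paren (plus the closing paren) on the next.
text = '(\n"\\(")\n'

# Feeding the whole text through one generate_tokens call works fine.
tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))
print([tok.string for tok in tokens if tok.string])

# Feeding each line to a *separate* generate_tokens call loses the
# "inside parentheses" state: the first line ends at EOF with an
# unbalanced '(', so tokenize raises TokenError.
for line in text.splitlines(keepends=True):
    try:
        list(tokenize.generate_tokens(io.StringIO(line).readline))
    except tokenize.TokenError as exc:
        print('TokenError:', exc)
        break
```

This matches the conclusion above: the bug is in how the input is fed to tokenize, not in tokenize itself.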
History
Date User Action Args
2022-04-11 14:59:04  admin             set     github: 78609
2018-08-25 08:56:59  lkcl              set     messages: + msg324054
2018-08-25 02:08:02  terry.reedy       set     status: open -> closed
                                               versions: + Python 3.8, - Python 3.5
                                               nosy: + terry.reedy, serhiy.storchaka
                                               messages: + msg324029
                                               resolution: third party
                                               stage: resolved
2018-08-18 11:52:18  lkcl              set     messages: + msg323709
2018-08-18 11:43:21  lkcl              set     messages: + msg323708
2018-08-18 11:37:04  lkcl              set     nosy: - serhiy.storchaka
                                               messages: + msg323707
2018-08-18 11:21:53  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg323705
2018-08-18 11:06:33  lkcl              set     messages: + msg323704
                                               versions: + Python 2.7, Python 3.5
2018-08-18 10:47:06  lkcl              set     messages: + msg323700
2018-08-18 10:29:27  lkcl              create