tokenize: mishandles line joining #46433

jaredgrubb · 2008-02-25T01:55:51Z

BPO	2180
Nosy	@rhettinger, @gpshead, @meadori, @asottile, @miss-islington
PRs	bpo-2180: Treat line continuation at EOF as a `SyntaxError` #13401

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/gpshead'
closed_at = <Date 2019-05-18.21:02:56.115>
created_at = <Date 2008-02-25.01:55:51.416>
labels = ['extension-modules', 'type-bug', '3.8', '3.9']
title = 'tokenize: mishandles line joining'
updated_at = <Date 2019-05-18.21:02:56.114>
user = 'https://bugs.python.org/jaredgrubb'

bugs.python.org fields:

activity = <Date 2019-05-18.21:02:56.114>
actor = 'gregory.p.smith'
assignee = 'gregory.p.smith'
closed = True
closed_date = <Date 2019-05-18.21:02:56.115>
closer = 'gregory.p.smith'
components = ['Extension Modules']
creation = <Date 2008-02-25.01:55:51.416>
creator = 'jaredgrubb'
dependencies = []
files = []
hgrepos = []
issue_num = 2180
keywords = ['patch']
message_count = 8.0
messages = ['62956', '62960', '116977', '116985', '143716', '339576', '342807', '342817']
nosy_count = 7.0
nosy_names = ['jhylton', 'rhettinger', 'gregory.p.smith', 'jaredgrubb', 'meador.inge', 'Anthony Sottile', 'miss-islington']
pr_nums = ['13401']
priority = 'normal'
resolution = 'fixed'
stage = 'commit review'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue2180'
versions = ['Python 3.8', 'Python 3.9']


jaredgrubb · 2008-02-25T01:59:16Z

tokenize does not handle line joining properly, as the following string
fails the CPython tokenizer but passes the tokenize module.

Example 1:
>>> s = "if 1:\n  \\\n  #hey\n  print 1"
>>> exec s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 3
    #hey
       ^
SyntaxError: invalid syntax

>>> tokenize.tokenize(StringIO(s).readline)
1,0-1,2:	NAME	'if'
1,3-1,4:	NUMBER	'1'
1,4-1,5:	OP	':'
1,5-1,6:	NEWLINE	'\n'
2,0-2,2:	INDENT	'  '
3,2-3,6:	COMMENT	'#hey'
3,6-3,7:	NEWLINE	'\n'
4,2-4,7:	NAME	'print'
4,8-4,9:	NUMBER	'1'
5,0-5,0:	DEDENT	''
5,0-5,0:	ENDMARKER	''
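For readers on Python 3, the comparison above can be approximated with the sketch below. It is not part of the original report: the helper names `compiles` and `tokenizes` are hypothetical, `print 1` is rewritten as `print(1)`, and the outcome depends on the interpreter version, so no result is claimed here.

```python
import io
import tokenize


def compiles(source):
    """Return True if the compiler accepts source, False on SyntaxError."""
    try:
        compile(source, "<string>", "exec")
    except SyntaxError:
        return False
    return True


def tokenizes(source):
    """Return True if the tokenize module consumes source without error."""
    try:
        for _ in tokenize.generate_tokens(io.StringIO(source).readline):
            pass
    except (tokenize.TokenError, SyntaxError):
        return False
    return True


# Example 1 from the report, with print written as a function call.
s = "if 1:\n  \\\n  #hey\n  print(1)\n"
print("compile accepts: ", compiles(s))
print("tokenize accepts:", tokenizes(s))
```

If the two lines print different values, the interpreter and the tokenize module disagree about this input, which is the discrepancy being reported.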

jaredgrubb · 2008-02-25T02:22:29Z

CPython allows \ at EOF, but tokenize does not.

>>> s = 'print 1\\\n'
>>> exec s
1
>>> tokenize.tokenize(StringIO(s).readline)
1,0-1,5:	NAME	'print'
1,6-1,7:	NUMBER	'1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/tokenize.py",
line 153, in tokenize
    tokenize_loop(readline, tokeneater)
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/tokenize.py",
line 159, in tokenize_loop
    for token_info in generate_tokens(readline):
  File
"/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/tokenize.py",
line 283, in generate_tokens
    raise TokenError, ("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))
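A rough Python 3 equivalent of the session above, which captures the tokenize error instead of letting the traceback escape. The helper name `tokenize_error` is hypothetical, and the exact message and position may differ between Python versions.

```python
import io
import tokenize


def tokenize_error(source):
    """Return (message, (line, col)) if tokenize rejects source, else None."""
    try:
        for _ in tokenize.generate_tokens(io.StringIO(source).readline):
            pass
    except tokenize.TokenError as exc:
        # TokenError carries a message and a (line, column) position.
        return exc.args[0], exc.args[1]
    except SyntaxError as exc:
        # Some versions may surface SyntaxError subclasses directly.
        return exc.msg, (exc.lineno, exc.offset)
    return None


# Backslash continuation as the last thing in the input.
s = "print(1)\\\n"
print(tokenize_error(s))
```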

BreamoreBoy · 2010-09-20T21:26:23Z

Nobody appears to be interested, so I'll close this in a couple of weeks unless someone objects or provides a patch.

rhettinger · 2010-09-20T21:51:51Z

Mark, please stop closing these based on age.
There needs to be a determination whether this
is a valid bug. If so, then a patch is needed.
If not, it can be closed.

meadori · 2011-09-08T01:39:11Z

That syntax error is coming from the CPython parser and *not* the tokenizer. Both CPython and the 'tokenize' module produce the same tokenization:

[meadori@motherbrain cpython]$ cat repro.py
if 1:
  \

  pass
[meadori@motherbrain cpython]$ ./python tokenize.py repro.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,2: NAME 'if'
1,3-1,4: NUMBER '1'
1,4-1,5: OP ':'
1,5-1,6: NEWLINE '\n'
2,0-2,2: INDENT '  '
3,0-3,1: NEWLINE '\n'
4,2-4,6: NAME 'pass'
4,6-4,7: NEWLINE '\n'
5,0-5,0: DEDENT ''
5,0-5,0: ENDMARKER ''
[44319 refs]
[meadori@motherbrain cpython]$ ./python -d repro.py | grep Token | tail -10
File "repro.py", line 3

SyntaxError: invalid syntax
[44305 refs]
Token NEWLINE/'' ... It's a token we know
Token DEDENT/'' ... It's a token we know
Token NEWLINE/'' ... It's a token we know
Token ENDMARKER/'' ... It's a token we know
Token NAME/'if' ... It's a keyword
Token NUMBER/'1' ... It's a token we know
Token COLON/':' ... It's a token we know
Token NEWLINE/'' ... It's a token we know
Token INDENT/'' ... It's a token we know
Token NEWLINE/'' ... It's a token we know

The NEWLINE INDENT NEWLINE tokenization causes the parser to choke, because the grammar [1] defines 'suite' nonterminals as:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

so NEWLINE INDENT must be followed by a statement, not by another NEWLINE.

It seems appropriate that the NEWLINE after INDENT should be dropped by both tokenizers. In other words, I think:
"""
if 1:
  \

  pass
"""

should produce the same tokenization as:

"""
if 1:

  pass
"""

This seems consistent with how explicit line joining is defined [2].

[1] http://hg.python.org/cpython/file/92842e347d98/Grammar/Grammar
[2] http://docs.python.org/reference/lexical_analysis.html#explicit-line-joining
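One way to check the proposal above is to compare the token type sequences that the tokenize module yields for the two sources. This is only a sketch: `token_names` is a hypothetical helper, the two-space indentation is assumed from the token positions in the output above, and the sequences (or errors) will vary by Python version.

```python
import io
import tokenize


def token_names(source):
    """Return (names, error): token type names seen, plus an error string or None."""
    names = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            names.append(tokenize.tok_name[tok.type])
    except (tokenize.TokenError, SyntaxError) as exc:
        # Keep whatever was tokenized before the failure.
        return names, f"{type(exc).__name__}: {exc}"
    return names, None


# Indentation of two spaces is an assumption based on the reported columns.
with_backslash = "if 1:\n  \\\n\n  pass\n"
without_backslash = "if 1:\n\n  pass\n"

for label, src in [("with backslash", with_backslash),
                   ("without backslash", without_backslash)]:
    names, error = token_names(src)
    print(label, names, error)
```

Under the proposal, the two printed sequences would match apart from tokens that the parser ignores anyway.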

asottile · 2019-04-07T14:32:44Z

Here's an example in the wild which still reproduces with python3.8a3:

https://github.com/SecureAuthCorp/impacket/blob/194b22ed2fc85c4f241375fb7ebe4e0d89626c8c/impacket/examples/remcomsvc.py#L1669

This was reported as a bug on flake8:

https://gitlab.com/pycqa/flake8/issues/532

Here's the reproduction with python3.8:

$ python3.8 --version --version
Python 3.8.0a3 (default, Mar 27 2019, 03:46:44) 
[GCC 7.3.0]
$ python3.8 impacket/examples/remcomsvc.py 
$ python3.8 -mtokenize impacket/examples/remcomsvc.py 
impacket/examples/remcomsvc.py:1670:0: error: EOF in multi-line statement
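The same check can be run from Python rather than through `python -m tokenize`. The sketch below uses a hypothetical helper, `tokenize_file`, and a temporary file whose name and contents are made up for illustration. It formats errors in a style similar to the command-line output above; what it reports for a trailing backslash depends on the Python version.

```python
import os
import tempfile
import tokenize


def tokenize_file(path):
    """Return None if path tokenizes cleanly, else 'path:line:col: error: msg'."""
    try:
        # tokenize.open honours the file's encoding declaration.
        with tokenize.open(path) as f:
            for _ in tokenize.generate_tokens(f.readline):
                pass
    except tokenize.TokenError as exc:
        msg = exc.args[0]
        line, col = exc.args[1]
        return f"{path}:{line}:{col}: error: {msg}"
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}:{exc.offset}: error: {exc.msg}"
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file ending in a backslash continuation.
    path = os.path.join(tmp, "trailing_backslash.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("x = 1\\\n")
    print(tokenize_file(path))
```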

miss-islington · 2019-05-18T18:27:30Z

New changeset abea73b by Miss Islington (bot) (Anthony Sottile) in branch 'master':
bpo-2180: Treat line continuation at EOF as a SyntaxError (GH-13401)
abea73b

gpshead · 2019-05-18T21:02:56Z

Thanks for figuring this one out Anthony! :)

jaredgrubb mannequin added the extension-modules and type-bug labels Feb 25, 2008

jafo mannequin assigned jhylton Mar 20, 2008

rhettinger unassigned jhylton Sep 20, 2010

asottile mannequin added the 3.8 and 3.9 labels Apr 7, 2019

gpshead self-assigned this May 18, 2019

gpshead closed this as completed May 18, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022
