Title: tokenize: mishandles line joining
Type: behavior Stage: commit review
Components: Extension Modules Versions: Python 3.9, Python 3.8
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Anthony Sottile, gregory.p.smith, jaredgrubb, jhylton, meador.inge, miss-islington, rhettinger
Priority: normal Keywords: patch

Created on 2008-02-25 01:55 by jaredgrubb, last changed 2022-04-11 14:56 by admin. This issue is now closed.

PR 13401 merged Anthony Sottile, 2019-05-18 01:39
Messages (8)
msg62956 - (view) Author: Jared Grubb (jaredgrubb) Date: 2008-02-25 01:59
tokenize does not handle line joining properly, as the following string
fails the CPython tokenizer but passes the tokenize module.

Example 1:
>>> s = "if 1:\n  \\\n  #hey\n  print 1"
>>> exec s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 3
SyntaxError: invalid syntax

>>> tokenize.tokenize(StringIO(s).readline)
1,0-1,2:	NAME	'if'
1,3-1,4:	NUMBER	'1'
1,4-1,5:	OP	':'
1,5-1,6:	NEWLINE	'\n'
2,0-2,2:	INDENT	'  '
3,2-3,6:	COMMENT	'#hey'
3,6-3,7:	NEWLINE	'\n'
4,2-4,7:	NAME	'print'
4,8-4,9:	NUMBER	'1'
5,0-5,0:	DEDENT	''
5,0-5,0:	ENDMARKER	''
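
For readers on Python 3, the same comparison can be sketched roughly as follows. This is an illustrative helper, not code from the report: the name `compare` is hypothetical, and `print 1` is rewritten as `print(1)`. The outcome for the reported string may differ between Python versions, so no expected output is shown.

```python
import io
import tokenize


def compare(source):
    """Return (compile_result, tokenize_result) for a source string.

    compile_result is "ok" or the name of the exception raised.
    tokenize_result is a list of token type names or the name of the
    exception raised.
    """
    try:
        compile(source, "<string>", "exec")
        compiled = "ok"
    except SyntaxError as exc:  # also covers IndentationError
        compiled = type(exc).__name__

    try:
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        tokenized = [tokenize.tok_name[tok.type] for tok in tokens]
    except (tokenize.TokenError, SyntaxError) as exc:
        tokenized = type(exc).__name__

    return compiled, tokenized


# Python 3 spelling of the string from the report (an assumption of this sketch).
s = "if 1:\n  \\\n  #hey\n  print(1)\n"
print(compare(s))
```

If the two results disagree (one is an exception name and the other is not), the compiler and the tokenize module treat the input differently.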
msg62960 - (view) Author: Jared Grubb (jaredgrubb) Date: 2008-02-25 02:22
CPython allows \ at EOF, but tokenize does not.

>>> s = 'print 1\\\n'
>>> exec s
>>> tokenize.tokenize(StringIO(s).readline)
1,0-1,5:	NAME	'print'
1,6-1,7:	NUMBER	'1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
line 153, in tokenize
    tokenize_loop(readline, tokeneater)
line 159, in tokenize_loop
    for token_info in generate_tokens(readline):
line 283, in generate_tokens
    raise TokenError, ("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))
msg116977 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-20 21:26
Nobody appears to be interested, so I'll close this in a couple of weeks unless someone objects or a patch is provided.
msg116985 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-09-20 21:51
Mark, please stop closing these based on age.
There needs to be a determination whether this
is a valid bug.  If so, then a patch is needed.
If not, it can be closed.
msg143716 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-09-08 01:39
That syntax error is coming from the CPython parser and *not* the tokenizer.  Both CPython and the 'tokenize' module produce the same tokenization:

[meadori@motherbrain cpython]$ cat
if 1:

[meadori@motherbrain cpython]$ ./python 
0,0-0,0:        ENCODING        'utf-8'
1,0-1,2:        NAME            'if'
1,3-1,4:        NUMBER          '1'
1,4-1,5:        OP              ':'
1,5-1,6:        NEWLINE         '\n'
2,0-2,2:        INDENT          '  '
3,0-3,1:        NEWLINE         '\n'
4,2-4,6:        NAME            'pass'
4,6-4,7:        NEWLINE         '\n'
5,0-5,0:        DEDENT          ''
5,0-5,0:        ENDMARKER       ''
[44319 refs]
[meadori@motherbrain cpython]$ ./python -d | grep Token | tail -10
  File "", line 3
SyntaxError: invalid syntax
[44305 refs]
Token NEWLINE/'' ... It's a token we know
Token DEDENT/'' ... It's a token we know
Token NEWLINE/'' ... It's a token we know
Token ENDMARKER/'' ... It's a token we know
Token NAME/'if' ... It's a keyword
Token NUMBER/'1' ... It's a token we know
Token COLON/':' ... It's a token we know
Token NEWLINE/'' ... It's a token we know
Token INDENT/'' ... It's a token we know
Token NEWLINE/'' ... It's a token we know

The NEWLINE INDENT NEWLINE tokenization causes the parser to choke because 'suite' nonterminals:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

are defined as NEWLINE INDENT followed by one or more statements, so a NEWLINE directly after INDENT cannot match.

It seems appropriate that the NEWLINE after INDENT should be dropped by both tokenizers.  In other words, I think:
if 1:


should produce the same tokenization as:

if 1:

This seems consistent with how explicit line joining is defined [2].
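
One way to test whether two inputs "produce the same tokenization" is to compare their token types while ignoring tokens the parser never sees. The sketch below is only an illustration of that comparison: the helper name `significant_types` is hypothetical, and the two strings are stand-ins that use a blank line, not the inputs from this message.

```python
import io
import tokenize

# Token types that the tokenize module emits but the grammar does not use.
IGNORED = {tokenize.NL, tokenize.COMMENT}


def significant_types(source):
    """Return the names of the grammatically significant token types."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tokenize.tok_name[tok.type] for tok in tokens if tok.type not in IGNORED]


# Hypothetical inputs: the same block with and without a blank line.
with_blank_line = "if 1:\n\n  pass\n"
without_blank_line = "if 1:\n  pass\n"

print(significant_types(with_blank_line) == significant_types(without_blank_line))
```

The same helper could be applied to a pair of inputs that differ only by a backslash continuation line to see whether an extra NEWLINE appears after INDENT.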

msg339576 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2019-04-07 14:32
Here's an example in the wild which still reproduces with python3.8a3:

This was reported as a bug on flake8:

Here's the reproduction with python3.8:

$ python3.8 --version --version
Python 3.8.0a3 (default, Mar 27 2019, 03:46:44) 
[GCC 7.3.0]
$ python3.8 impacket/examples/ 
$ python3.8 -mtokenize impacket/examples/ 
impacket/examples/ error: EOF in multi-line statement
msg342807 - (view) Author: miss-islington (miss-islington) Date: 2019-05-18 18:27
New changeset abea73bf4a320ff658c9a98fef3d948a142e61a9 by Miss Islington (bot) (Anthony Sottile) in branch 'master':
bpo-2180: Treat line continuation at EOF as a `SyntaxError` (GH-13401)
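
The effect of this change can be checked on Python 3.8 or later with a sketch like the one below. The helper names `compiler_rejects` and `tokenize_rejects` are hypothetical, and `print 1` from the earlier message is rewritten as `print(1)`. Which exception type the tokenize module raises is assumed to be either `TokenError` or `SyntaxError`, depending on the version.

```python
import io
import tokenize


def compiler_rejects(source):
    """Return True if compile() raises SyntaxError for the source."""
    try:
        compile(source, "<string>", "exec")
    except SyntaxError:
        return True
    return False


def tokenize_rejects(source):
    """Return True if the tokenize module raises an error for the source."""
    try:
        # Exhaust the generator so that any error is actually raised.
        list(tokenize.generate_tokens(io.StringIO(source).readline))
    except (tokenize.TokenError, SyntaxError):
        return True
    return False


# A statement that ends with a line continuation at end of input.
s = "print(1)\\\n"
print(compiler_rejects(s), tokenize_rejects(s))
```

When both helpers return the same value, the compiler and the tokenize module agree on how to treat a continuation at EOF.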
msg342817 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-05-18 21:02
Thanks for figuring this one out Anthony! :)
