classification
Title: tokenize: mishandles line joining
Type: behavior Stage: commit review
Components: Extension Modules Versions: Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Anthony Sottile, gregory.p.smith, jaredgrubb, jhylton, meador.inge, miss-islington, rhettinger
Priority: normal Keywords: patch

Created on 2008-02-25 01:55 by jaredgrubb, last changed 2019-05-18 21:02 by gregory.p.smith. This issue is now closed.

Pull Requests
URL Status Linked
PR 13401 merged Anthony Sottile, 2019-05-18 01:39
Messages (8)
msg62956 - Author: Jared Grubb (jaredgrubb) Date: 2008-02-25 01:59
tokenize does not handle line joining properly: the following string is
rejected by the CPython tokenizer but accepted by the tokenize module.

Example 1:
>>> s = "if 1:\n  \\\n  #hey\n  print 1"
>>> exec s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 3
    #hey
       ^
SyntaxError: invalid syntax

>>> tokenize.tokenize(StringIO(s).readline)
1,0-1,2:	NAME	'if'
1,3-1,4:	NUMBER	'1'
1,4-1,5:	OP	':'
1,5-1,6:	NEWLINE	'\n'
2,0-2,2:	INDENT	'  '
3,2-3,6:	COMMENT	'#hey'
3,6-3,7:	NEWLINE	'\n'
4,2-4,7:	NAME	'print'
4,8-4,9:	NUMBER	'1'
5,0-5,0:	DEDENT	''
5,0-5,0:	ENDMARKER	''
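
For readers on Python 3, Example 1 can be approximated as follows. This is a minimal sketch and not part of the original report: `print 1` is rewritten as `print(1)`, `compile()` stands in for the Python 2 `exec` statement, and the helper names `try_compile` and `try_tokenize` are made up for illustration. The outcome depends on the interpreter version, so no expected output is shown.

```python
import io
import tokenize

# Python 3 spelling of the string from Example 1 (print 1 -> print(1)).
SOURCE = "if 1:\n  \\\n  #hey\n  print(1)"


def try_compile(source):
    """Return None if compile() accepts the source, else the error text."""
    try:
        compile(source, "<string>", "exec")
    except SyntaxError as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def try_tokenize(source):
    """Return None if the tokenize module accepts the source, else the error text."""
    try:
        # Drain the generator; errors surface only while iterating.
        for _ in tokenize.generate_tokens(io.StringIO(source).readline):
            pass
    except (tokenize.TokenError, SyntaxError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


print("compile :", try_compile(SOURCE) or "accepted")
print("tokenize:", try_tokenize(SOURCE) or "accepted")
```

If the two lines disagree, the interpreter in use still shows the kind of mismatch described in this message.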
msg62960 - Author: Jared Grubb (jaredgrubb) Date: 2008-02-25 02:22
CPython allows \ at EOF, but tokenize does not.

>>> s = 'print 1\\\n'
>>> exec s
1
>>> tokenize.tokenize(StringIO(s).readline)
1,0-1,5:	NAME	'print'
1,6-1,7:	NUMBER	'1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/tokenize.py", line 153, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/tokenize.py", line 159, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/tokenize.py", line 283, in generate_tokens
    raise TokenError, ("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))
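
The backslash-at-EOF case can be checked on Python 3 in the same spirit. The sketch below is an illustration only: `accepts` is a hypothetical helper, `print 1` is rewritten as `print(1)`, and `compile()` replaces the Python 2 `exec` statement. The changeset recorded later in this thread is titled as treating a line continuation at EOF as a `SyntaxError`, so results are likely to differ between interpreter versions; no expected output is shown for that reason.

```python
import io
import tokenize


def accepts(source):
    """Return (compile_ok, tokenize_ok) for a source string."""
    try:
        compile(source, "<string>", "exec")
        compile_ok = True
    except SyntaxError:
        compile_ok = False

    try:
        # Iterate to the end; tokenize reports errors lazily.
        for _ in tokenize.generate_tokens(io.StringIO(source).readline):
            pass
        tokenize_ok = True
    except (tokenize.TokenError, SyntaxError):
        tokenize_ok = False

    return compile_ok, tokenize_ok


# A statement followed by a line continuation and then end of input.
SOURCE = "print(1)\\\n"
print("compile accepts: %s, tokenize accepts: %s" % accepts(SOURCE))
```

The two booleans are the interesting part: the bug report is about them differing, not about either value on its own.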
msg116977 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-20 21:26
Nobody appears to be interested, so I'll close this in a couple of weeks unless someone objects or a patch is provided.
msg116985 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-09-20 21:51
Mark, please stop closing these based on age.
There needs to be a determination whether this
is a valid bug.  If so, then a patch is needed.
If not, it can be closed.
msg143716 - Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-09-08 01:39
That syntax error is coming from the CPython parser and *not* the tokenizer.  Both CPython and the 'tokenize' module produce the same tokenization:

[meadori@motherbrain cpython]$ cat repro.py
if 1:
  \

  pass
[meadori@motherbrain cpython]$ ./python tokenize.py repro.py 
0,0-0,0:        ENCODING        'utf-8'
1,0-1,2:        NAME            'if'
1,3-1,4:        NUMBER          '1'
1,4-1,5:        OP              ':'
1,5-1,6:        NEWLINE         '\n'
2,0-2,2:        INDENT          '  '
3,0-3,1:        NEWLINE         '\n'
4,2-4,6:        NAME            'pass'
4,6-4,7:        NEWLINE         '\n'
5,0-5,0:        DEDENT          ''
5,0-5,0:        ENDMARKER       ''
[44319 refs]
[meadori@motherbrain cpython]$ ./python -d repro.py | grep Token | tail -10
  File "repro.py", line 3
    
    ^
SyntaxError: invalid syntax
[44305 refs]
Token NEWLINE/'' ... It's a token we know
Token DEDENT/'' ... It's a token we know
Token NEWLINE/'' ... It's a token we know
Token ENDMARKER/'' ... It's a token we know
Token NAME/'if' ... It's a keyword
Token NUMBER/'1' ... It's a token we know
Token COLON/':' ... It's a token we know
Token NEWLINE/'' ... It's a token we know
Token INDENT/'' ... It's a token we know
Token NEWLINE/'' ... It's a token we know

The NEWLINE INDENT NEWLINE tokenization causes the parser to choke because 'suite' nonterminals [1]:

suite: simple_stmt | NEWLINE INDENT stmt+ DEDENT

are defined as NEWLINE INDENT followed by one or more statements, and a statement cannot begin with NEWLINE.

It seems appropriate that the NEWLINE after INDENT should be dropped by both tokenizers.  In other words, I think:
"""
if 1:
  \

  pass
"""

should produce the same tokenization as:

"""
if 1:
  
  pass
"""

This seems consistent with how explicit line joining is defined [2].


[1] http://hg.python.org/cpython/file/92842e347d98/Grammar/Grammar
[2] http://docs.python.org/reference/lexical_analysis.html#explicit-line-joining
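
The comparison proposed above can be tried with the tokenize module directly. This is a rough sketch under stated assumptions: `token_names` is a hypothetical helper, the two strings are transcriptions of the snippets quoted in this message, and only the Python-level tokenize module is exercised (not the C tokenizer). Token streams vary across Python versions, so no expected output is given.

```python
import io
import tokenize


def token_names(source):
    """Return (list of token type names, error text or None)."""
    names = []
    error = None
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            names.append(tokenize.tok_name[tok.type])
    except (tokenize.TokenError, SyntaxError) as exc:
        # Keep whatever was tokenized before the failure.
        error = f"{type(exc).__name__}: {exc}"
    return names, error


# The snippet with a backslash continuation followed by an empty line.
WITH_BACKSLASH = "if 1:\n  \\\n\n  pass\n"
# The snippet it is argued to be equivalent to.
WITHOUT_BACKSLASH = "if 1:\n  \n  pass\n"

for label, src in (("with backslash", WITH_BACKSLASH),
                   ("without backslash", WITHOUT_BACKSLASH)):
    names, error = token_names(src)
    print(label, names, error)
```

If the proposal holds for a given interpreter, the two printed name lists should match apart from blank-line tokens such as NL.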
msg339576 - Author: Anthony Sottile (Anthony Sottile) * Date: 2019-04-07 14:32
Here's an example in the wild which still reproduces with python3.8a3:

https://github.com/SecureAuthCorp/impacket/blob/194b22ed2fc85c4f241375fb7ebe4e0d89626c8c/impacket/examples/remcomsvc.py#L1669

This was reported as a bug on flake8:

https://gitlab.com/pycqa/flake8/issues/532

Here's the reproduction with python3.8:

$ python3.8 --version --version
Python 3.8.0a3 (default, Mar 27 2019, 03:46:44) 
[GCC 7.3.0]
$ python3.8 impacket/examples/remcomsvc.py 
$ python3.8 -mtokenize impacket/examples/remcomsvc.py 
impacket/examples/remcomsvc.py:1670:0: error: EOF in multi-line statement
msg342807 - Author: miss-islington (miss-islington) Date: 2019-05-18 18:27
New changeset abea73bf4a320ff658c9a98fef3d948a142e61a9 by Miss Islington (bot) (Anthony Sottile) in branch 'master':
bpo-2180: Treat line continuation at EOF as a `SyntaxError` (GH-13401)
https://github.com/python/cpython/commit/abea73bf4a320ff658c9a98fef3d948a142e61a9
msg342817 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-05-18 21:02
Thanks for figuring this one out Anthony! :)
History
Date User Action Args
2019-05-18 21:02:56 gregory.p.smith set status: open -> closed
    resolution: fixed
    messages: + msg342817
    stage: patch review -> commit review
2019-05-18 18:27:30 miss-islington set nosy: + miss-islington
    messages: + msg342807
2019-05-18 07:16:47 gregory.p.smith set assignee: gregory.p.smith
    nosy: + gregory.p.smith
2019-05-18 01:39:12 Anthony Sottile set keywords: + patch
    stage: needs patch -> patch review
    pull_requests: + pull_request13312
2019-04-07 14:32:44 Anthony Sottile set nosy: + Anthony Sottile
    messages: + msg339576
    versions: + Python 3.8, Python 3.9, - Python 3.1, Python 2.7, Python 3.2
2014-02-03 19:15:35 BreamoreBoy set nosy: - BreamoreBoy
2011-09-08 01:39:11 meador.inge set messages: + msg143716
    stage: test needed -> needs patch
2010-09-27 03:19:42 meador.inge set nosy: + meador.inge
2010-09-20 21:51:51 rhettinger set status: pending -> open
    nosy: + rhettinger
    messages: + msg116985
    assignee: jhylton -> (no value)
2010-09-20 21:26:23 BreamoreBoy set status: open -> pending
    nosy: + BreamoreBoy
    messages: + msg116977
2010-08-21 17:06:34 BreamoreBoy unlink issue1230484 dependencies
2010-08-21 17:03:39 BreamoreBoy set versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2009-02-16 02:26:11 ajaksu2 link issue1230484 dependencies
2009-02-16 02:20:41 ajaksu2 set stage: test needed
    versions: + Python 2.6, - Python 2.5
2008-03-20 03:08:15 jafo set assignee: jhylton
    nosy: + jhylton
2008-02-25 02:22:29 jaredgrubb set messages: + msg62960
2008-02-25 01:59:17 jaredgrubb set messages: + msg62956
2008-02-25 01:55:51 jaredgrubb create