classification
Title: tokenize unconditionally emits NL after comment lines & blank lines
Type: enhancement
Stage:
Components:
Versions: Python 3.6
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: martin.panter, meador.inge, takluyver
Priority: normal
Keywords:

Created on 2013-01-28 11:14 by takluyver, last changed 2015-10-05 02:10 by martin.panter.

Messages (4)
msg180846 - Author: Thomas Kluyver (takluyver) * Date: 2013-01-28 11:14
The docs describe the NL token as "Token value used to indicate a non-terminating newline. The NEWLINE token indicates the end of a logical line of Python code; NL tokens are generated when a logical line of code is continued over multiple physical lines."

However, after a comment or a blank line, tokenize emits NL, even when it's not inside a multi-line statement. For example:

In [15]: for tok in tokenize.generate_tokens(StringIO('#comment\n').readline):  print(tok)
TokenInfo(type=54 (COMMENT), string='#comment', start=(1, 0), end=(1, 8), line='#comment\n')
TokenInfo(type=55 (NL), string='\n', start=(1, 8), end=(1, 9), line='#comment\n')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

This makes it difficult to use tokenize to detect multi-line statements, as we want to do in IPython.

In my tests so far, changing two instances of NL to NEWLINE in this block (lines 530 & 533) makes it behave as I expect:
http://hg.python.org/cpython/file/a375c3d88c7e/Lib/tokenize.py#l524
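
The behaviour reported above can be reproduced outside IPython with a short script. This is only a minimal sketch: the helper name `token_names` is made up for illustration and is not part of the tokenize module.

```python
import io
import tokenize


def token_names(source):
    """Return the token type names that tokenize produces for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


# A comment-only line ends with NL rather than NEWLINE.
print(token_names("#comment\n"))

# A blank line between two statements also yields NL.
print(token_names("x = 1\n\ny = 2\n"))
```

In both cases the NL token appears outside any bracketed expression, which is the situation the report is about.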
msg181241 - Author: Meador Inge (meador.inge) * (Python committer) Date: 2013-02-03 04:12
The current behavior seems consistent with the lexical definition for
blank lines [1]:

"""
A logical line that contains only spaces, tabs, formfeeds and possibly a
comment, is ignored (i.e., no NEWLINE token is generated).
"""

NL and COMMENT are used for items that the CPython tokenizer
ignores (and are not really tokens).  Also, the test suite explicitly
tests for this case.

Perhaps the tokenize documentation should be updated
to say something like:

"""
NL tokens are generated when a logical line of code is continued over
multiple physical lines and for blank lines.
"""

[1] http://docs.python.org/3.4/reference/lexical_analysis.html#blank-lines
msg182034 - Author: Thomas Kluyver (takluyver) * Date: 2013-02-13 14:11
Hmm, that's interesting.

For our purposes, a blank line or a comment line shouldn't result in a continuation prompt. This is consistent with what the plain Python shell does.

As part of this, we're tokenizing the code, and if the final \n results in a NL token (instead of NEWLINE), we wait to build a 'Python line'. (Likewise if the final \n doesn't appear before EOFError, indicating that a string continues to the next line). Since tokenize doesn't expose parenlev (parentheses level), my modification to tokenize makes this work as we need.

Maybe another way forward would be to make parenlev accessible in some way, so that we can use that rather than using NL == parenlev > 0?
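
Since parenlev is not exposed, one possible workaround is to recompute a bracket depth from the OP tokens and use it to classify each NL token. The sketch below assumes that counting `(`, `[`, `{` and their closers is a good enough stand-in for tokenize's internal counter. The function name `nl_inside_brackets` is hypothetical, and this is not necessarily how IPython handles it.

```python
import io
import tokenize

OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}


def nl_inside_brackets(source):
    """Return (line_number, inside_brackets) for every NL token in source."""
    depth = 0  # our own approximation of tokenize's internal parenlev
    result = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.OP:
            if tok.string in OPENERS:
                depth += 1
            elif tok.string in CLOSERS:
                depth -= 1
        elif tok.type == tokenize.NL:
            # depth > 0 means the NL continues a bracketed expression;
            # depth == 0 means it follows a comment or a blank line.
            result.append((tok.start[0], depth > 0))
    return result


source = "x = (1,\n     2)\n# comment\n\ny = 3\n"
print(nl_inside_brackets(source))
```

With this input, the NL on line 1 is reported as inside brackets, while the NL tokens on lines 3 and 4 (the comment and the blank line) are not.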
msg252297 - Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-05 02:10
The plain Python shell does respond to lines with only a comment and/or horizontal space with a continuation prompt. It only treats completely blank lines without any horizontal space specially:

>>>     
... # Indented blank line above; completely blank line below:
... 
>>> 

Meador: The documentation already says what you proposed: “NL tokens are generated when a logical line of code is continued over multiple physical lines” <https://docs.python.org/dev/library/tokenize.html#tokenize.NL>.

Thomas: It sounds like you actually want to differentiate newlines inside bracketed expressions from newlines outside of statements. I think this would require a new feature.

Also, I noticed that a newline escaped with a backslash (an explicit line continuation) doesn’t seem to generate any token at all. I’m not sure whether this is a bug or intended, but it does seem inconsistent with the other uses of the NL token.

$ ./python -btWall -m tokenize
1 + \
1,0-1,1:            NUMBER         '1'            
1,2-1,3:            OP             '+'            
1
2,0-2,1:            NUMBER         '1'            
2,1-2,2:            NEWLINE        '\n'           
3,0-3,0:            ENDMARKER      ''
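
The same observation can be checked from a script rather than the command line. This is a small sketch using only `tokenize.generate_tokens`; the variable names are illustrative.

```python
import io
import tokenize

# "1 + \" followed by "1": an explicit (backslash) line continuation.
source = "1 + \\\n1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))

# No NL token is emitted between the "+" on row 1 and the "1" on row 2.
names = [tokenize.tok_name[tok.type] for tok in tokens]
print(names)
```

The only trace of the continuation is the jump in row number between two consecutive tokens, so a caller that wants to detect it has to compare `end` and `start` positions itself.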
History
Date                 User           Action  Args
2015-10-05 02:10:31  martin.panter  set     versions: + Python 3.6, - Python 2.6, Python 2.7, Python 3.2, Python 3.3; nosy: + martin.panter; messages: + msg252297; type: behavior -> enhancement
2013-02-13 14:11:36  takluyver      set     messages: + msg182034
2013-02-03 04:12:32  meador.inge    set     type: behavior; messages: + msg181241
2013-02-02 09:55:36  terry.reedy    set     nosy: + meador.inge
2013-01-28 11:14:28  takluyver      create