classification
Title: tokenize.py emits spurious NEWLINE if file ends on a comment without a newline
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lukasz.langa, mdartiailh, miss-islington, pablogsal
Priority: normal Keywords: patch

Created on 2021-07-18 13:50 by mdartiailh, last changed 2021-08-02 09:44 by lukasz.langa. This issue is now closed.

Files
File name Uploaded Description
no_newline_at_end_of_file_with_comment.py mdartiailh, 2021-07-18 13:50
Pull Requests
URL Status Linked
PR 27499 merged pablogsal, 2021-07-30 22:28
PR 27500 merged miss-islington, 2021-07-31 01:17
PR 27501 merged miss-islington, 2021-07-31 01:17
Messages (4)
msg397750 - Author: Matthieu Dartiailh (mdartiailh) * Date: 2021-07-18 13:50
Using tokenize.py to tokenize the attached file yields:
0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            OP             ':'
1,5-1,7:            NEWLINE        '\r\n'
2,0-2,4:            INDENT         '    '
2,4-2,5:            NAME           'b'
2,6-2,7:            OP             '='
2,8-2,9:            NUMBER         '1'
2,9-2,11:           NEWLINE        '\r\n'
3,0-3,2:            NL             '\r\n'
4,0-4,6:            COMMENT        '# test'
4,6-4,6:            NL             ''
4,6-4,7:            NEWLINE        ''
5,0-5,0:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

This output is wrong: it emits two newline tokens after the comment. The first, an NL, is correct. The second, a NEWLINE, is not, since there is no code preceding it on that line.
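A minimal sketch for reproducing the listing above without the attachment. The source bytes are an assumption reconstructed from the token positions shown (the attached file itself is not quoted in this report), and the helper names are made up for illustration. Whether a NEWLINE appears after the comment depends on the interpreter version being run.

```python
import io
import tokenize

# Assumed reconstruction of the attached file: CRLF line endings and
# no newline after the final comment.
SOURCE = b"if a:\r\n    b = 1\r\n\r\n# test"


def token_names(source: bytes) -> list:
    """Return the token type names that tokenize produces for `source`."""
    readline = io.BytesIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.tokenize(readline)]


def newlines_after_last_comment(source: bytes) -> int:
    """Count NEWLINE tokens emitted after the last COMMENT token."""
    names = token_names(source)
    last_comment = len(names) - 1 - names[::-1].index("COMMENT")
    return names[last_comment + 1:].count("NEWLINE")


if __name__ == "__main__":
    for name in token_names(SOURCE):
        print(name)
    # A non-zero count here corresponds to the spurious NEWLINE reported above.
    print("NEWLINE tokens after the comment:", newlines_after_last_comment(SOURCE))
```

Appending `b"\r\n"` to `SOURCE` gives the second case described in this report, where only an NL follows the comment.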

If a new line is added at the end of the file, one gets:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            OP             ':'
1,5-1,7:            NEWLINE        '\r\n'
2,0-2,4:            INDENT         '    '
2,4-2,5:            NAME           'b'
2,6-2,7:            OP             '='
2,8-2,9:            NUMBER         '1'
2,9-2,11:           NEWLINE        '\r\n'
3,0-3,2:            NL             '\r\n'
4,0-4,6:            COMMENT        '# test'
4,6-4,8:            NL             '\r\n'
5,0-5,0:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

Similarly, if code is added before the comment on the last line, a single NEWLINE is generated (with no text, since it is a fake token that has no counterpart in the file).

The extra NEWLINE found when tokenizing the attached file can cause issues when parsing the file. It was found in https://github.com/we-like-parsers/pegen/pull/11#issuecomment-881926767 where a pure Python parser based on pegen is being built. The extra NEWLINE is a problem because the grammar does not accept a NEWLINE at the end of a block, so parsing fails when using the same rules as the Python grammar, whereas the CPython parser handles this file without any issue.

I believe this issue stems from https://github.com/python/cpython/blob/3.9/Lib/tokenize.py#L605 where the check does not account for a last line that consists only of a comment. Adding a check for whether the line starts with a # should be sufficient to avoid emitting the extra NEWLINE.
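The proposed check can be sketched as a standalone predicate. This is only an illustration of the idea, not the code in tokenize.py or the merged patch; the function name `needs_implicit_newline` is hypothetical, and stripping leading whitespace before looking for the # is an assumption so that indented comments are covered too.

```python
def needs_implicit_newline(last_line: str) -> bool:
    """Decide whether a fake NEWLINE should follow the last line of input.

    Illustrative only: mirrors the idea of skipping the implicit NEWLINE
    when the last line holds nothing but a comment.
    """
    if not last_line:
        return False  # empty input: nothing to terminate
    if last_line[-1] in "\r\n":
        return False  # the line already ends with a real newline
    # A comment-only line is not a logical line, so no NEWLINE is needed.
    return not last_line.lstrip().startswith("#")


if __name__ == "__main__":
    for line in ["b = 1", "b = 1  # test", "# test", "    # test", "b = 1\n"]:
        print(repr(line), needs_implicit_newline(line))
```

A last line with code before the comment, such as `b = 1  # test`, still gets the implicit NEWLINE under this check, which matches the behavior described earlier in this message.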
msg398615 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-07-31 01:17
New changeset b6bde9fc42aecad5be0457198d17cfe7b481ad79 by Pablo Galindo Salgado in branch 'main':
bpo-44667: Treat correctly lines ending with comments and no newlines in the Python tokenizer (GH-27499)
https://github.com/python/cpython/commit/b6bde9fc42aecad5be0457198d17cfe7b481ad79
msg398738 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-08-02 09:43
New changeset 33a4010198aad31346e46dafdda17e02f8349017 by Miss Islington (bot) in branch '3.10':
bpo-44667: Treat correctly lines ending with comments and no newlines in the Python tokenizer (GH-27499) (GH-27500)
https://github.com/python/cpython/commit/33a4010198aad31346e46dafdda17e02f8349017
msg398739 - Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021-08-02 09:44
New changeset 2d11797c81be3ae776e418a5ba507098356d357c by Miss Islington (bot) in branch '3.9':
bpo-44667: Treat correctly lines ending with comments and no newlines in the Python tokenizer (GH-27499) (GH-27501)
https://github.com/python/cpython/commit/2d11797c81be3ae776e418a5ba507098356d357c
History
Date User Action Args
2021-08-02 09:44:05 lukasz.langa set messages: + msg398739
2021-08-02 09:43:49 lukasz.langa set nosy: + lukasz.langa; messages: + msg398738
2021-07-31 01:17:51 pablogsal set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-07-31 01:17:35 miss-islington set pull_requests: + pull_request26019
2021-07-31 01:17:30 miss-islington set nosy: + miss-islington; pull_requests: + pull_request26018
2021-07-31 01:17:12 pablogsal set messages: + msg398615
2021-07-30 22:28:57 pablogsal set keywords: + patch; nosy: + pablogsal; pull_requests: + pull_request26017; stage: patch review
2021-07-18 13:50:13 mdartiailh create