
classification
Title: tokenize.py emits spurious NEWLINE if file ends on a comment without a newline
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lukasz.langa, mdartiailh, miss-islington, pablogsal, terry.reedy
Priority: normal Keywords: patch

Created on 2021-07-18 13:50 by mdartiailh, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
no_newline_at_end_of_file_with_comment.py mdartiailh, 2021-07-18 13:50
Pull Requests
URL Status Linked
PR 27499 merged pablogsal, 2021-07-30 22:28
PR 27500 merged miss-islington, 2021-07-31 01:17
PR 27501 merged miss-islington, 2021-07-31 01:17
Messages (5)
msg397750 - Author: Matthieu Dartiailh (mdartiailh) Date: 2021-07-18 13:50
Using tokenize.py to tokenize the attached file yields:
0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            OP             ':'
1,5-1,7:            NEWLINE        '\r\n'
2,0-2,4:            INDENT         '    '
2,4-2,5:            NAME           'b'
2,6-2,7:            OP             '='
2,8-2,9:            NUMBER         '1'
2,9-2,11:           NEWLINE        '\r\n'
3,0-3,2:            NL             '\r\n'
4,0-4,6:            COMMENT        '# test'
4,6-4,6:            NL             ''
4,6-4,7:            NEWLINE        ''
5,0-5,0:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

This output is wrong: it emits two newline tokens, an NL, which is correct, and a NEWLINE, which is not, since there is no preceding code.

If a new line is added at the end of the file, one gets:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,2:            NAME           'if'
1,3-1,4:            NAME           'a'
1,4-1,5:            OP             ':'
1,5-1,7:            NEWLINE        '\r\n'
2,0-2,4:            INDENT         '    '
2,4-2,5:            NAME           'b'
2,6-2,7:            OP             '='
2,8-2,9:            NUMBER         '1'
2,9-2,11:           NEWLINE        '\r\n'
3,0-3,2:            NL             '\r\n'
4,0-4,6:            COMMENT        '# test'
4,6-4,8:            NL             '\r\n'
5,0-5,0:            DEDENT         ''
5,0-5,0:            ENDMARKER      ''

Similarly, if code is added before the comment, a single NEWLINE is generated (with no text, since it is fake).
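The comparison above can be reproduced with a short script. This is only a sketch: the attached file is not reproduced here, so the source string is reconstructed from the token listing, and it uses "\n" instead of "\r\n" line endings for simplicity. The helper names are hypothetical.

```python
import io
import tokenize


def token_names(source):
    """Return the names of the token types tokenize produces for source."""
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.tokenize(readline)]


def names_after_last_comment(source):
    """Return the token names that follow the last COMMENT token."""
    names = token_names(source)
    last = len(names) - 1 - names[::-1].index("COMMENT")
    return names[last + 1:]


# Assumption: content reconstructed from the listing in this report,
# with "\n" line endings and no newline after the final comment.
BASE = "if a:\n    b = 1\n\n# test"

if __name__ == "__main__":
    print("no trailing newline:  ", names_after_last_comment(BASE))
    print("with trailing newline:", names_after_last_comment(BASE + "\n"))
```

On an affected interpreter the first list should contain the extra NEWLINE described above; on a version that includes the fix, neither list should.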

The extra NEWLINE found when tokenizing the attached file can cause issues when parsing the file. It was found in https://github.com/we-like-parsers/pegen/pull/11#issuecomment-881926767 where a pure-Python parser based on pegen is being built. The extra NEWLINE is a problem because the grammar does not accept a NEWLINE at the end of a block, so parsing fails when using the same rules as the Python grammar, while the CPython parser handles this file without any issue.

I believe this issue stems from https://github.com/python/cpython/blob/3.9/Lib/tokenize.py#L605 where the check does not account for a last line consisting only of a comment. Adding a check to determine whether the line starts with a # should be sufficient to avoid emitting the extra NEWLINE.
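For a parser that has to run on an interpreter without a fix, one possible workaround is to filter the token stream before parsing. The sketch below is not the change proposed above for tokenize.py; it is a hypothetical consumer-side filter, and it assumes the spurious token always appears as an empty NEWLINE directly after an empty NL, as in the listing in this report.

```python
import token
from tokenize import TokenInfo


def drop_spurious_newline(tokens):
    """Yield tokens, skipping an empty NEWLINE that directly follows an empty NL.

    Assumption: an NL with an empty string only occurs after a comment on a
    last line that has no trailing newline, so this pattern identifies the
    spurious NEWLINE.
    """
    previous = None
    for tok in tokens:
        if (
            tok.type == token.NEWLINE
            and tok.string == ""
            and previous is not None
            and previous.type == token.NL
            and previous.string == ""
        ):
            continue  # drop the spurious NEWLINE
        previous = tok
        yield tok


if __name__ == "__main__":
    # Synthetic tail of the token stream shown in this report.
    tail = [
        TokenInfo(token.COMMENT, "# test", (4, 0), (4, 6), "# test"),
        TokenInfo(token.NL, "", (4, 6), (4, 6), "# test"),
        TokenInfo(token.NEWLINE, "", (4, 6), (4, 7), ""),
        TokenInfo(token.ENDMARKER, "", (5, 0), (5, 0), ""),
    ]
    for tok in drop_spurious_newline(tail):
        print(token.tok_name[tok.type], repr(tok.string))
```

The fake NEWLINE that tokenize generates after code on a last line without a newline is left untouched, because it follows a code token rather than an NL.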
msg398615 - Author: Pablo Galindo Salgado (pablogsal) (Python committer) Date: 2021-07-31 01:17
New changeset b6bde9fc42aecad5be0457198d17cfe7b481ad79 by Pablo Galindo Salgado in branch 'main':
bpo-44667: Treat correctly lines ending with comments and no newlines in the Python tokenizer (GH-27499)
https://github.com/python/cpython/commit/b6bde9fc42aecad5be0457198d17cfe7b481ad79
msg398738 - Author: Łukasz Langa (lukasz.langa) (Python committer) Date: 2021-08-02 09:43
New changeset 33a4010198aad31346e46dafdda17e02f8349017 by Miss Islington (bot) in branch '3.10':
bpo-44667: Treat correctly lines ending with comments and no newlines in the Python tokenizer (GH-27499) (GH-27500)
https://github.com/python/cpython/commit/33a4010198aad31346e46dafdda17e02f8349017
msg398739 - Author: Łukasz Langa (lukasz.langa) (Python committer) Date: 2021-08-02 09:44
New changeset 2d11797c81be3ae776e418a5ba507098356d357c by Miss Islington (bot) in branch '3.9':
bpo-44667: Treat correctly lines ending with comments and no newlines in the Python tokenizer (GH-27499) (GH-27501)
https://github.com/python/cpython/commit/2d11797c81be3ae776e418a5ba507098356d357c
msg410980 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2022-01-19 20:49
This appears to have been a duplicate of #35107, where the failing example was '#' and the NL, NEWLINE pair was noted. So this either predates 3.9 or was re-introduced. In any case, thanks for the fix.
History
Date                 User            Action  Args
2022-04-11 14:59:47  admin           set     github: 88833
2022-01-19 20:53:27  terry.reedy     set     messages: - msg410981
2022-01-19 20:51:45  terry.reedy     set     messages: + msg410981
                                             versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.8
2022-01-19 20:50:24  terry.reedy     link    issue35107 superseder
2022-01-19 20:49:38  terry.reedy     set     nosy: + terry.reedy
                                             messages: + msg410980
2021-08-02 09:44:05  lukasz.langa    set     messages: + msg398739
2021-08-02 09:43:49  lukasz.langa    set     nosy: + lukasz.langa
                                             messages: + msg398738
2021-07-31 01:17:51  pablogsal       set     status: open -> closed
                                             resolution: fixed
                                             stage: patch review -> resolved
2021-07-31 01:17:35  miss-islington  set     pull_requests: + pull_request26019
2021-07-31 01:17:30  miss-islington  set     nosy: + miss-islington
                                             pull_requests: + pull_request26018
2021-07-31 01:17:12  pablogsal       set     messages: + msg398615
2021-07-30 22:28:57  pablogsal       set     keywords: + patch
                                             nosy: + pablogsal
                                             pull_requests: + pull_request26017
                                             stage: patch review
2021-07-18 13:50:13  mdartiailh      create