Classification
Title: untokenize() fails on tokenize output when a newline is missing
Type:              Stage:
Components:        Versions: Python 3.8, Python 3.7, Python 3.6

Process
Status: open       Resolution:
Dependencies:      Superseder:
Assigned To:       Nosy List: ammar2, gregory.p.smith, meador.inge, pablogsal, serhiy.storchaka, taleinat, terry.reedy
Priority: normal   Keywords:

Created on 2018-10-29 22:26 by gregory.p.smith, last changed 2018-10-30 14:59 by terry.reedy.

Messages (7)
msg328876 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2018-10-29 22:26
The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

```python
>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
    out = ut.untokenize(iterable)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
    self.add_whitespace(start)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
    raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,1) precedes previous end (2,0)
```

The same goes for using the documented tokenize API (`generate_tokens` is not documented):

```
tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))
...
ValueError: start (1,1) precedes previous end (2,0)
```

`untokenize()` is no longer able to work on the output of `generate_tokens()` if the input to `generate_tokens()` did not end in a newline.

Today's workaround: wrap the readline callable passed to tokenize or generate_tokens so that a newline is appended to the final line if it is missing one.  Very annoying to implement.
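
A minimal, illustrative sketch of such a wrapper (the helper name `_readline_appending_newline` is made up for this example, and it assumes a str-based readline as used with `generate_tokens`):

```python
import io
import tokenize

def _readline_appending_newline(readline):
    """Illustrative wrapper: ensure the last non-empty line ends with '\n'.

    Assumes a str-based readline, as used with tokenize.generate_tokens().
    """
    def wrapped():
        line = readline()
        if line and not line.endswith('\n'):
            return line + '\n'
        return line
    return wrapped

# Round-trips without the ValueError, at the cost of the result always
# ending in a newline.
tokens = tokenize.generate_tokens(
    _readline_appending_newline(io.StringIO('#').readline))
print(repr(tokenize.untokenize(tokens)))  # '#\n'
```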
msg328878 - Author: Ammar Askar (ammar2) (Python triager) Date: 2018-10-29 22:49
Looks like this is caused by these lines here:

https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L551-L558

which implicitly add a newline (NL) token after comments. Since the input didn't terminate with a '\n', the code that adds a newline at the end of input also kicks in.
msg328879 - Author: Ammar Askar (ammar2) (Python triager) Date: 2018-10-29 23:21
FWIW, I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change was applied): the '#' case works as expected, but check out this input:

```
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python365\lib\tokenize.py", line 272, in untokenize
    self.add_whitespace(start)
  File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
    .format(row, col, self.prev_row, self.prev_col))
ValueError: start (1,0) precedes previous end (2,0)
```
msg328880 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2018-10-30 00:39
Interesting!  I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior.

I don't know what the behavior _should_ be.  It just feels natural that untokenize should be able to round trip anything tokenize or generate_tokens emits without raising an exception.

I'm filing this because the "#" case came up in some existing code we had that happened to effectively test that particular round trip.
msg328882 - Author: Ammar Askar (ammar2) (Python triager) Date: 2018-10-30 00:56
Actually, never mind, disregard that; I was just testing it wrong. I think the simplest fix here is to add '#' to the list of characters checked here, so we don't double-insert newlines for comments: https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L659

And a test that round-trips a file ending with a comment but no newline will allow that particular branch to be tested.
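
A rough, illustrative sketch of such a test (standalone here; the class and test names are made up, and the real test would live in Lib/test/test_tokenize.py and reuse its helpers):

```python
import io
import tokenize
import unittest

class CommentNoNewlineRoundTrip(unittest.TestCase):
    # Illustrative sketch only -- not the actual CPython test.
    def test_comment_without_trailing_newline(self):
        source = '# a comment with no trailing newline'
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        # untokenize() should not raise, and the comment text must survive
        # the round trip.
        result = tokenize.untokenize(tokens)
        self.assertIn('# a comment with no trailing newline', result)

if __name__ == '__main__':
    unittest.main()
```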

I'll make a PR this week if no one else gets to it.
msg328884 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2018-10-30 06:45
I am surprised that removing the newline character adds a token:

```
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
 TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
 TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```
msg328927 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2018-10-30 14:59
It seems to me a bug that, if '\n' is not present, tokenize adds both NL and NEWLINE tokens instead of just one of them.  Moreover, both tuples of the doubled-up correction look wrong.

If '\n' is present,
  TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n')
looks correct.

If NL represents a real character, the length 0 string='' in the generated
  TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
seems wrong.  I suspect that the idea was to mis-represent NL to avoid '\n' being added by untokenize.  In
  TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')
string='' is mismatched with the span length end - start = 2 - 1 = 1.  I am inclined to think that the following would be the correct added token, which should untokenize correctly:
  TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='')

ast.dump(ast.parse(s)) returns 'Module(body=[])' for both versions of 's', so no help there.
History
Date                 User              Action  Args
2018-10-30 14:59:51  terry.reedy       set     messages: + msg328927
2018-10-30 06:45:57  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg328884
2018-10-30 00:56:47  ammar2            set     messages: + msg328882
2018-10-30 00:39:32  gregory.p.smith   set     messages: + msg328880
2018-10-29 23:21:18  ammar2            set     messages: + msg328879
2018-10-29 23:16:58  pablogsal         set     nosy: + pablogsal
2018-10-29 22:49:32  ammar2            set     messages: + msg328878
2018-10-29 22:26:38  gregory.p.smith   create