This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: untokenize() fails on tokenize output when a newline is missing
Stage: resolved
Versions: Python 3.8, Python 3.7, Python 3.6

process
Status: closed
Resolution: duplicate
Superseder: #44667, tokenize.py emits spurious NEWLINE if file ends on a comment without a newline
Priority: normal
Nosy List: ammar2, gregory.p.smith, iritkatriel, meador.inge, pablogsal, serhiy.storchaka, taleinat, terry.reedy

Created on 2018-10-29 22:26 by gregory.p.smith, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (9)
msg328876 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2018-10-29 22:26
The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

```python
>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
    out = ut.untokenize(iterable)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
    self.add_whitespace(start)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
    raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,1) precedes previous end (2,0)
```

The same goes for using the documented tokenize API (`generate_tokens` is not documented):

```python
>>> tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))
...
ValueError: start (1,1) precedes previous end (2,0)
```

`untokenize()` is no longer able to round-trip the output of `generate_tokens()` if the input to `generate_tokens()` did not end in a newline.

Today's workaround: wrap the readline callable passed to tokenize or generate_tokens so that it appends a newline to the final line when one is missing.  Very annoying to implement.
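
A minimal sketch of that workaround (an editorial illustration; the wrapper name `_readline_with_final_newline` is made up, not part of the tokenize API):

```python
import io
import tokenize

def _readline_with_final_newline(readline):
    # Wrap a readline callable so the final line always ends in a newline.
    def wrapper():
        line = readline()
        if line and not line.endswith('\n'):
            line += '\n'  # pad a final line that lacks its newline
        return line
    return wrapper

src = io.StringIO('#')
tokens = tokenize.generate_tokens(_readline_with_final_newline(src.readline))
print(repr(tokenize.untokenize(tokens)))  # '#\n' -- note the padded newline
```

Note that the round trip is no longer exact: the output gains a trailing newline the input never had.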
msg328878 - (view) Author: Ammar Askar (ammar2) * (Python committer) Date: 2018-10-29 22:49
Looks like this is caused by these lines:

https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L551-L558

which implicitly add a newline token after comments. Since the input didn't terminate with a '\n', the code that adds a newline at the end of input kicks in as well.
msg328879 - (view) Author: Ammar Askar (ammar2) * (Python committer) Date: 2018-10-29 23:21
FWIW, I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change was applied): '#' works as expected, but check out this input:

```python
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python365\lib\tokenize.py", line 272, in untokenize
    self.add_whitespace(start)
  File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
    .format(row, col, self.prev_row, self.prev_col))
ValueError: start (1,0) precedes previous end (2,0)
```
msg328880 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2018-10-30 00:39
Interesting!  I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior.

I don't know what the behavior _should_ be.  It just feels natural that untokenize should be able to round trip anything tokenize or generate_tokens emits without raising an exception.

I'm filing this because the "#" case came up in some existing code we had that happened to effectively test that particular round trip.
msg328882 - (view) Author: Ammar Askar (ammar2) * (Python committer) Date: 2018-10-30 00:56
Actually, never mind, disregard that; I was just testing it wrong. I think the simplest fix here is to add '#' to the list of characters here, so we don't double-insert newlines for comments: https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L659

A test that round-trips a file ending with a comment but no newline will exercise that particular branch.

I'll make a PR this week if no one else gets to it.
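
For illustration, the intent of the proposed guard might look like this (an editorial sketch, assuming the implicit end-of-input NEWLINE should simply be skipped for comment-only final lines; the helper name and exact condition are hypothetical, not the merged patch):

```python
def _needs_implicit_newline(source: str) -> bool:
    # Sketch only: decide whether tokenize should synthesize a NEWLINE
    # token at end of input.  A trailing comment already receives an NL
    # token, so also emitting NEWLINE is what breaks untokenize().
    if not source or source.endswith(('\n', '\r')):
        return False  # input already terminated; nothing to add
    last_line = source.splitlines()[-1]
    return not last_line.lstrip().startswith('#')

assert _needs_implicit_newline('x=1')           # code without newline: yes
assert not _needs_implicit_newline('#')         # comment without newline: no
assert not _needs_implicit_newline('x=1\n')     # already terminated: no
```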
msg328884 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-30 06:45
I am surprised that removing the newline character adds a token:

```python
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
 TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
 TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```
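
For context, an editorial note on why this pair of synthesized tokens breaks untokenize() (assuming the 3.7-era Untokenizer, which advances its bookkeeping position to the start of the next row after every NL or NEWLINE token, even a zero-width one):

```python
# After consuming the zero-width NL ending at (1, 1), Untokenizer bumps
# its "previous end" to the start of the next row, (2, 0).  The synthetic
# NEWLINE that follows starts back at (1, 1), which compares as earlier:
prev_end = (2, 0)        # position after processing the NL token
newline_start = (1, 1)   # start of the synthetic NEWLINE token
assert newline_start < prev_end
# -> ValueError: start (1,1) precedes previous end (2,0)
```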
msg328927 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-10-30 14:59
It seems to me a bug that, if '\n' is not present, tokenize adds both NL and NEWLINE tokens instead of just one of them.  Moreover, both tuples of the doubled correction look wrong.

If '\n' is present,
  TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n')
looks correct.

If NL represents a real character, the length 0 string='' in the generated
  TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
seems wrong.  I suspect that the idea was to mis-represent NL to avoid '\n' being added by untokenize.  In
  TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')
string='' is mismatched with the span length 2 - 1 = 1.  I am inclined to think that the following would be the correct added token, which should untokenize correctly:
  TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='')

ast.dump(ast.parse(s)) returns 'Module(body=[])' for both versions of 's', so no help there.
msg410915 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2022-01-18 23:40
I am unable to reproduce this on 3.11:

```python
>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
```
msg410982 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2022-01-19 20:52
This is a duplicate of #44667, "tokenize.py emits spurious NEWLINE if file ends on a comment without a newline", which was fixed on 3.11, 3.10, and 3.9 in August 2021.
History
Date                 User              Action  Args
2022-04-11 14:59:07  admin             set     github: 79288
2022-01-19 20:52:55  terry.reedy       set     messages: + msg410982
2022-01-19 20:50:24  terry.reedy       set     status: pending -> closed; superseder: tokenize.py emits spurious NEWLINE if file ends on a comment without a newline; resolution: duplicate; stage: resolved
2022-01-18 23:40:36  iritkatriel       set     status: open -> pending; nosy: + iritkatriel; messages: + msg410915
2018-10-30 14:59:51  terry.reedy       set     messages: + msg328927
2018-10-30 06:45:57  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg328884
2018-10-30 00:56:47  ammar2            set     messages: + msg328882
2018-10-30 00:39:32  gregory.p.smith   set     messages: + msg328880
2018-10-29 23:21:18  ammar2            set     messages: + msg328879
2018-10-29 23:16:58  pablogsal         set     nosy: + pablogsal
2018-10-29 22:49:32  ammar2            set     messages: + msg328878
2018-10-29 22:26:38  gregory.p.smith   create