Classification
Title: untokenize() fails on tokenize output when a newline is missing
Type:              Stage:
Components:        Versions: Python 3.8, Python 3.7, Python 3.6

Process
Status: open       Resolution:
Dependencies:      Superseder:
Assigned To:       Nosy List: ammar2, gregory.p.smith, meador.inge, pablogsal, serhiy.storchaka, taleinat, terry.reedy
Priority: normal   Keywords:

Created on 2018-10-29 22:26 by gregory.p.smith, last changed 2018-10-30 14:59 by terry.reedy.

Messages (7)
msg328876 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2018-10-29 22:26
The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

```python
>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
    out = ut.untokenize(iterable)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
    self.add_whitespace(start)
  File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
    raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,1) precedes previous end (2,0)
```

The same goes for using the documented tokenize API (`generate_tokens` is not documented):

```
tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))
...
ValueError: start (1,1) precedes previous end (2,0)
```

`untokenize()` is no longer able to work on the output of `generate_tokens()` if the input to `generate_tokens()` did not end in a newline.

Today's workaround: wrap the readline callable passed to tokenize or generate_tokens so that a newline is appended to the final line if it is missing one.  Very annoying to implement.
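
A minimal, illustrative sketch of such a wrapper (the helper name `_readline_appending_newline` is made up for this example, and it assumes a str-based readline as used with `generate_tokens`):

```python
import io
import tokenize

def _readline_appending_newline(readline):
    """Illustrative wrapper: ensure the last non-empty line ends with '\n'.

    Assumes a str-based readline, as used with tokenize.generate_tokens().
    """
    def wrapped():
        line = readline()
        if line and not line.endswith('\n'):
            return line + '\n'
        return line
    return wrapped

# Round-trips without the ValueError, at the cost of the result always
# ending in a newline.
tokens = tokenize.generate_tokens(
    _readline_appending_newline(io.StringIO('#').readline))
print(repr(tokenize.untokenize(tokens)))  # '#\n'
```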
msg328878 - Author: Ammar Askar (ammar2) (Python triager) Date: 2018-10-29 22:49
Looks like this is caused by these lines here:

https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L551-L558

which implicitly add a newline (NL) token after comments. Since the input didn't terminate with a '\n', the code that adds a newline at the end of input also kicks in.
msg328879 - Author: Ammar Askar (ammar2) (Python triager) Date: 2018-10-29 23:21
FWIW, I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change was applied): the '#' case works as expected, but check out this input:

```
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python365\lib\tokenize.py", line 272, in untokenize
    self.add_whitespace(start)
  File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
    .format(row, col, self.prev_row, self.prev_col))
ValueError: start (1,0) precedes previous end (2,0)
```
msg328880 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2018-10-30 00:39
Interesting!  I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior.

I don't know what the behavior _should_ be.  It just feels natural that untokenize should be able to round trip anything tokenize or generate_tokens emits without raising an exception.

I'm filing this because the "#" case came up in some existing code we had that happened to effectively test that particular round trip.
msg328882 - Author: Ammar Askar (ammar2) (Python triager) Date: 2018-10-30 00:56
Actually, never mind, disregard that; I was just testing it wrong. I think the simplest fix here is to add '#' to the list of characters checked here, so we don't double-insert newlines for comments: https://github.com/python/cpython/blob/b83d917fafd87e4130f9c7d5209ad2debc7219cd/Lib/tokenize.py#L659

And a test that round-trips a file ending with a comment but no newline will allow that particular branch to be tested.
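
A rough, illustrative sketch of such a test (standalone here; the class and test names are made up, and the real test would live in Lib/test/test_tokenize.py and reuse its helpers):

```python
import io
import tokenize
import unittest

class CommentNoNewlineRoundTrip(unittest.TestCase):
    # Illustrative sketch only -- not the actual CPython test.
    def test_comment_without_trailing_newline(self):
        source = '# a comment with no trailing newline'
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        # untokenize() should not raise, and the comment text must survive
        # the round trip.
        result = tokenize.untokenize(tokens)
        self.assertIn('# a comment with no trailing newline', result)

if __name__ == '__main__':
    unittest.main()
```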

I'll make a PR this week if no one else gets to it.
msg328884 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2018-10-30 06:45
I am surprised that removing the newline character adds a token:

```
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
 TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
 TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
 TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
```
msg328927 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2018-10-30 14:59
It seems to me a bug that, if '\n' is not present, tokenize adds both NL and NEWLINE tokens instead of just one of them.  Moreover, both tuples of the doubled-up correction look wrong.

If '\n' is present,
  TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n')
looks correct.

If NL represents a real character, the length 0 string='' in the generated
  TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
seems wrong.  I suspect that the idea was to mis-represent NL to avoid '\n' being added by untokenize.  In
  TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')
string='' is mismatched with the span length end - start = 2 - 1 = 1.  I am inclined to think that the following would be the correct added token, which should untokenize correctly:
  TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='')

ast.dump(ast.parse(s)) returns 'Module(body=[])' for both versions of 's', so no help there.
History
Date                 User              Action  Args
2018-10-30 14:59:51  terry.reedy       set     messages: + msg328927
2018-10-30 06:45:57  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg328884
2018-10-30 00:56:47  ammar2            set     messages: + msg328882
2018-10-30 00:39:32  gregory.p.smith   set     messages: + msg328880
2018-10-29 23:21:18  ammar2            set     messages: + msg328879
2018-10-29 23:16:58  pablogsal         set     nosy: + pablogsal
2018-10-29 22:49:32  ammar2            set     messages: + msg328878
2018-10-29 22:26:38  gregory.p.smith   create