
untokenize() fails on tokenize output when a newline is missing #79288

Closed
gpshead opened this issue Oct 29, 2018 · 9 comments

Labels: 3.7 (EOL) end of life, 3.8 only security fixes

Comments

@gpshead
Member

gpshead commented Oct 29, 2018

BPO 35107
Nosy @terryjreedy, @gpshead, @taleinat, @meadori, @serhiy-storchaka, @ammaraskar, @pablogsal, @iritkatriel
Superseder
  • bpo-44667: tokenize.py emits spurious NEWLINE if file ends on a comment without a newline
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-19.20:50:24.347>
    created_at = <Date 2018-10-29.22:26:38.314>
    labels = ['3.7', '3.8']
    title = 'untokenize() fails on tokenize output when a newline is missing'
    updated_at = <Date 2022-01-19.20:52:55.362>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2022-01-19.20:52:55.362>
    actor = 'terry.reedy'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-19.20:50:24.347>
    closer = 'terry.reedy'
    components = []
    creation = <Date 2018-10-29.22:26:38.314>
    creator = 'gregory.p.smith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35107
    keywords = []
    message_count = 9.0
    messages = ['328876', '328878', '328879', '328880', '328882', '328884', '328927', '410915', '410982']
    nosy_count = 8.0
    nosy_names = ['terry.reedy', 'gregory.p.smith', 'taleinat', 'meador.inge', 'serhiy.storchaka', 'ammar2', 'pablogsal', 'iritkatriel']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '44667'
    type = None
    url = 'https://bugs.python.org/issue35107'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @gpshead
    Member Author

    gpshead commented Oct 29, 2018

    The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

    >>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
        out = ut.untokenize(iterable)
      File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
        self.add_whitespace(start)
      File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
        raise ValueError("start ({},{}) precedes previous end ({},{})"
    ValueError: start (1,1) precedes previous end (2,0)

    The same goes for using the documented tokenize API (generate_tokens is not documented):

    tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))
    ...
    ValueError: start (1,1) precedes previous end (2,0)
    
    

    untokenize() is no longer able to work on output of generate_tokens() if the input to generate_tokens() did not end in a newline.

    Today's workaround: if the last line returned by the readline callable passed to tokenize or generate_tokens is missing a trailing newline, append one. Very annoying to implement.
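
    A minimal sketch of that workaround (the wrapper below is a hypothetical helper written for this note, not part of the tokenize API):

    import io
    import tokenize

    def readline_with_final_newline(readline):
        # Hypothetical wrapper: make sure the last line fed to the tokenizer
        # ends with '\n', so untokenize() can round-trip the token stream
        # without raising ValueError on affected versions.
        def wrapped():
            line = readline()
            if line and not line.endswith('\n'):
                return line + '\n'
            return line
        return wrapped

    tokens = tokenize.generate_tokens(
        readline_with_final_newline(io.StringIO('#').readline))
    print(repr(tokenize.untokenize(tokens)))  # prints '#\n'; the appended newline is now part of the output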

    gpshead added the 3.7 (EOL) end of life and 3.8 only security fixes labels on Oct 29, 2018
    @ammaraskar
    Member

    Looks like this is caused by these lines:

    cpython/Lib/tokenize.py

    Lines 551 to 558 in b83d917

    if line[pos] == '#':
        comment_token = line[pos:].rstrip('\r\n')
        yield TokenInfo(COMMENT, comment_token,
                        (lnum, pos), (lnum, pos + len(comment_token)), line)
        pos += len(comment_token)

    yield TokenInfo(NL, line[pos:],
                    (lnum, pos), (lnum, len(line)), line)

    which adds a newline token implicitly after comments. Since the input didn't terminate with a '\n', the code to add a newline at the end of input kicks in.

    @ammaraskar
    Member

    FWIW, I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change was applied): '#' works as expected, but check out this input:

    >>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
    '#'
    >>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "D:\Python365\lib\tokenize.py", line 272, in untokenize
        self.add_whitespace(start)
      File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
        .format(row, col, self.prev_row, self.prev_col))
    ValueError: start (1,0) precedes previous end (2,0)

    @gpshead
    Member Author

    gpshead commented Oct 30, 2018

    Interesting! I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior.

    I don't know what the behavior _should_ be. It just feels natural that untokenize should be able to round trip anything tokenize or generate_tokens emits without raising an exception.

    I'm filing this because the "#" case came up in some existing code we had that happened to effectively test that particular round trip.

    @ammaraskar
    Member

    Actually, never mind, disregard that; I was just testing it wrong. I think the simplest fix here is to add '#' to the list of characters below so we don't double-insert newlines for comments:

    if last_line and last_line[-1] not in '\r\n':

    And a test that round-trips a file ending with a comment but no newline would exercise that particular branch.

    I'll make a PR this week if no one else gets to it.
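
    Read as a concrete change, the suggestion is to guard the synthetic end-of-input NEWLINE so it is not emitted after a trailing comment. A rough sketch of that guard (one way to write it; an illustration only, not the patch that eventually landed via bpo-44667):

    # Sketch of the end-of-input handling in Lib/tokenize.py: skip the
    # synthetic NEWLINE when the final, newline-less line is a comment,
    # since the comment branch already emitted an NL token for it.
    if (last_line and last_line[-1] not in '\r\n'
            and not last_line.strip().startswith('#')):
        yield TokenInfo(NEWLINE, '', (lnum - 1, len(last_line)),
                        (lnum - 1, len(last_line) + 1), '')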

    @serhiy-storchaka
    Member

    I am surprised that removing the newline character adds a token:

    >>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
    [TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
     TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
     TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
    >>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
    [TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
     TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
     TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
     TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]

    @terryjreedy
    Member

    It seems to me a bug that if '\n' is not present, tokenize adds both NL and NEWLINE tokens, instead of just one of them. Moreover, both tuples of the double correction look wrong.

    If '\n' is present,
    TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n')
    looks correct.

    If NL represents a real character, the length-0 string='' in the generated
    TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
    seems wrong. I suspect that the idea was to mis-represent NL to avoid '\n' being added by untokenize. In
    TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')
    string='' does not match the span length of 2 - 1 = 1. I am inclined to think that the following would be the correct added token, which should untokenize correctly:
    TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='')

    ast.dump(ast.parse(s)) returns 'Module(body=[])' for both versions of 's', so no help there.

    @iritkatriel
    Member

    I am unable to reproduce this on 3.11:

    >>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
    '#'

    @terryjreedy
    Member

    bpo-44667 ("tokenize.py emits spurious NEWLINE if file ends on a comment without a newline") fixed this in 3.9, 3.10, and 3.11 in August 2021.

    ezio-melotti transferred this issue from another repository on Apr 10, 2022
    Sachaa-Thanasius added a commit to Sachaa-Thanasius/jishaku that referenced this issue on Apr 14, 2024:
    - Fix test break, since `tokenize.tokenize` has buggy behavior that wasn't backported.
      - See python/cpython#79288 and python/cpython#88833.
    - Adjust typing usage so everything from typing is prepended with `typing.`.