untokenize() fails on tokenize output when a newline is missing #79288
The behavior change introduced in 3.6.7 and 3.7.1 via https://bugs.python.org/issue33899 has further consequences:

>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../cpython/cpython-upstream/Lib/tokenize.py", line 332, in untokenize
out = ut.untokenize(iterable)
File ".../cpython/cpython-upstream/Lib/tokenize.py", line 266, in untokenize
self.add_whitespace(start)
File ".../cpython/cpython-upstream/Lib/tokenize.py", line 227, in add_whitespace
raise ValueError("start ({},{}) precedes previous end ({},{})"
ValueError: start (1,1) precedes previous end (2,0)

The same goes for using the documented tokenize API (tokenize.tokenize, which reads bytes).

Today's workaround: if the last line returned by the readline callable passed to tokenize or generate_tokens is missing a trailing newline, append one. Very annoying to implement.
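For reference, a sketch of reproducing this through the bytes-based entry point; on affected versions it raises the same ValueError, and on fixed versions it returns b'#':

import io
import tokenize

# tokenize.tokenize() is the documented bytes API; untokenize() returns
# bytes when the token stream starts with an ENCODING token.
tokenize.untokenize(tokenize.tokenize(io.BytesIO(b'#').readline))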
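A minimal sketch of such a wrapper (the helper name is hypothetical, not part of the tokenize API):

import io
import tokenize

def readline_appending_newline(readline):
    # Wrap a readline callable so that a final line lacking '\n'
    # gets one appended before tokenize sees it.
    def wrapped():
        line = readline()
        if line and not line.endswith('\n'):
            line += '\n'
        return line
    return wrapped

tokens = tokenize.generate_tokens(
    readline_appending_newline(io.StringIO('#').readline))
print(repr(tokenize.untokenize(tokens)))  # prints '#\n' -- note the appended newline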
Looks like this is caused by these lines (Lib/tokenize.py, lines 551 to 558 at commit b83d917), which implicitly add an NL token after comments. Since the input didn't terminate with a '\n', the end-of-input code that appends an implicit NEWLINE also kicks in, so both tokens get emitted.
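That region of Lib/tokenize.py looks roughly like this (quoted from memory of the 3.8-era source, so treat it as approximate):

if line[pos] in '#\r\n':           # skip comments or blank lines
    if line[pos] == '#':
        comment_token = line[pos:].rstrip('\r\n')
        yield TokenInfo(COMMENT, comment_token,
                        (lnum, pos), (lnum, pos + len(comment_token)), line)
        pos += len(comment_token)

    yield TokenInfo(NL, line[pos:],
                    (lnum, pos), (lnum, len(line)), line)
    continue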
FWIW, I think there's more at play here than the newline change. This is the behavior I get on 3.6.5 (before the newline change was applied): the '#' case works as expected, but check out this input:

>>> t.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
>>> t.untokenize(tokenize.generate_tokens(io.StringIO('x=1').readline))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python365\lib\tokenize.py", line 272, in untokenize
self.add_whitespace(start)
File "D:\Python365\lib\tokenize.py", line 234, in add_whitespace
.format(row, col, self.prev_row, self.prev_col))
ValueError: start (1,0) precedes previous end (2,0)
Interesting! I have a 3.6.2 sitting around and cannot reproduce that "x=1" behavior. I don't know what the behavior _should_ be; it just feels natural that untokenize should be able to round-trip anything tokenize or generate_tokens emits without raising an exception. I'm filing this because the "#" case came up in some existing code we had that happened to effectively test that particular round trip.
Actually, never mind, disregard that; I was just testing it wrong. I think the simplest fix here is to add '#' to the list of characters here so we don't double-insert newlines for comments (Lib/tokenize.py, line 659 at commit b83d917).

And a test for round-tripping a file that ends with a comment but no newline will allow that particular branch to be tested; a sketch follows below. I'll make a PR this week if no one else gets to it.
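A sketch of the kind of change being proposed, assuming the end-of-input code added by bpo-33899 looks like the snippet below (quoted from memory, so the exact shape is an assumption):

# End of tokenize._tokenize(): add an implicit NEWLINE if the input
# doesn't end in one (approximate 3.8-era code).
if last_line and last_line[-1] not in '\r\n':     # proposed: not in '\r\n#'
    yield TokenInfo(NEWLINE, '',
                    (lnum - 1, pos), (lnum - 1, pos + 1), '')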
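A minimal sketch of what such a round-trip test could look like (the test and class names are hypothetical, not the actual test added to CPython):

import io
import tokenize
import unittest

class RoundTripCommentNoNewlineTest(unittest.TestCase):
    def test_comment_without_trailing_newline(self):
        # Tokenizing and untokenizing a comment-only source that lacks
        # a trailing newline should reproduce the source exactly.
        source = '#'
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        self.assertEqual(tokenize.untokenize(tokens), source)

if __name__ == '__main__':
    unittest.main()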
I am surprised that removing the newline character adds a token:

>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#\n').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#\n'),
TokenInfo(type=56 (NL), string='\n', start=(1, 1), end=(1, 2), line='#\n'),
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
>>> pprint.pprint(list(tokenize.generate_tokens(io.StringIO('#').readline)))
[TokenInfo(type=55 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
TokenInfo(type=56 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
It seems to me a bug that if '\n' is not present, tokenize adds both NL and NEWLINE tokens, instead of just one of them. Moreover, both tuples of the doubled correction look wrong: if the NL token is supposed to represent a real character, its length-0 string='' is wrong, and the NEWLINE token's end=(1, 2) points one column past a line that has no character at column 1. ast.dump(ast.parse(s)) returns 'Module(body=[])' for both versions of 's', so no help there.
I am unable to reproduce this on 3.11:

>>> tokenize.untokenize(tokenize.generate_tokens(io.StringIO('#').readline))
'#'
See also bpo-44667: tokenize.py emits spurious NEWLINE if file ends on a comment without a newline.
- Fix test break, since `tokenize.tokenize` has buggy behavior that wasn't backported.
- See python/cpython#79288 and python/cpython#88833.
- Adjust typing usage so everything from typing is prepended with `typing.`.