
tokenize/untokenize roundtrip fails with tabs #64586

Closed
jaraco opened this issue Jan 24, 2014 · 19 comments

jaraco commented Jan 24, 2014

BPO 20387
Nosy @terryjreedy, @jaraco, @benjaminp, @vadmium
Files
  • tokenize.patch: patch for tokenize.py
  • tokenize.patch: fixed patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/jaraco'
    closed_at = <Date 2015-11-19.00:53:07.048>
    created_at = <Date 2014-01-24.22:51:15.090>
    labels = []
    title = 'tokenize/untokenize roundtrip fails with tabs'
    updated_at = <Date 2015-11-19.00:53:07.048>
    user = 'https://github.com/jaraco'

    bugs.python.org fields:

    activity = <Date 2015-11-19.00:53:07.048>
    actor = 'jaraco'
    assignee = 'jaraco'
    closed = True
    closed_date = <Date 2015-11-19.00:53:07.048>
    closer = 'jaraco'
    components = []
    creation = <Date 2014-01-24.22:51:15.090>
    creator = 'jaraco'
    dependencies = []
    files = ['39748', '39763']
    hgrepos = ['313']
    issue_num = 20387
    keywords = ['patch']
    message_count = 19.0
    messages = ['209137', '209946', '209964', '209965', '209967', '245515', '245547', '245563', '245576', '245577', '245613', '245834', '245856', '245908', '245910', '245911', '245912', '245913', '254862']
    nosy_count = 7.0
    nosy_names = ['terry.reedy', 'jaraco', 'benjamin.peterson', 'Arfrever', 'python-dev', 'martin.panter', 'gumblex']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue20387'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']


    jaraco commented Jan 24, 2014

    Consider this simple unit test:

    import io
    from tokenize import generate_tokens, untokenize

    def test_tokenize():
        input = "if False:\n\tx=3\n\ty=3\n"
        g = list(generate_tokens(io.StringIO(input).readline))
        assert untokenize(g) == input

    According to the docs, untokenize guarantees the output equals the input to tokenize: http://docs.python.org/2/library/tokenize.html#tokenize.untokenize

    As this test fails on Python 2 and Python 3 (use io.BytesIO on Python 2), I believe the test illustrates a violation of this guarantee. The second line in the if block gets its tab replaced by a space.

    As the input is valid Python syntax, this behavior is particularly surprising.

    terryjreedy commented

    Whitespace equivalence is explicitly disclaimed: "The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change." The assert is not a valid test. I think you should close this. (Note that there are several issues for bugs where untokenize produces semantically different text.)

    terryjreedy commented

    I read the manual more carefully and noticed that the guarantee is that tokenizing the result of untokenize matches the input to untokenize: "The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured."

    I ran the test in 3.4 and noticed that the exactness and direction of the matching do not matter here, because only the second \t is replaced by a space, producing invalid code that raises an IndentationError.

    So this is definitely a bug and I withdraw the close suggestion.

    I believe the test should be that both lines of the body begin with the same indent.

    import io
    from tokenize import tokenize, untokenize

    def test_tab_indent(self):
        code = b"if False:\n\tx=3\n\ty=3\n"
        codelines = untokenize(tokenize(io.BytesIO(code).readline)).split(b'\n')
        self.assertEqual(codelines[1], codelines[2])

    Of course, the test of tokenize and untokenize should be separated.
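    One way to exercise untokenize on its own is to feed it a hand-built (type, string) stream rather than tokenize output. A minimal sketch (the exact output spacing is untokenize's choice):

    from token import NAME, NEWLINE, NUMBER, OP
    from tokenize import untokenize

    # 2-tuples select the compatibility path, which needs no positions.
    toks = [(NAME, 'x'), (OP, '='), (NUMBER, '3'), (NEWLINE, '\n')]
    print(repr(untokenize(toks)))  # e.g. 'x =3 \n'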

    terryjreedy commented

    I think the problem is with untokenize.

    import io
    from tokenize import tokenize

    s = b"if False:\n\tx=3\n\ty=3\n"
    t = tokenize(io.BytesIO(s).readline)
    for i in t:
        print(i)

    produces a token stream that seems correct.

    TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
    TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
    TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
    TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
    TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
    TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
    TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
    TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
    TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
    TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
    TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
    TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
    TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

    The problem with untokenize and indents is this: in the old untokenize duples function, now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problems of returning a string instead of bytes (which is actually how I think it should be) and of adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and in this one:

    s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
    t = tokenize(io.BytesIO(s).readline)
    print(untokenize(i[:2] for i in t).encode())

    which prints:

    b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'

    When tokenize was changed to produce 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of add_whitespace(). So this issue is a regression due to inadequate testing.
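
    A minimal reproduction of the regression (a sketch of the behavior described in this thread, run on an interpreter without the fix):

    import io
    from tokenize import tokenize, untokenize

    s = b"if False:\n\tx=3\n\ty=3\n"
    print(untokenize(tokenize(io.BytesIO(s).readline)))
    # expected:           b'if False:\n\tx=3\n\ty=3\n'
    # observed (pre-fix): the second '\t' comes back as a single space,
    # so the result raises IndentationError when compiled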

    terryjreedy commented

    The untokenize docstring has a stronger guarantee, and in the direction you were claiming. "Round-trip invariant for full input: Untokenized source will match input source exactly". For this to be true, the indent strings must be saved and not replaced by spaces.

    terryjreedy commented

    bpo-24447, closed as duplicate, has another example.

    gumblex commented Jun 20, 2015

    Sorry for the inconvenience. I failed to find this old bug.

    I think there is another problem. The docs for untokenize say: "The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored." So if I feed in, say, a 3-tuple, untokenize should accept it, using only tok[:2].
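
    A sketch of the contract being cited (2-tuples already work via the compatibility path; the question is whether longer sequences should likewise be truncated to tok[:2]):

    import io
    from tokenize import generate_tokens, untokenize

    source = "x = 1\n"
    # Truncating each token to (type, string) is accepted today:
    pairs = [tok[:2] for tok in generate_tokens(io.StringIO(source).readline)]
    print(untokenize(pairs))
    # Per the quoted docs, a (type, string, start) 3-tuple should behave
    # the same, since extra elements are "ignored"; that is the discrepancy.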

    The attached patch should have addressed the problems above.

    While making the patch, I found a tokenize bug. Consider the newly attached tab.py: the tabs between comments and code, and the tabs between expressions, are lost, so when untokenizing, position information is used to produce equivalent spaces instead of tabs.

    Despite that tokenization problem, the patch produces syntactically correct code that stays as close to the original as it can.

    PEP 8 recommends spaces for indentation, but tabs are still in use and should not be ignored.

    new tab.py (in Python string):

    '#!/usr/bin/env python\n# -- coding: utf-8 --\n\ndef foo():\n\t"""\n\tTests tabs in tokenization\n\t\tfoo\n\t"""\n\tpass\n\tpass\n\tif 1:\n\t\t# not indent correctly\n\t\tpass\n\t\t# correct\ttab\n\t\tpass\n\tpass\n\tbaaz = {\'a\ttab\':\t1,\n\t\t\t\'b\': 2}\t\t# also fails\n\npass\n#if 2:\n\t#pass\n#pass\n'


    jaraco commented Jun 20, 2015

    @gumblex: This is a good start. It certainly provides a candidate implementation.

    First, can I suggest that you remove the changes pertinent to the "at least two elements" and address that in a separate ticket/discussion?

    Second, any patch will necessarily need to include a test that fails before the patch and passes after it.

    Third, and least important, the proposed implementation adds a lot of branching and cognitive complexity to an already complicated function. I'd like to work on the implementation to see if we can make it easier to comprehend. That said, practicality beats purity, so if it passes the test, then it's surely acceptable.


    jaraco commented Jun 20, 2015

    I've created a repo clone, added a version of Terry's test to it, and will now test Dingyuan's patch.

    jaraco self-assigned this Jun 20, 2015

    jaraco commented Jun 21, 2015

    I've committed the patch without the change for "at least two elements" as https://bitbucket.org/jaraco/cpython-issue20387/commits/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2.

    The patch makes the new test pass, demonstrating its effectiveness, but yields two new test failures (apparent regressions):

    $ ./python.exe Lib/test/test_tokenize.py 
    **********************************************************************
    File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        roundtrip("try: import somemodule\n"
                  "except ImportError: # comment\n"
                  "    print('Can not import' # comment2\n)"
                  "else:   print('Loaded')\n")
    Exception raised:
        Traceback (most recent call last):
          File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
            compileflags, 1), test.globs)
          File "<doctest test.test_tokenize.__test__.doctests[14]>", line 1, in <module>
            roundtrip("try: import somemodule\n"
          File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
            bytes_from5 = untokenize(tokens5)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
            out = ut.untokenize(iterable)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
            self.add_whitespace(start)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
            .format(row, col, self.prev_row, self.prev_col))
        ValueError: start (4,1) precedes previous end (4,4)
    **********************************************************************
    File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        for testfile in testfiles:
            if not roundtrip(open(testfile, 'rb')):
                print("Roundtrip failed for file %s" % testfile)
                break
        else: True
    Exception raised:
        Traceback (most recent call last):
          File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
            compileflags, 1), test.globs)
          File "<doctest test.test_tokenize.__test__.doctests[70]>", line 2, in <module>
            if not roundtrip(open(testfile, 'rb')):
          File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
            bytes_from5 = untokenize(tokens5)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
            out = ut.untokenize(iterable)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
            self.add_whitespace(start)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
            .format(row, col, self.prev_row, self.prev_col))
        ValueError: start (73,8) precedes previous end (73,12)
    **********************************************************************
    1 items had failures:
       2 of  74 in test.test_tokenize.__test__.doctests
    ***Test Failed*** 2 failures.
    Traceback (most recent call last):
      File "Lib/test/test_tokenize.py", line 1261, in <module>
        test_main()
      File "Lib/test/test_tokenize.py", line 1252, in test_main
        support.run_doctest(test_tokenize, True)
      File "/public/cpython-issue20387/Lib/test/support/__init__.py", line 1854, in run_doctest
        raise TestFailed("%d of %d doctests failed" % (f, t))
    test.support.TestFailed: 2 of 79 doctests failed

    @gumblex, Can you review the failures and address them?

    gumblex commented Jun 22, 2015

    The new patch should now pass all tests correctly.

    The main idea is:

    • If the token is INDENT, push it on the indents stack and continue.
    • If a new line starts and the position of the first token is >= the length of the last indent level, we assume the indent is contained in the leading whitespace (see the sketch below).
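
    A rough sketch of that logic (paraphrased from the description above, not the patch itself; tokens and emit() are stand-ins for the real untokenize state):

    from tokenize import INDENT, DEDENT, NEWLINE, NL

    indents = []           # stack of literal indent strings ('\t', '    ', ...)
    startline = False
    for tok_type, token, start, end, line in tokens:
        if tok_type == INDENT:
            indents.append(token)   # remember the exact indent text
            continue
        elif tok_type == DEDENT:
            indents.pop()
            continue
        elif tok_type in (NEWLINE, NL):
            startline = True        # the next token begins a fresh line
        elif startline and indents:
            indent = indents[-1]
            # A first token starting at or beyond the indent width means
            # the indent is part of the leading whitespace: re-emit it.
            if start[1] >= len(indent):
                emit(indent)
            startline = False
        emit(token)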

    The new test_tokenize.py fails:

    https://bitbucket.org/jaraco/cpython-issue20387/src/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2/Lib/test/test_tokenize.py?at=3.4#cl-1244

    Line 1244 should be:
    codelines = self.roundtrip(code).split(b'\n')

    It seems that the tokens generated by tokenize.tokenize don't contain enough information to restore the original file.

    • Tabs between tokens are not preserved.
    • Spaces before a backslash line continuation are not preserved.
    (From test/tokenize_tests.txt)
     # Backslash means line continuation:
    -x = 1 \
    +x = 1\
     + 1

    My roundtrip test code copied here from bpo-24447:

    python2 -c 'import sys, tokenize; sys.stdout.write(tokenize.untokenize(tokenize.generate_tokens(sys.stdin.readline)))'
    python3 -c 'import sys, tokenize; sys.stdout.buffer.write(tokenize.untokenize(tokenize.tokenize(sys.stdin.buffer.readline)))'


    jaraco commented Jun 26, 2015

    @gumblex, I've applied your updated patch (though I couldn't figure out why it wouldn't apply mechanically; I had to paste it). I also corrected the test. Thanks for the advice on that.

    I don't understand the second half of your message. Are you stating caveats for our edification? Do you have outstanding concerns? Can you confirm your original report is corrected by the latest tip in the patch repository?

    I believe what remains to be done is to add a NEWS entry and merge the changes with the 3.4, 3.5, and default branches. If a 2.7 backport is required, that could be done also.

    gumblex commented Jun 26, 2015

    I mean that the patch only restores tabs in indentation; the reports above should be fixed by it.
    Tabs between tokens and other such corner cases can't be restored exactly given only the token stream, though this doesn't affect the syntax. I wonder whether that is also a bug or a won't-fix for the tokenize/untokenize roundtrip, because fixing it would require adding whitespace information to the token stream.
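
    A small illustration of what is lost (only column positions survive tokenization, so inter-token tabs come back as spaces):

    import io
    from tokenize import generate_tokens, untokenize

    src = "x\t=\t1\n"
    out = untokenize(generate_tokens(io.StringIO(src).readline))
    print(repr(out))  # 'x = 1\n': the '\t' separators are replaced
                      # by spaces, one per column offset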

    (I can't pull down the whole hg repo, so the patch was created manually by diffing two versions.)

    python-dev commented Jun 28, 2015

    New changeset b784c842a63c by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Add test capturing failure to roundtrip indented code in tokenize module.
    https://hg.python.org/cpython/rev/b784c842a63c

    New changeset 49323e5f6391 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Correct test to properly capture expectation.
    https://hg.python.org/cpython/rev/49323e5f6391

    New changeset ff47efeeed48 by Dingyuan Wang in branch '3.4':
    Issue bpo-20387: Restore retention of indentation during untokenize.
    https://hg.python.org/cpython/rev/ff47efeeed48

    New changeset 4856ae883041 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Update Misc/NEWS
    https://hg.python.org/cpython/rev/4856ae883041

    New changeset 330e28b28334 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Merge patch and test
    https://hg.python.org/cpython/rev/330e28b28334

    New changeset 9ce5c1f371f7 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Merge
    https://hg.python.org/cpython/rev/9ce5c1f371f7

    New changeset 98380a6e037c by Jason R. Coombs in branch '3.5':
    Issue bpo-20387: Merge test and patch from 3.4.4
    https://hg.python.org/cpython/rev/98380a6e037c

    New changeset f2f5d1c928eb by Jason R. Coombs in branch 'default':
    Issue bpo-20387: Merge with 3.5
    https://hg.python.org/cpython/rev/f2f5d1c928eb


    jaraco commented Jun 28, 2015

    Patch and test applied to 3.4+. I'm inclined to backport this to Python 2.7, as that was where I encountered it originally.


    jaraco commented Jun 28, 2015

    Benjamin, any objections to a backport of this patch?

    python-dev commented Jun 28, 2015

    New changeset 524a0e755797 by Jason R. Coombs in branch '2.7':
    Issue bpo-20387: Backport test from Python 3.4
    https://hg.python.org/cpython/rev/524a0e755797

    New changeset cb9df1ae287b by Jason R. Coombs in branch '2.7':
    Issue bpo-20387: Backport fix from Python 3.4
    https://hg.python.org/cpython/rev/cb9df1ae287b


    jaraco commented Jun 28, 2015

    For the sake of expediency, I've gone ahead and backported and pushed the fix to 2.7. Please back out the changes if appropriate.


    vadmium commented Nov 18, 2015

    It seems the problem with tabs in the indentation is fixed. Can we close this?

    For the problem with tabs between tokens in a line, I think that would be better handled in a separate report. But I suspect it is at worst a documentation problem: you can't guarantee reconstructing the exact source from the tokens alone.
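
    For the record, the roundtrip from the original report holds once the fix is in; a quick check (assuming an interpreter with the patches above applied):

    import io
    from tokenize import tokenize, untokenize

    s = b"if False:\n\tx=3\n\ty=3\n"
    assert untokenize(tokenize(io.BytesIO(s).readline)) == s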

    jaraco closed this as completed Nov 19, 2015
    ezio-melotti transferred this issue from another repository Apr 10, 2022