classification
Title: tokenize/untokenize roundtrip fails with tabs
Type: Stage: patch review
Components: Versions: Python 3.5, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jason.coombs Nosy List: Arfrever, benjamin.peterson, gumblex, jason.coombs, martin.panter, python-dev, terry.reedy
Priority: normal Keywords: patch

Created on 2014-01-24 22:51 by jason.coombs, last changed 2015-11-19 00:53 by jason.coombs. This issue is now closed.

Files
File name Uploaded Description Edit
tokenize.patch gumblex, 2015-06-20 07:13 patch for tokenize.py
tokenize.patch gumblex, 2015-06-22 02:33 fixed patch
Repositories containing patches
https://bitbucket.org/jaraco/cpython-issue20387#3.4
Messages (19)
msg209137 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2014-01-24 22:51
Consider this simple unit test:

def test_tokenize():
    input = "if False:\n\tx=3\n\ty=3\n"
    g = list(generate_tokens(io.StringIO(input).readline))
    assert untokenize(g) == input

According to the docs, untokenize guarantees the output equals the input to tokenize: http://docs.python.org/2/library/tokenize.html#tokenize.untokenize

As this test fails on Python 2 and Python 3 (use io.BytesIO on Python 2), I believe the test illustrates a violation of this guarantee. The second line in the if block gets its tab replaced by a space.

As the input is valid Python Syntax, this behavior is particularly surprising.
msg209946 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-02-02 04:42
Whitespace equivalence is explicitly disclaimed. "The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change." The assert is not a valid test. I think you should close this. (Note that there are several issues for bugs where untokenize results in a sematically different text.)
msg209964 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-02-02 08:30
I read the manual more carefully and noticed that the guarantee is that tokenizing the result of untokenize matches the input to untokenize. " The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured."

I ran the test in 3.4 and noticed the exactness and direction of the matching does not matter because only the 2nd \t is replaced by a space, making the invalid code that raises IndentationError.

So this is definitely a bug and I withdraw the close suggestion.

I believe the test should be that both lines of the body begin with the same indent.

from tokenize import tokenize, untokenize
import io

    def test_tab_indent(self):
        code = b"if False:\n\tx=3\n\ty=3\n"
        codelines = untokenize(tokenize(io.BytesIO(s).readline)).split(\n)
        assertEqual(codelines[1], codelines[2])

Of course, the test of tokenize and untokenize should be separated.
msg209965 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-02-02 10:08
I think the problem is with untokenize.

s =b"if False:\n\tx=3\n\ty=3\n"
t = tokenize(io.BytesIO(s).readline)
for i in t: print(i)

produces a token stream that seems correct.

TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The problem with untokenize and indents is this: In the old untokenize duples function, now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problem of returning a string instead of bytes (which is actually as I think it should be) and adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and this:

s =b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
t = tokenize(io.BytesIO(s).readline)
print(untokenize(i[:2] for i in t).encode())
>>> 
b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'

When tokenize was changed to producing 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of .add_space(). So this issue is a regression due in inadequate testing.
msg209967 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-02-02 10:30
The untokenize docstring has a stronger guarantee, and in the direction you were claiming. "Round-trip invariant for full input: Untokenized source will match input source exactly". For this to be true, the indent strings must be saved and not replaced by spaces.
msg245515 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-06-19 19:33
#24447, closed as duplicate, has another example.
msg245547 - (view) Author: Dingyuan Wang (gumblex) * Date: 2015-06-20 07:13
Sorry for the inconvenience. I failed to find this old bug.

I think there is another problem. The docs of `untokenize` said "The iterable must return sequences with **at least** two elements, the token type and the token string. Any additional sequence elements are ignored.", so if I feed in, say, a 3-tuple, the untokenize should accept it as tok[:2].

The attached patch should have addressed the problems above. 

When trying to make a patch, a tokenize bug was found. Consider the new attached tab.py, the tabs between comments and code, and the tabs between expressions are lost, so when untokenizing, position information is used to produce equivalent spaces, instead of tabs.

Despite the tokenization problem, the patch can produce syntactically correct code as accurately as it can.

The PEP 8 recommends spaces for indentation, but the usage of tabs should not be ignored.

new tab.py (in Python string):

'#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\ndef foo():\n\t"""\n\tTests tabs in tokenization\n\t\tfoo\n\t"""\n\tpass\n\tpass\n\tif 1:\n\t\t# not indent correctly\n\t\tpass\n\t\t# correct\ttab\n\t\tpass\n\tpass\n\tbaaz = {\'a\ttab\':\t1,\n\t\t\t\'b\': 2}\t\t# also fails\n\npass\n#if 2:\n\t#pass\n#pass\n'
msg245563 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-20 15:14
@gumblex: This is a good start. It certainly provides a candidate implementation.

First, can I suggest that you remove the changes pertinent to the "at least two elements" and address that in a separate ticket/discussion?

Second, any patch will necessarily need to include a test that fails before the patch and passes after the test.

Third, and least important, the proposed implementation adds a lot of branching and cognitive complexity to an already complicated function. I'd like to work on the implementation to see if we can make it easier to comprehend. That said, practicality beats purity, so if it passes the test, then it's surely acceptable.
msg245576 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-20 23:58
I've created a repo clone and have added a version of Terry's test to it and will now test Dingyuan's patch.
msg245577 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-21 00:14
I've committed the patch without the change for "at least two elements" as https://bitbucket.org/jaraco/cpython-issue20387/commits/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2.

The patch corrects the new test, demonstrating its effectiveness, but yields two new test failures (apparent regressions):

$ ./python.exe Lib/test/test_tokenize.py 
**********************************************************************
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
Failed example:
    roundtrip("try: import somemodule\n"
              "except ImportError: # comment\n"
              "    print('Can not import' # comment2\n)"
              "else:   print('Loaded')\n")
Exception raised:
    Traceback (most recent call last):
      File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
        compileflags, 1), test.globs)
      File "<doctest test.test_tokenize.__test__.doctests[14]>", line 1, in <module>
        roundtrip("try: import somemodule\n"
      File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
        bytes_from5 = untokenize(tokens5)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
        out = ut.untokenize(iterable)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
        self.add_whitespace(start)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
        .format(row, col, self.prev_row, self.prev_col))
    ValueError: start (4,1) precedes previous end (4,4)
**********************************************************************
File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
Failed example:
    for testfile in testfiles:
        if not roundtrip(open(testfile, 'rb')):
            print("Roundtrip failed for file %s" % testfile)
            break
    else: True
Exception raised:
    Traceback (most recent call last):
      File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
        compileflags, 1), test.globs)
      File "<doctest test.test_tokenize.__test__.doctests[70]>", line 2, in <module>
        if not roundtrip(open(testfile, 'rb')):
      File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
        bytes_from5 = untokenize(tokens5)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
        out = ut.untokenize(iterable)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
        self.add_whitespace(start)
      File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
        .format(row, col, self.prev_row, self.prev_col))
    ValueError: start (73,8) precedes previous end (73,12)
**********************************************************************
1 items had failures:
   2 of  74 in test.test_tokenize.__test__.doctests
***Test Failed*** 2 failures.
Traceback (most recent call last):
  File "Lib/test/test_tokenize.py", line 1261, in <module>
    test_main()
  File "Lib/test/test_tokenize.py", line 1252, in test_main
    support.run_doctest(test_tokenize, True)
  File "/public/cpython-issue20387/Lib/test/support/__init__.py", line 1854, in run_doctest
    raise TestFailed("%d of %d doctests failed" % (f, t))
test.support.TestFailed: 2 of 79 doctests failed

@gumblex, Can you review the failures and address them?
msg245613 - (view) Author: Dingyuan Wang (gumblex) * Date: 2015-06-22 02:33
The new patch should now pass all tests correctly.

The main idea is:
* if the token is INDENT, push it on the `indents` stack and continue
* if a new line starts, AND the position of the first token >= the length of the last indent level, we assume the indent is contained in the leading whitespaces.

The new test_tokenize.py fails:

https://bitbucket.org/jaraco/cpython-issue20387/src/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2/Lib/test/test_tokenize.py?at=3.4#cl-1244

Line 1244 should be:
codelines = self.roundtrip(code).split(b'\n')

It seems that the tokens generated by tokenize.tokenize don't contain enough information to restore the original file.

* Tabs between tokens are not preserved.
* Space before backslash as line continuation are not preserved.

(From test/tokenize_tests.txt)
 # Backslash means line continuation:
-x = 1 \
+x = 1\
 + 1

My roundtrip test code copied here from #24447:

python2 -c 'import sys, tokenize; sys.stdout.write(tokenize.untokenize(tokenize.generate_tokens(sys.stdin.readline)))'
python3 -c 'import sys, tokenize; sys.stdout.buffer.write(tokenize.untokenize(tokenize.tokenize(sys.stdin.buffer.readline)))'
msg245834 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-26 03:12
@gumblex, I've applied your updated patch (though I couldn't figure out why it wouldn't apply mechanically; I had to paste it). I also corrected the test. Thanks for the advice on that.

I don't understand the second half of your message. Are you stating caveats for our edification? Do you have outstanding concerns? Can you confirm your original report is corrected by the latest tip in the patch repository?

I believe what remains to be done is to add a NEWS entry and merge the changes with the 3.4, 3.5, and default branches. If a 2.7 backport is required, that could be done also.
msg245856 - (view) Author: Dingyuan Wang (gumblex) * Date: 2015-06-26 15:21
I mean the patch only restores tabs in indentation. The reports above should be corrected.
Tabs between tokens and other race conditions can't be restored exactly providing the token stream. This won't affect the syntax. I wonder if it's also a bug or a wont-fix in tokenize/untokenize roundtrip because fixing it involves adding whitespace information in the token stream.

(I can't pull down the whole hg repo, so the patch is manually created by just diffing two versions.)
msg245908 - (view) Author: Roundup Robot (python-dev) Date: 2015-06-28 15:17
New changeset b784c842a63c by Jason R. Coombs in branch '3.4':
Issue #20387: Add test capturing failure to roundtrip indented code in tokenize module.
https://hg.python.org/cpython/rev/b784c842a63c

New changeset 49323e5f6391 by Jason R. Coombs in branch '3.4':
Issue #20387: Correct test to properly capture expectation.
https://hg.python.org/cpython/rev/49323e5f6391

New changeset ff47efeeed48 by Dingyuan Wang in branch '3.4':
Issue #20387: Restore retention of indentation during untokenize.
https://hg.python.org/cpython/rev/ff47efeeed48

New changeset 4856ae883041 by Jason R. Coombs in branch '3.4':
Issue #20387: Update Misc/NEWS
https://hg.python.org/cpython/rev/4856ae883041

New changeset 330e28b28334 by Jason R. Coombs in branch '3.4':
Issue #20387: Merge patch and test
https://hg.python.org/cpython/rev/330e28b28334

New changeset 9ce5c1f371f7 by Jason R. Coombs in branch '3.4':
Issue #20387: Merge
https://hg.python.org/cpython/rev/9ce5c1f371f7

New changeset 98380a6e037c by Jason R. Coombs in branch '3.5':
Issue #20387: Merge test and patch from 3.4.4
https://hg.python.org/cpython/rev/98380a6e037c

New changeset f2f5d1c928eb by Jason R. Coombs in branch 'default':
Issue #20387: Merge with 3.5
https://hg.python.org/cpython/rev/f2f5d1c928eb
msg245910 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-28 15:47
Patch and test applied to 3.4+. I'm inclined to backport this to Python 2.7, as that was where I encountered it originally.
msg245911 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-28 15:49
Benjamin, any objections to a backport of this patch?
msg245912 - (view) Author: Roundup Robot (python-dev) Date: 2015-06-28 17:09
New changeset 524a0e755797 by Jason R. Coombs in branch '2.7':
Issue #20387: Backport test from Python 3.4
https://hg.python.org/cpython/rev/524a0e755797

New changeset cb9df1ae287b by Jason R. Coombs in branch '2.7':
Issue #20387: Backport fix from Python 3.4
https://hg.python.org/cpython/rev/cb9df1ae287b
msg245913 - (view) Author: Jason R. Coombs (jason.coombs) * (Python committer) Date: 2015-06-28 17:11
For the sake of expediency, I've gone ahead and backported and pushed the fix to 2.7. Please back out the changes if appropriate.
msg254862 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-11-18 23:07
It seems the problem with tabs in the indentation is fixed. Can we close this?

For the problem with tabs between tokens in a line, I think that would be better handled in separate report. But I suspect it is at worst a documentation problem. You can’t guarantee that you can reconstruct the exact source based off just the tokenized tokens.
History
Date User Action Args
2015-11-19 00:53:07jason.coombssetstatus: open -> closed
2015-11-18 23:07:58martin.pantersetnosy: + martin.panter
messages: + msg254862
2015-06-28 17:11:10jason.coombssetresolution: fixed
messages: + msg245913
2015-06-28 17:09:01python-devsetmessages: + msg245912
2015-06-28 15:49:17jason.coombssetnosy: + benjamin.peterson
messages: + msg245911
2015-06-28 15:47:54jason.coombssetmessages: + msg245910
versions: + Python 3.5, - Python 3.3
2015-06-28 15:17:53python-devsetnosy: + python-dev
messages: + msg245908
2015-06-26 15:21:44gumblexsetmessages: + msg245856
2015-06-26 03:12:26jason.coombssetmessages: + msg245834
stage: needs patch -> patch review
2015-06-22 02:33:13gumblexsetfiles: + tokenize.patch

messages: + msg245613
2015-06-21 00:14:44jason.coombssetmessages: + msg245577
2015-06-20 23:58:18jason.coombssetassignee: jason.coombs
messages: + msg245576
hgrepos: + hgrepo313
2015-06-20 15:14:02jason.coombssetmessages: + msg245563
2015-06-20 07:13:46gumblexsetfiles: + tokenize.patch

nosy: + gumblex
messages: + msg245547

keywords: + patch
2015-06-19 19:33:18terry.reedysetmessages: + msg245515
2015-06-19 19:32:43terry.reedylinkissue24447 superseder
2014-02-02 10:30:38terry.reedysetmessages: + msg209967
2014-02-02 10:08:37terry.reedysetmessages: + msg209965
2014-02-02 08:30:17terry.reedysetstage: needs patch
messages: + msg209964
versions: + Python 3.3
2014-02-02 04:42:54terry.reedysetnosy: + terry.reedy
messages: + msg209946
2014-01-24 23:08:28Arfreversetnosy: + Arfrever
2014-01-24 22:51:15jason.coombscreate