
tokenize/untokenize roundtrip fails with tabs #64586

Closed
jaraco opened this issue Jan 24, 2014 · 19 comments

jaraco commented Jan 24, 2014

BPO 20387
Nosy @terryjreedy, @jaraco, @benjaminp, @vadmium
Files
  • tokenize.patch: patch for tokenize.py
  • tokenize.patch: fixed patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/jaraco'
    closed_at = <Date 2015-11-19.00:53:07.048>
    created_at = <Date 2014-01-24.22:51:15.090>
    labels = []
    title = 'tokenize/untokenize roundtrip fails with tabs'
    updated_at = <Date 2015-11-19.00:53:07.048>
    user = 'https://github.com/jaraco'

    bugs.python.org fields:

    activity = <Date 2015-11-19.00:53:07.048>
    actor = 'jaraco'
    assignee = 'jaraco'
    closed = True
    closed_date = <Date 2015-11-19.00:53:07.048>
    closer = 'jaraco'
    components = []
    creation = <Date 2014-01-24.22:51:15.090>
    creator = 'jaraco'
    dependencies = []
    files = ['39748', '39763']
    hgrepos = ['313']
    issue_num = 20387
    keywords = ['patch']
    message_count = 19.0
    messages = ['209137', '209946', '209964', '209965', '209967', '245515', '245547', '245563', '245576', '245577', '245613', '245834', '245856', '245908', '245910', '245911', '245912', '245913', '254862']
    nosy_count = 7.0
    nosy_names = ['terry.reedy', 'jaraco', 'benjamin.peterson', 'Arfrever', 'python-dev', 'martin.panter', 'gumblex']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue20387'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']


    jaraco commented Jan 24, 2014

    Consider this simple unit test:

    import io
    from tokenize import generate_tokens, untokenize

    def test_tokenize():
        input = "if False:\n\tx=3\n\ty=3\n"
        g = list(generate_tokens(io.StringIO(input).readline))
        assert untokenize(g) == input

    According to the docs, untokenize guarantees the output equals the input to tokenize: http://docs.python.org/2/library/tokenize.html#tokenize.untokenize

    As this test fails on Python 2 and Python 3 (use io.BytesIO on Python 2), I believe the test illustrates a violation of this guarantee. The second line in the if block gets its tab replaced by a space.

    As the input is valid Python syntax, this behavior is particularly surprising.

    terryjreedy commented

    Whitespace equivalence is explicitly disclaimed: "The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change." The assert is not a valid test. I think you should close this. (Note that there are several issues for bugs where untokenize produces semantically different text.)

    terryjreedy commented

    I read the manual more carefully and noticed that the guarantee is that tokenizing the result of untokenize matches the input to untokenize: "The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured."

    I ran the test in 3.4 and noticed that the exactness and direction of the matching do not matter here, because only the second \t is replaced by a space, producing invalid code that raises an IndentationError.

    So this is definitely a bug and I withdraw the close suggestion.

    I believe the test should be that both lines of the body begin with the same indent.

    import io
    from tokenize import tokenize, untokenize

    def test_tab_indent(self):
        code = b"if False:\n\tx=3\n\ty=3\n"
        codelines = untokenize(tokenize(io.BytesIO(code).readline)).split(b'\n')
        self.assertEqual(codelines[1], codelines[2])

    Of course, the test of tokenize and untokenize should be separated.
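    One way to exercise untokenize on its own is to feed it a hand-built (type, string) stream rather than tokenize output. A minimal sketch (the exact output spacing is untokenize's choice):

    from token import NAME, NEWLINE, NUMBER, OP
    from tokenize import untokenize

    # 2-tuples select the compatibility path, which needs no positions.
    toks = [(NAME, 'x'), (OP, '='), (NUMBER, '3'), (NEWLINE, '\n')]
    print(repr(untokenize(toks)))  # e.g. 'x =3 \n'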

    terryjreedy commented

    I think the problem is with untokenize.

    import io
    from tokenize import tokenize

    s = b"if False:\n\tx=3\n\ty=3\n"
    t = tokenize(io.BytesIO(s).readline)
    for i in t:
        print(i)

    produces a token stream that seems correct.

    TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
    TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
    TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
    TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
    TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
    TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
    TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
    TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
    TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
    TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
    TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
    TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
    TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

    The problem with untokenize and indents is this: in the old untokenize duples function, now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problems of returning a string instead of bytes (which is actually how I think it should be) and of adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and in this one:

    s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
    t = tokenize(io.BytesIO(s).readline)
    print(untokenize(i[:2] for i in t).encode())

    which prints:

    b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'

    When tokenize was changed to produce 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of add_whitespace(). So this issue is a regression due to inadequate testing.
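
    A minimal reproduction of the regression (a sketch of the behavior described in this thread, run on an interpreter without the fix):

    import io
    from tokenize import tokenize, untokenize

    s = b"if False:\n\tx=3\n\ty=3\n"
    print(untokenize(tokenize(io.BytesIO(s).readline)))
    # expected:           b'if False:\n\tx=3\n\ty=3\n'
    # observed (pre-fix): the second '\t' comes back as a single space,
    # so the result raises IndentationError when compiled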

    terryjreedy commented

    The untokenize docstring has a stronger guarantee, and in the direction you were claiming. "Round-trip invariant for full input: Untokenized source will match input source exactly". For this to be true, the indent strings must be saved and not replaced by spaces.

    terryjreedy commented

    bpo-24447, closed as duplicate, has another example.

    gumblex commented Jun 20, 2015

    Sorry for the inconvenience. I failed to find this old bug.

    I think there is another problem. The docs for untokenize say: "The iterable must return sequences with at least two elements, the token type and the token string. Any additional sequence elements are ignored." So if I feed in, say, a 3-tuple, untokenize should accept it, using only tok[:2].
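
    A sketch of the contract being cited (2-tuples already work via the compatibility path; the question is whether longer sequences should likewise be truncated to tok[:2]):

    import io
    from tokenize import generate_tokens, untokenize

    source = "x = 1\n"
    # Truncating each token to (type, string) is accepted today:
    pairs = [tok[:2] for tok in generate_tokens(io.StringIO(source).readline)]
    print(untokenize(pairs))
    # Per the quoted docs, a (type, string, start) 3-tuple should behave
    # the same, since extra elements are "ignored"; that is the discrepancy.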

    The attached patch should have addressed the problems above.

    While making the patch, I found a tokenize bug. Consider the newly attached tab.py: the tabs between comments and code, and the tabs between expressions, are lost, so when untokenizing, position information is used to produce equivalent spaces instead of tabs.

    Despite that tokenization problem, the patch produces syntactically correct code that stays as close to the original as it can.

    PEP 8 recommends spaces for indentation, but tabs are still in use and should not be ignored.

    new tab.py (in Python string):

    '#!/usr/bin/env python\n# -- coding: utf-8 --\n\ndef foo():\n\t"""\n\tTests tabs in tokenization\n\t\tfoo\n\t"""\n\tpass\n\tpass\n\tif 1:\n\t\t# not indent correctly\n\t\tpass\n\t\t# correct\ttab\n\t\tpass\n\tpass\n\tbaaz = {\'a\ttab\':\t1,\n\t\t\t\'b\': 2}\t\t# also fails\n\npass\n#if 2:\n\t#pass\n#pass\n'


    jaraco commented Jun 20, 2015

    @gumblex: This is a good start. It certainly provides a candidate implementation.

    First, can I suggest that you remove the changes pertinent to the "at least two elements" and address that in a separate ticket/discussion?

    Second, any patch will necessarily need to include a test that fails before the patch and passes after it.

    Third, and least important, the proposed implementation adds a lot of branching and cognitive complexity to an already complicated function. I'd like to work on the implementation to see if we can make it easier to comprehend. That said, practicality beats purity, so if it passes the test, then it's surely acceptable.


    jaraco commented Jun 20, 2015

    I've created a repo clone, added a version of Terry's test to it, and will now test Dingyuan's patch.

    jaraco self-assigned this Jun 20, 2015

    jaraco commented Jun 21, 2015

    I've committed the patch without the change for "at least two elements" as https://bitbucket.org/jaraco/cpython-issue20387/commits/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2.

    The patch makes the new test pass, demonstrating its effectiveness, but yields two new test failures (apparent regressions):

    $ ./python.exe Lib/test/test_tokenize.py 
    **********************************************************************
    File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        roundtrip("try: import somemodule\n"
                  "except ImportError: # comment\n"
                  "    print('Can not import' # comment2\n)"
                  "else:   print('Loaded')\n")
    Exception raised:
        Traceback (most recent call last):
          File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
            compileflags, 1), test.globs)
          File "<doctest test.test_tokenize.__test__.doctests[14]>", line 1, in <module>
            roundtrip("try: import somemodule\n"
          File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
            bytes_from5 = untokenize(tokens5)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
            out = ut.untokenize(iterable)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
            self.add_whitespace(start)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
            .format(row, col, self.prev_row, self.prev_col))
        ValueError: start (4,1) precedes previous end (4,4)
    **********************************************************************
    File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        for testfile in testfiles:
            if not roundtrip(open(testfile, 'rb')):
                print("Roundtrip failed for file %s" % testfile)
                break
        else: True
    Exception raised:
        Traceback (most recent call last):
          File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
            compileflags, 1), test.globs)
          File "<doctest test.test_tokenize.__test__.doctests[70]>", line 2, in <module>
            if not roundtrip(open(testfile, 'rb')):
          File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
            bytes_from5 = untokenize(tokens5)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
            out = ut.untokenize(iterable)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
            self.add_whitespace(start)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
            .format(row, col, self.prev_row, self.prev_col))
        ValueError: start (73,8) precedes previous end (73,12)
    **********************************************************************
    1 items had failures:
       2 of  74 in test.test_tokenize.__test__.doctests
    ***Test Failed*** 2 failures.
    Traceback (most recent call last):
      File "Lib/test/test_tokenize.py", line 1261, in <module>
        test_main()
      File "Lib/test/test_tokenize.py", line 1252, in test_main
        support.run_doctest(test_tokenize, True)
      File "/public/cpython-issue20387/Lib/test/support/__init__.py", line 1854, in run_doctest
        raise TestFailed("%d of %d doctests failed" % (f, t))
    test.support.TestFailed: 2 of 79 doctests failed

    @gumblex, Can you review the failures and address them?

    gumblex commented Jun 22, 2015

    The new patch should now pass all tests correctly.

    The main idea is:

    • If the token is INDENT, push it on the indents stack and continue.
    • If a new line starts and the position of the first token is >= the length of the last indent level, we assume the indent is contained in the leading whitespace (see the sketch below).
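
    A rough sketch of that logic (paraphrased from the description above, not the patch itself; tokens and emit() are stand-ins for the real untokenize state):

    from tokenize import INDENT, DEDENT, NEWLINE, NL

    indents = []           # stack of literal indent strings ('\t', '    ', ...)
    startline = False
    for tok_type, token, start, end, line in tokens:
        if tok_type == INDENT:
            indents.append(token)   # remember the exact indent text
            continue
        elif tok_type == DEDENT:
            indents.pop()
            continue
        elif tok_type in (NEWLINE, NL):
            startline = True        # the next token begins a fresh line
        elif startline and indents:
            indent = indents[-1]
            # A first token starting at or beyond the indent width means
            # the indent is part of the leading whitespace: re-emit it.
            if start[1] >= len(indent):
                emit(indent)
            startline = False
        emit(token)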

    The new test_tokenize.py fails:

    https://bitbucket.org/jaraco/cpython-issue20387/src/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2/Lib/test/test_tokenize.py?at=3.4#cl-1244

    Line 1244 should be:
    codelines = self.roundtrip(code).split(b'\n')

    It seems that the tokens generated by tokenize.tokenize don't contain enough information to restore the original file.

    • Tabs between tokens are not preserved.
    • Spaces before a backslash line continuation are not preserved.
    (From test/tokenize_tests.txt)
     # Backslash means line continuation:
    -x = 1 \
    +x = 1\
     + 1

    My roundtrip test code copied here from bpo-24447:

    python2 -c 'import sys, tokenize; sys.stdout.write(tokenize.untokenize(tokenize.generate_tokens(sys.stdin.readline)))'
    python3 -c 'import sys, tokenize; sys.stdout.buffer.write(tokenize.untokenize(tokenize.tokenize(sys.stdin.buffer.readline)))'


    jaraco commented Jun 26, 2015

    @gumblex, I've applied your updated patch (though I couldn't figure out why it wouldn't apply mechanically; I had to paste it). I also corrected the test. Thanks for the advice on that.

    I don't understand the second half of your message. Are you stating caveats for our edification? Do you have outstanding concerns? Can you confirm your original report is corrected by the latest tip in the patch repository?

    I believe what remains to be done is to add a NEWS entry and merge the changes with the 3.4, 3.5, and default branches. If a 2.7 backport is required, that could be done also.

    gumblex commented Jun 26, 2015

    I mean that the patch only restores tabs in indentation; the reports above should be fixed by it.
    Tabs between tokens and other such corner cases can't be restored exactly given only the token stream, though this doesn't affect the syntax. I wonder whether that is also a bug or a won't-fix for the tokenize/untokenize roundtrip, because fixing it would require adding whitespace information to the token stream.
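
    A small illustration of what is lost (only column positions survive tokenization, so inter-token tabs come back as spaces):

    import io
    from tokenize import generate_tokens, untokenize

    src = "x\t=\t1\n"
    out = untokenize(generate_tokens(io.StringIO(src).readline))
    print(repr(out))  # 'x = 1\n': the '\t' separators are replaced
                      # by spaces, one per column offset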

    (I can't pull down the whole hg repo, so the patch was created manually by diffing two versions.)

    python-dev commented Jun 28, 2015

    New changeset b784c842a63c by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Add test capturing failure to roundtrip indented code in tokenize module.
    https://hg.python.org/cpython/rev/b784c842a63c

    New changeset 49323e5f6391 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Correct test to properly capture expectation.
    https://hg.python.org/cpython/rev/49323e5f6391

    New changeset ff47efeeed48 by Dingyuan Wang in branch '3.4':
    Issue bpo-20387: Restore retention of indentation during untokenize.
    https://hg.python.org/cpython/rev/ff47efeeed48

    New changeset 4856ae883041 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Update Misc/NEWS
    https://hg.python.org/cpython/rev/4856ae883041

    New changeset 330e28b28334 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Merge patch and test
    https://hg.python.org/cpython/rev/330e28b28334

    New changeset 9ce5c1f371f7 by Jason R. Coombs in branch '3.4':
    Issue bpo-20387: Merge
    https://hg.python.org/cpython/rev/9ce5c1f371f7

    New changeset 98380a6e037c by Jason R. Coombs in branch '3.5':
    Issue bpo-20387: Merge test and patch from 3.4.4
    https://hg.python.org/cpython/rev/98380a6e037c

    New changeset f2f5d1c928eb by Jason R. Coombs in branch 'default':
    Issue bpo-20387: Merge with 3.5
    https://hg.python.org/cpython/rev/f2f5d1c928eb


    jaraco commented Jun 28, 2015

    Patch and test applied to 3.4+. I'm inclined to backport this to Python 2.7, as that was where I encountered it originally.


    jaraco commented Jun 28, 2015

    Benjamin, any objections to a backport of this patch?

    python-dev commented Jun 28, 2015

    New changeset 524a0e755797 by Jason R. Coombs in branch '2.7':
    Issue bpo-20387: Backport test from Python 3.4
    https://hg.python.org/cpython/rev/524a0e755797

    New changeset cb9df1ae287b by Jason R. Coombs in branch '2.7':
    Issue bpo-20387: Backport fix from Python 3.4
    https://hg.python.org/cpython/rev/cb9df1ae287b


    jaraco commented Jun 28, 2015

    For the sake of expediency, I've gone ahead and backported and pushed the fix to 2.7. Please back out the changes if appropriate.


    vadmium commented Nov 18, 2015

    It seems the problem with tabs in the indentation is fixed. Can we close this?

    For the problem with tabs between tokens in a line, I think that would be better handled in a separate report. But I suspect it is at worst a documentation problem: you can't guarantee reconstructing the exact source from the tokens alone.
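
    For the record, the roundtrip from the original report holds once the fix is in; a quick check (assuming an interpreter with the patches above applied):

    import io
    from tokenize import tokenize, untokenize

    s = b"if False:\n\tx=3\n\ty=3\n"
    assert untokenize(tokenize(io.BytesIO(s).readline)) == s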

    jaraco closed this as completed Nov 19, 2015
    ezio-melotti transferred this issue from another repository Apr 10, 2022