Issue 20387
Created on 2014-01-24 22:51 by jaraco, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files

| File name | Uploaded | Description |
|---|---|---|
| tokenize.patch | gumblex, 2015-06-20 07:13 | patch for tokenize.py |
| tokenize.patch | gumblex, 2015-06-22 02:33 | fixed patch |
Repositories containing patches

- https://bitbucket.org/jaraco/cpython-issue20387#3.4
Messages (19)
msg209137 - Author: Jason R. Coombs (jaraco) - Date: 2014-01-24 22:51

Consider this simple unit test:

    def test_tokenize():
        input = "if False:\n\tx=3\n\ty=3\n"
        g = list(generate_tokens(io.StringIO(input).readline))
        assert untokenize(g) == input

According to the docs, untokenize guarantees the output equals the input to tokenize: http://docs.python.org/2/library/tokenize.html#tokenize.untokenize

As this test fails on Python 2 and Python 3 (use io.BytesIO on Python 2), I believe it illustrates a violation of this guarantee. The second line in the if block gets its tab replaced by a space. As the input is valid Python syntax, this behavior is particularly surprising.
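For reference, a minimal self-contained Python 3 version of the reported test (a sketch; the name `source` is substituted here only to avoid shadowing the `input` builtin):

```python
import io
from tokenize import generate_tokens, untokenize

source = "if False:\n\tx=3\n\ty=3\n"
tokens = list(generate_tokens(io.StringIO(source).readline))
result = untokenize(tokens)
# Before the fix discussed below, the second line's "\t" came back
# as a single space, so this printed False.
print(result == source)
```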
msg209946 - Author: Terry J. Reedy (terry.reedy) - Date: 2014-02-02 04:42

Whitespace equivalence is explicitly disclaimed: "The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change." The assert is not a valid test. I think you should close this. (Note that there are several issues for bugs where untokenize results in a semantically different text.)
msg209964 - Author: Terry J. Reedy (terry.reedy) - Date: 2014-02-02 08:30

I read the manual more carefully and noticed that the guarantee is that tokenizing the result of untokenize matches the input to untokenize: "The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured."

I ran the test in 3.4 and noticed that the exactness and direction of the matching do not matter, because only the 2nd \t is replaced by a space, producing invalid code that raises IndentationError. So this is definitely a bug and I withdraw the close suggestion. I believe the test should be that both lines of the body begin with the same indent:

    from tokenize import tokenize, untokenize
    import io

    def test_tab_indent(self):
        code = b"if False:\n\tx=3\n\ty=3\n"
        codelines = untokenize(tokenize(io.BytesIO(code).readline)).split(b'\n')
        # both body lines should begin with the same indent character
        self.assertEqual(codelines[1][:1], codelines[2][:1])

Of course, the tests of tokenize and untokenize should be separated.
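A quick way to see the breakage Terry describes (a sketch of the pre-fix behavior; on fixed versions the round trip is exact and the compile succeeds):

```python
import io
import tokenize

s = b"if False:\n\tx=3\n\ty=3\n"
out = tokenize.untokenize(tokenize.tokenize(io.BytesIO(s).readline))
# Pre-fix output mixed "\t" and " " indentation across the two body lines,
# so compiling it raised TabError (a subclass of IndentationError).
compile(out, "<untokenized>", "exec")
```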
msg209965 - Author: Terry J. Reedy (terry.reedy) - Date: 2014-02-02 10:08

I think the problem is with untokenize.

    s = b"if False:\n\tx=3\n\ty=3\n"
    t = tokenize(io.BytesIO(s).readline)
    for i in t:
        print(i)

produces a token stream that seems correct:

    TokenInfo(type=56 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
    TokenInfo(type=1 (NAME), string='if', start=(1, 0), end=(1, 2), line='if False:\n')
    TokenInfo(type=1 (NAME), string='False', start=(1, 3), end=(1, 8), line='if False:\n')
    TokenInfo(type=52 (OP), string=':', start=(1, 8), end=(1, 9), line='if False:\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='if False:\n')
    TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tx=3\n')
    TokenInfo(type=1 (NAME), string='x', start=(2, 1), end=(2, 2), line='\tx=3\n')
    TokenInfo(type=52 (OP), string='=', start=(2, 2), end=(2, 3), line='\tx=3\n')
    TokenInfo(type=2 (NUMBER), string='3', start=(2, 3), end=(2, 4), line='\tx=3\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 4), end=(2, 5), line='\tx=3\n')
    TokenInfo(type=1 (NAME), string='y', start=(3, 1), end=(3, 2), line='\ty=3\n')
    TokenInfo(type=52 (OP), string='=', start=(3, 2), end=(3, 3), line='\ty=3\n')
    TokenInfo(type=2 (NUMBER), string='3', start=(3, 3), end=(3, 4), line='\ty=3\n')
    TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 4), end=(3, 5), line='\ty=3\n')
    TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='')
    TokenInfo(type=0 (ENDMARKER), string='', start=(4, 0), end=(4, 0), line='')

The problem with untokenize and indents is this: in the old untokenize duples function, now called 'compat', INDENT strings were added to a list and popped by the corresponding DEDENT. While compat has the minor problem of returning a string instead of bytes (which is actually as I think it should be) and adding extraneous spaces within and at the end of lines, it correctly handles tabs in your example and this:

    s = b"if False:\n\tx=1\n\t\ty=2\n\t\t\tz=3\n"
    t = tokenize(io.BytesIO(s).readline)
    print(untokenize(i[:2] for i in t).encode())

    >>> b'if False :\n\tx =1 \n\t\ty =2 \n\t\t\tz =3 \n'

When tokenize was changed to producing 5-tuples, untokenize was changed to use the start and end coordinates, but all special processing of indents was cut in favor of .add_whitespace(). So this issue is a regression due to inadequate testing.
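For context, a condensed paraphrase of `Untokenizer.add_whitespace` from Lib/tokenize.py of that era (simplified, row handling omitted; the exact code varied across versions). It pads gaps with spaces only, which is why tab indentation cannot survive the coordinate-based path:

```python
def add_whitespace(self, start):
    row, col = start
    if row < self.prev_row or row == self.prev_row and col < self.prev_col:
        raise ValueError("start ({},{}) precedes previous end ({},{})"
                         .format(row, col, self.prev_row, self.prev_col))
    col_offset = col - self.prev_col
    if col_offset:
        self.tokens.append(" " * col_offset)  # always spaces -- tabs are lost here
```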
msg209967 - Author: Terry J. Reedy (terry.reedy) - Date: 2014-02-02 10:30

The untokenize docstring has a stronger guarantee, and in the direction you were claiming: "Round-trip invariant for full input: Untokenized source will match input source exactly". For this to be true, the indent strings must be saved and not replaced by spaces.
msg245515 - Author: Terry J. Reedy (terry.reedy) - Date: 2015-06-19 19:33

#24447, closed as duplicate, has another example.
msg245547 - Author: Dingyuan Wang (gumblex) - Date: 2015-06-20 07:13

Sorry for the inconvenience. I failed to find this old bug.

I think there is another problem. The docs of `untokenize` say: "The iterable must return sequences with **at least** two elements, the token type and the token string. Any additional sequence elements are ignored." So if I feed in, say, a 3-tuple, untokenize should accept it as tok[:2]. The attached patch should have addressed the problems above.

While making the patch, another tokenize bug turned up. Consider the newly attached tab.py: the tabs between comments and code, and the tabs between expressions, are lost, so when untokenizing, position information is used to produce equivalent spaces instead of tabs. Despite the tokenization problem, the patch produces syntactically correct code as accurately as it can. PEP 8 recommends spaces for indentation, but the use of tabs should not be ignored.

new tab.py (as a Python string):

    '#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\ndef foo():\n\t"""\n\tTests tabs in tokenization\n\t\tfoo\n\t"""\n\tpass\n\tpass\n\tif 1:\n\t\t# not indent correctly\n\t\tpass\n\t\t# correct\ttab\n\t\tpass\n\tpass\n\tbaaz = {\'a\ttab\':\t1,\n\t\t\t\'b\': 2}\t\t# also fails\n\npass\n#if 2:\n\t#pass\n#pass\n'
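As background for the "at least two elements" point: untokenize documents two modes, sketched below. Full 5-tuples drive position-exact output, while 2-element sequences fall back to a compat mode whose spacing may differ.

```python
import io
import tokenize

code = "if False:\n\tx=3\n\ty=3\n"
tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))

exact = tokenize.untokenize(tokens)                  # full 5-tuples: round-trip mode
loose = tokenize.untokenize(t[:2] for t in tokens)   # 2-tuples: compat mode, spacing may change
```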
msg245563 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-20 15:14

@gumblex: This is a good start. It certainly provides a candidate implementation.

First, can I suggest that you remove the changes pertinent to the "at least two elements" behavior and address that in a separate ticket/discussion?

Second, any patch will necessarily need to include a test that fails before the patch and passes after it.

Third, and least important, the proposed implementation adds a lot of branching and cognitive complexity to an already complicated function. I'd like to work on the implementation to see if we can make it easier to comprehend. That said, practicality beats purity, so if it passes the test, then it's surely acceptable.
msg245576 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-20 23:58

I've created a repository clone, added a version of Terry's test to it, and will now test Dingyuan's patch.
msg245577 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-21 00:14

I've committed the patch without the change for "at least two elements" as https://bitbucket.org/jaraco/cpython-issue20387/commits/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2. The patch makes the new test pass, demonstrating its effectiveness, but yields two new test failures (apparent regressions):

    $ ./python.exe Lib/test/test_tokenize.py
    **********************************************************************
    File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        roundtrip("try: import somemodule\n"
                  "except ImportError: # comment\n"
                  " print('Can not import' # comment2\n)"
                  "else: print('Loaded')\n")
    Exception raised:
        Traceback (most recent call last):
          File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
            compileflags, 1), test.globs)
          File "<doctest test.test_tokenize.__test__.doctests[14]>", line 1, in <module>
            roundtrip("try: import somemodule\n"
          File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
            bytes_from5 = untokenize(tokens5)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
            out = ut.untokenize(iterable)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
            self.add_whitespace(start)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
            .format(row, col, self.prev_row, self.prev_col))
        ValueError: start (4,1) precedes previous end (4,4)
    **********************************************************************
    File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
    Failed example:
        for testfile in testfiles:
            if not roundtrip(open(testfile, 'rb')):
                print("Roundtrip failed for file %s" % testfile)
                break
        else: True
    Exception raised:
        Traceback (most recent call last):
          File "/public/cpython-issue20387/Lib/doctest.py", line 1318, in __run
            compileflags, 1), test.globs)
          File "<doctest test.test_tokenize.__test__.doctests[70]>", line 2, in <module>
            if not roundtrip(open(testfile, 'rb')):
          File "/public/cpython-issue20387/Lib/test/test_tokenize.py", line 698, in roundtrip
            bytes_from5 = untokenize(tokens5)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 339, in untokenize
            out = ut.untokenize(iterable)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 273, in untokenize
            self.add_whitespace(start)
          File "/public/cpython-issue20387/Lib/tokenize.py", line 236, in add_whitespace
            .format(row, col, self.prev_row, self.prev_col))
        ValueError: start (73,8) precedes previous end (73,12)
    **********************************************************************
    1 items had failures:
       2 of 74 in test.test_tokenize.__test__.doctests
    ***Test Failed*** 2 failures.
    Traceback (most recent call last):
      File "Lib/test/test_tokenize.py", line 1261, in <module>
        test_main()
      File "Lib/test/test_tokenize.py", line 1252, in test_main
        support.run_doctest(test_tokenize, True)
      File "/public/cpython-issue20387/Lib/test/support/__init__.py", line 1854, in run_doctest
        raise TestFailed("%d of %d doctests failed" % (f, t))
    test.support.TestFailed: 2 of 79 doctests failed

@gumblex, can you review the failures and address them?
msg245613 - Author: Dingyuan Wang (gumblex) - Date: 2015-06-22 02:33

The new patch should now pass all tests correctly. The main idea (see the sketch after this message) is:

* if the token is INDENT, push it on the `indents` stack and continue
* if a new line starts, AND the position of the first token >= the length of the last indent level, we assume the indent is contained in the leading whitespace

The new test_tokenize.py fails: https://bitbucket.org/jaraco/cpython-issue20387/src/b7fe3c865b8dbdb33d26f4bc5cbb6096f5445fb2/Lib/test/test_tokenize.py?at=3.4#cl-1244 Line 1244 should be:

    codelines = self.roundtrip(code).split(b'\n')

It seems that the tokens generated by tokenize.tokenize don't contain enough information to restore the original file:

* Tabs between tokens are not preserved.
* A space before a backslash line continuation is not preserved (from test/tokenize_tests.txt):

        # Backslash means line continuation:
       -x = 1 \
       +x = 1\
        + 1

My roundtrip test code, copied here from #24447:

    python2 -c 'import sys, tokenize; sys.stdout.write(tokenize.untokenize(tokenize.generate_tokens(sys.stdin.readline)))'
    python3 -c 'import sys, tokenize; sys.stdout.buffer.write(tokenize.untokenize(tokenize.tokenize(sys.stdin.buffer.readline)))'
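A standalone sketch of that idea, close in spirit to the patch (the helper name is hypothetical; the real change lives inside tokenize.Untokenizer):

```python
import io
import tokenize
from tokenize import INDENT, DEDENT, NEWLINE, NL

def untokenize_keeping_indents(readline):
    """Hypothetical helper: remember INDENT strings on a stack and re-emit
    them verbatim at line start, instead of padding with spaces."""
    out = []
    indents = []            # stack of exact INDENT strings (tabs preserved)
    startline = True
    prev_end = (1, 0)
    for typ, string, start, end, line in tokenize.generate_tokens(readline):
        if typ == INDENT:
            indents.append(string)
            continue
        if typ == DEDENT:
            if indents:
                indents.pop()
            startline = True
            continue
        if startline:
            if indents and start[1] >= len(indents[-1]):
                out.append(indents[-1])          # the saved indent, verbatim
                prev_end = (start[0], len(indents[-1]))
            else:
                prev_end = (start[0], 0)
            startline = False
        if start[0] == prev_end[0] and start[1] > prev_end[1]:
            out.append(" " * (start[1] - prev_end[1]))  # space padding, as before
        out.append(string)
        prev_end = end
        if typ in (NEWLINE, NL):
            startline = True
            prev_end = (start[0] + 1, 0)
    return "".join(out)

print(untokenize_keeping_indents(io.StringIO("if False:\n\tx=3\n\ty=3\n").readline))
```

Note how the `start[1] >= len(indent)` guard matches the second bullet above: without it, a token starting before the indent width would make the emitted position run past the token's start, the failure mode seen in msg245577.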
msg245834 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-26 03:12

@gumblex, I've applied your updated patch (though I couldn't figure out why it wouldn't apply mechanically; I had to paste it). I also corrected the test. Thanks for the advice on that.

I don't understand the second half of your message. Are you stating caveats for our edification? Do you have outstanding concerns? Can you confirm your original report is corrected by the latest tip in the patch repository?

I believe what remains to be done is to add a NEWS entry and merge the changes with the 3.4, 3.5, and default branches. If a 2.7 backport is required, that could be done also.
msg245856 - Author: Dingyuan Wang (gumblex) - Date: 2015-06-26 15:21

I mean the patch only restores tabs in indentation. The reports above should be corrected. Tabs between tokens and similar corner cases can't be restored exactly given only the token stream. This won't affect the syntax. I wonder whether that remaining tokenize/untokenize roundtrip gap is a bug or a wont-fix, because fixing it involves adding whitespace information to the token stream.

(I can't pull down the whole hg repo, so the patch was created manually by diffing two versions.)
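A small illustration of that remaining limitation (a sketch against the fixed behavior: indentation tabs survive, intra-line tabs do not):

```python
import io
import tokenize

src = b"if True:\n\tx\t=\t1\n"
out = tokenize.untokenize(tokenize.tokenize(io.BytesIO(src).readline))
# The leading "\t" indent is preserved by the fix, but the tabs around "="
# are reconstructed from column positions as single spaces.
print(out)  # b'if True:\n\tx = 1\n'
```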
msg245908 - Author: Roundup Robot (python-dev) - Date: 2015-06-28 15:17

New changeset b784c842a63c by Jason R. Coombs in branch '3.4':
Issue #20387: Add test capturing failure to roundtrip indented code in tokenize module.
https://hg.python.org/cpython/rev/b784c842a63c

New changeset 49323e5f6391 by Jason R. Coombs in branch '3.4':
Issue #20387: Correct test to properly capture expectation.
https://hg.python.org/cpython/rev/49323e5f6391

New changeset ff47efeeed48 by Dingyuan Wang in branch '3.4':
Issue #20387: Restore retention of indentation during untokenize.
https://hg.python.org/cpython/rev/ff47efeeed48

New changeset 4856ae883041 by Jason R. Coombs in branch '3.4':
Issue #20387: Update Misc/NEWS
https://hg.python.org/cpython/rev/4856ae883041

New changeset 330e28b28334 by Jason R. Coombs in branch '3.4':
Issue #20387: Merge patch and test
https://hg.python.org/cpython/rev/330e28b28334

New changeset 9ce5c1f371f7 by Jason R. Coombs in branch '3.4':
Issue #20387: Merge
https://hg.python.org/cpython/rev/9ce5c1f371f7

New changeset 98380a6e037c by Jason R. Coombs in branch '3.5':
Issue #20387: Merge test and patch from 3.4.4
https://hg.python.org/cpython/rev/98380a6e037c

New changeset f2f5d1c928eb by Jason R. Coombs in branch 'default':
Issue #20387: Merge with 3.5
https://hg.python.org/cpython/rev/f2f5d1c928eb
msg245910 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-28 15:47

Patch and test applied to 3.4+. I'm inclined to backport this to Python 2.7, as that was where I encountered it originally.
msg245911 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-28 15:49

Benjamin, any objections to a backport of this patch?
msg245912 - Author: Roundup Robot (python-dev) - Date: 2015-06-28 17:09

New changeset 524a0e755797 by Jason R. Coombs in branch '2.7':
Issue #20387: Backport test from Python 3.4
https://hg.python.org/cpython/rev/524a0e755797

New changeset cb9df1ae287b by Jason R. Coombs in branch '2.7':
Issue #20387: Backport fix from Python 3.4
https://hg.python.org/cpython/rev/cb9df1ae287b
msg245913 - Author: Jason R. Coombs (jaraco) - Date: 2015-06-28 17:11

For the sake of expediency, I've gone ahead and backported and pushed the fix to 2.7. Please back out the changes if appropriate.
msg254862 - Author: Martin Panter (martin.panter) - Date: 2015-11-18 23:07

It seems the problem with tabs in the indentation is fixed. Can we close this? The problem with tabs between tokens in a line would be better handled in a separate report. But I suspect it is at worst a documentation problem: you can't guarantee reconstruction of the exact source from just the tokens.
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:57:57 | admin | set | github: 64586 |
| 2015-11-19 00:53:07 | jaraco | set | status: open -> closed |
| 2015-11-18 23:07:58 | martin.panter | set | nosy: + martin.panter; messages: + msg254862 |
| 2015-06-28 17:11:10 | jaraco | set | resolution: fixed; messages: + msg245913 |
| 2015-06-28 17:09:01 | python-dev | set | messages: + msg245912 |
| 2015-06-28 15:49:17 | jaraco | set | nosy: + benjamin.peterson; messages: + msg245911 |
| 2015-06-28 15:47:54 | jaraco | set | messages: + msg245910; versions: + Python 3.5, - Python 3.3 |
| 2015-06-28 15:17:53 | python-dev | set | nosy: + python-dev; messages: + msg245908 |
| 2015-06-26 15:21:44 | gumblex | set | messages: + msg245856 |
| 2015-06-26 03:12:26 | jaraco | set | messages: + msg245834; stage: needs patch -> patch review |
| 2015-06-22 02:33:13 | gumblex | set | files: + tokenize.patch; messages: + msg245613 |
| 2015-06-21 00:14:44 | jaraco | set | messages: + msg245577 |
| 2015-06-20 23:58:18 | jaraco | set | assignee: jaraco; messages: + msg245576; hgrepos: + hgrepo313 |
| 2015-06-20 15:14:02 | jaraco | set | messages: + msg245563 |
| 2015-06-20 07:13:46 | gumblex | set | files: + tokenize.patch; nosy: + gumblex; messages: + msg245547; keywords: + patch |
| 2015-06-19 19:33:18 | terry.reedy | set | messages: + msg245515 |
| 2015-06-19 19:32:43 | terry.reedy | link | issue24447 superseder |
| 2014-02-02 10:30:38 | terry.reedy | set | messages: + msg209967 |
| 2014-02-02 10:08:37 | terry.reedy | set | messages: + msg209965 |
| 2014-02-02 08:30:17 | terry.reedy | set | stage: needs patch; messages: + msg209964; versions: + Python 3.3 |
| 2014-02-02 04:42:54 | terry.reedy | set | nosy: + terry.reedy; messages: + msg209946 |
| 2014-01-24 23:08:28 | Arfrever | set | nosy: + Arfrever |
| 2014-01-24 22:51:15 | jaraco | create | |