Issue 12691: tokenize.untokenize is broken

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56900

classification

Title:	tokenize.untokenize is broken
Type:	behavior	Stage:	resolved
Components:	Documentation, Library (Lib), Tests	Versions:	Python 3.3, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	terry.reedy	Nosy List:	benjamin.peterson, daniel.urban, docs@python, eric.snow, ezio.melotti, gdr@garethrees.org, meador.inge, python-dev, r.david.murray, sandro.tosi, serhiy.storchaka, takluyver, terry.reedy, yselivanov
Priority:	normal	Keywords:	patch

Created on 2011-08-04 22:21 by gdr@garethrees.org, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
Issue12691.patch	gdr@garethrees.org, 2011-08-05 23:04		review
Issue12691.patch	gdr@garethrees.org, 2011-08-06 00:17		review
Issue12691.patch	gdr@garethrees.org, 2014-02-05 10:59		review

Messages (22)
msg141634 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2011-08-04 22:21
tokenize.untokenize is completely broken. Python 3.2.1 (default, Jul 19 2011, 00:09:43) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import tokenize, io >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline)) >>> tokenize.untokenize(t) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py", line 250, in untokenize out = ut.untokenize(iterable) File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py", line 179, in untokenize self.add_whitespace(start) File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py", line 165, in add_whitespace assert row <= self.prev_row AssertionError The assertion is simply bogus: the <= should be >=. The reason why no-one has spotted this is that the unit tests for the tokenize module only ever call untokenize() in "compatibility" mode, passing in a 2-tuple instead of a 5-tuple. I propose to fix this, and add unit tests, at the same time as fixing other problems with tokenize.py (issue12675).
msg141636 - (view)	Author: Sandro Tosi (sandro.tosi) *	Date: 2011-08-04 22:32
Hi Gareth, would you like to provide a patch to fix the bug you spotted and add the relative case into the testsuite?
msg141638 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2011-08-04 23:01
See my last paragraph: I propose to deliver a single patch that fixes both this bug and issue12675. I hope this is OK. (If you prefer, I'll try to split the patch in two.) I just noticed another bug in untokenize(): in compatibility mode, if untokenize() is passed an iterator rather than a list, then the first token gets discarded: Python 3.3.0a0 (default:c099ba0a278e, Aug 2 2011, 12:35:03) [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from tokenize import untokenize >>> from token import * >>> untokenize([(NAME, 'hello')]) 'hello ' >>> untokenize(iter([(NAME, 'hello')])) '' No-one's noticed this because the unit tests only ever pass lists to untokenize().
msg141641 - (view)	Author: Sandro Tosi (sandro.tosi) *	Date: 2011-08-04 23:22
The general rule would be to have separate patches. But in this case, if we have interdipendent changes, then those should be "packed" in a single patch (f.e. if changes to tokenize break untokenize, than those parts should be joined).
msg141687 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2011-08-05 20:42
I think I can make these changes independently and issue two patches, one fixing the problems with untokenize listed here, and another improving tokenize. I've just noticed a third bug in untokenize: in full mode, it doesn't handle backslash-continued lines correctly. Python 3.3.0a0 (default:c099ba0a278e, Aug 2 2011, 12:35:03) [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from io import BytesIO >>> from tokenize import tokenize, untokenize >>> untokenize(tokenize(BytesIO('1 and \\\n not 2'.encode('utf8')).readline)) b'1 andnot 2'
msg141694 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2011-08-05 23:04
Please find attached a patch containing four bug fixes for untokenize(): * untokenize() now always returns a bytes object, defaulting to UTF-8 if no ENCODING token is found (previously it returned a string in this case). * In compatibility mode, untokenize() successfully processes all tokens from an iterator (previously it discarded the first token). * In full mode, untokenize() now returns successfully (previously it asserted). * In full mode, untokenize() successfully processes tokens that were separated by a backslashed newline in the original source (previously it ran these tokens together). In addition, I've added some unit tests: * Test case for backslashed newline. * Test case for missing ENCODING token. * roundtrip() tests both modes of untokenize() (previously it just tested compatibility mode). and updated the documentation: * Update the docstring for untokenize to better describe its actual behaviour, and remove the false claim "Untokenized source will match input source exactly". (We can restore this claim if we ever fix tokenize/untokenize so that it's true.) * Update the documentation for untokenize in tokenize.rdt to match the docstring. I welcome review: this is my first proper patch to Python.
msg141698 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2011-08-06 00:17
Thanks Ezio for the review. I've made all the changes you requested, (except for the re-ordering of paragraphs in the documentation, which I don't want to do because that would lead to the "round-trip property" being mentioned before it's defined). Revised patch attached.
msg180844 - (view)	Author: Thomas Kluyver (takluyver) *	Date: 2013-01-28 10:54
Is there anything I can do to push this forwards? I'm trying to use tokenize and untokenize in IPython, and for now I'm going to have to maintain our own copies of it (for Python 2 and 3), because I keep running into problems with the standard library module.
msg181243 - (view)	Author: Meador Inge (meador.inge) *	Date: 2013-02-03 05:07
I will take a look. As it stands the current patch fixes way too many issues. Patches should fix one issue at a time. I will look at fixing the original assert problem and at opening new issues for the others (assuming there aren't already issues for them).
msg209839 - (view)	Author: Yury Selivanov (yselivanov) *	Date: 2014-01-31 22:24
bump?
msg210283 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2014-02-05 01:52
Yury, let me see if I can move this issue forward. I clearly haven't done a good job of explaining these problems, how they are related, and why it makes sense to solve them together, so let me have a go now. 1. tokenize.untokenize() raises AssertionError if you pass it a sequence of tokens output from tokenize.tokenize(). This was my original problem report, and it's still not fixed in Python 3.4: Python 3.4.0b3 (default, Jan 27 2014, 02:26:41) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import tokenize, io >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline)) >>> tokenize.untokenize(t) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 317, in untokenize out = ut.untokenize(iterable) File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 246, in untokenize self.add_whitespace(start) File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 232, in add_whitespace assert row <= self.prev_row AssertionError This defeats any attempt to use the sequence: input code -> tokenize -> transform -> untokenize -> output code to transform Python code. But this ought to be the main use case for the untokenize function! That's how I came across the problem in the first place, when I was starting to write Minipy <https://github.com/gareth-rees/minipy>. 2. Fixing problem #1 is easy (just swap <= for >=), but it raises the question: why wasn't this mistake caught by test_tokenize? There's a test function roundtrip() whose docstring says: Test roundtrip for `untokenize`. `f` is an open file or a string. The source code in f is tokenized, converted back to source code via tokenize.untokenize(), and tokenized again from the latter. The test fails if the second tokenization doesn't match the first. If I don't fix the problem with roundtrip(), then how can I be sure I have fixed the problem? Clearly it's necessary to fix the test case and establish that it provokes the assertion. So why doesn't roundtrip() detect the error? Well, it turns out that tokenize.untokenize() has two modes of operation and roundtrip() only tests one of them. The documentation for tokenize.untokenize() is rather cryptic, and all it says is: Each element returned by the [input] iterable must be a token sequence with at least two elements, a token number and token value. If only two tokens are passed, the resulting output is poor. By reverse-engineering the implementation, it seems that it has two modes of operation. In the first mode (which I have called "compatibility" mode after the method Untokenizer.compat() that implements it) you pass it tokens in the form of 2-element tuples (type, text). These must have exactly 2 elements. In the second mode (which I have called "full" mode based on the description "full input" in the docstring) you pass it tokens in the form of tuples with 5 elements (type, text, start, end, line). These are compatible with the namedtuples returned from tokenize.tokenize(). The "full" mode has the buggy assertion, but test_tokenize.roundtrip() only tests the "compatibility" mode. So I must (i) fix roundtrip() so that it tests both modes; (ii) improve the documentation for tokenize.untokenize() so that programmers have some chance of figuring this out in future! 3. As soon as I make roundtrip() test both modes it provokes the assertion failure. Good, so I can fix the assertion. Problem #1 solved. But now there are test failures in "full" mode: $ ./python.exe -m test test_tokenize [1/1] test_tokenize ******************************************************************** File "/Users/gdr/hg.python.org/cpython/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests Failed example: for testfile in testfiles: if not roundtrip(open(testfile, 'rb')): print("Roundtrip failed for file %s" % testfile) break else: True Expected: True Got: Roundtrip failed for file /Users/gdr/hg.python.org/cpython/Lib/test/test_platform.py ****************************************************************** 1 items had failures: 1 of 73 in test.test_tokenize.__test__.doctests Test Failed** 1 failures. test test_tokenize failed -- 1 of 78 doctests failed 1 test failed: test_tokenize Examination of the failed tokenization shows that if the source contains a backslashed-newline, then in "full" mode Untokenize.add_whitespace() fails to leave any space between the token before the blackslash-newline and the token afterwards. The trouble is that backslash-newlines don't result in a token from tokenize(). But in update Untokenize.add_whitespace() I can deduce that there must have been a backslash-newline whenever I see the first token on a new line (other than a DEDENT or ENDMARKER) and the previous token was not a NEWLINE or NL. 4. OK, with problems 1-3 fixed, shouldn't I now actually test the documented round-trip property fully? The property that roundtrip() currently tests is: tokenize(untokenize(tokenize(code))) == tokenize(code) which is all that is guaranteed in "compatibility" mode. But the docstring for tokenize.untokenize() says: Round-trip invariant for full input: Untokenized source will match input source exactly So in "full" mode it seems that the following is guaranteed: untokenize(tokenize(code)) == code But if I add a test for this stronger round-trip property, it immediately fails. The docstring is not telling the truth! 5. Why doesn't the "full" round-trip property hold? It seems that there are (at least) two separate problems: i. tokenize() doesn't output a token for a backslash-newline so although untokenize() can deduce that there must have been one, it can't know what whitespace there was before it. ii. In compatibility mode, the first token is always discarded. If the tokens came from tokenize.tokenize(), then this is the ENCODING token, so the encoding is never applied. (Benjamin Peterson had previously discovered this problem but just suppressed the failing test case instead of reporting it: <http://hg.python.org/cpython/rev/c13abed5d764>) Looking through this sequence of woes again, it's clear that I can save problems 4 and 5 for another issue. But I am confident that problems 1-3 must be fixed together. So although Meador Inge said above that "the current patch fixes way too many issues", I hope that if he reads the argument above he will see why I did what I did: this is the minimum sensible change I know how to make. I have prepared a revised patch that applies to the current head. Please review it: I'll do my best to fix all the problems you find.
msg210299 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2014-02-05 10:59
This morning I noticed that I had forgotten to update the library reference, and I also noticed two more problems to add to the list above: 6. Although Lib/test/test_tokenize.py looks like it contains tests for backslash-newline handling, these tests are ineffective. Here they are: >>> roundtrip("x=1+\\\\n" ... "1\\n" ... "# This is a comment\\\\n" ... "# This also\\n") True >>> roundtrip("# Comment \\\\nx = 0") True There are two problems here: (i) because of the double string escaping, these are not backslash-newline, they are backslash-n. (ii) the roundtrip() test is too weak to detect this problem: tokenize() outputs an ERRORTOKEN for the backslash and untokenize() restores it. So the round-trip property is satisfied. 7. Problem 6 shows the difficulty of using doctests for this kind of test. It would be easier to ensure the correctness of these tests if the docstring was read from a separate file, so that at least the tests only need one level of string escaping. I fixed problem 6 by updating these tests to use dump_tokens() instead of roundtrip(). I have not fixed problem 7 (like 4 and 5, I can leave it for another issue). Revised patch attached.
msg210339 - (view)	Author: Yury Selivanov (yselivanov) *	Date: 2014-02-05 19:06
Gareth, Thanks a lot for such a comprehensive writeup and the patch. Please give me a day or two to do the review.
msg210404 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2014-02-06 15:50
I did some research on the cause of this issue. The assertion was added in this change by Jeremy Hylton in August 2006: <https://mail.python.org/pipermail/python-checkins/2006-August/055812.html> (The corresponding Mercurial commit is here: <http://hg.python.org/cpython/rev/cc992d75d5b3#l217.25>). At that point I believe the assertion was reasonable. I think it would have been triggered by backslash-continued lines, but otherwise it worked. But in this change <http://hg.python.org/cpython/rev/51e24512e305> in March 2008 Trent Nelson applied this patch by Michael Foord <http://bugs.python.org/file9741/tokenize_patch.diff> to implement PEP 263 and fix issue719888. The patch added ENCODING tokens to the output of tokenize.tokenize(). The ENCODING token is always generated with row number 0, while the first actual token is generated with row number 1. So now every token stream from tokenize.tokenize() sets off the assertion. The lack of a test case for tokenize.untokenize() in "full" mode meant that it was (and is) all too easy for someone to accidentally break it like this.
msg211449 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-17 21:50
New changeset c896d292080a by Terry Jan Reedy in branch '2.7': Untokenize: An logically incorrect assert tested user input validity. http://hg.python.org/cpython/rev/c896d292080a New changeset 51e5a89afb3b by Terry Jan Reedy in branch '3.3': Untokenize: An logically incorrect assert tested user input validity. http://hg.python.org/cpython/rev/51e5a89afb3b
msg211470 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2014-02-18 02:33
The problem of the first iterator pair token being discarded is the subject of #8478. Consider that part of this issue as being closed as a duplicate. The issue of a string being returned if there is no encoding should have been opened as a separate issue, and it was in #16223. I explain there why I think the behavior should be left as is and the docstring changed, and the doc clarified.
msg211544 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2014-02-18 20:20
I fixed the assert and dropped first iter compat-mode token bugs one-by-one by writing narrow unittests that fail and code that makes them pass. I am now working on the '\' continuation issue. That is the subject of #9974, which has a nearly identical patch. Where it differs from yours, I will choose on the basis of tests. Any further discussion of this bug should be on that issue. I appreciate the warning that the full mode is undertested, so I need to be concerned about breaking untested functionality that works. That was not so much a concern with the first two issues.
msg211560 - (view)	Author: Gareth Rees (gdr@garethrees.org) *	Date: 2014-02-18 21:16
Thanks for your work on this, Terry. I apologise for the complexity of my original report, and will try not to do it again.
msg212042 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-23 23:01
New changeset 8d6dd02a973f by Terry Jan Reedy in branch '3.3': Issue #20750, Enable roundtrip tests for new 5-tuple untokenize. The http://hg.python.org/cpython/rev/8d6dd02a973f
msg212059 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-24 04:40
New changeset 0f0e9b7d4f1d by Terry Jan Reedy in branch '2.7': Issue #9974: When untokenizing, use row info to insert backslash+newline. http://hg.python.org/cpython/rev/0f0e9b7d4f1d New changeset 24b4cd5695d9 by Terry Jan Reedy in branch '3.3': Issue #9974: When untokenizing, use row info to insert backslash+newline. http://hg.python.org/cpython/rev/24b4cd5695d9
msg266714 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-05-30 19:09
Is there anything left to do here?
msg266715 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-05-30 19:33
If there are, I can't remember. This was one or 7 or 8 untokenize issues with about 5 separate bugs between them. If there are any current untokenize issues not covered by some other open issue, then a new issue should be opened.

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56900
2016-05-30 19:33:24	terry.reedy	set	status: open -> closed resolution: fixed messages: + msg266715 stage: patch review -> resolved
2016-05-30 19:09:48	r.david.murray	set	nosy: + r.david.murray messages: + msg266714
2014-02-24 04:40:41	python-dev	set	messages: + msg212059
2014-02-23 23:01:28	python-dev	set	messages: + msg212042
2014-02-18 21:16:23	gdr@garethrees.org	set	messages: + msg211560
2014-02-18 20:20:02	terry.reedy	set	messages: + msg211544
2014-02-18 02:33:32	terry.reedy	set	messages: + msg211470
2014-02-17 21:50:20	python-dev	set	nosy: + python-dev messages: + msg211449
2014-02-17 21:18:03	terry.reedy	set	assignee: docs@python -> terry.reedy stage: test needed -> patch review nosy: + terry.reedy versions: + Python 2.7, Python 3.3
2014-02-06 15:50:03	gdr@garethrees.org	set	messages: + msg210404
2014-02-05 19:06:26	yselivanov	set	messages: + msg210339
2014-02-05 18:38:02	gdr@garethrees.org	set	files: - Issue12691.patch
2014-02-05 11:40:18	gdr@garethrees.org	set	assignee: docs@python components: + Documentation, Tests nosy: + docs@python
2014-02-05 10:59:04	gdr@garethrees.org	set	files: + Issue12691.patch messages: + msg210299
2014-02-05 01:54:40	gdr@garethrees.org	set	nosy: + benjamin.peterson
2014-02-05 01:52:52	gdr@garethrees.org	set	files: + Issue12691.patch messages: + msg210283
2014-01-31 22:24:27	yselivanov	set	nosy: + yselivanov messages: + msg209839 versions: + Python 3.4, - Python 3.2, Python 3.3
2013-02-03 05:07:35	meador.inge	set	messages: + msg181243
2013-01-28 10:54:13	takluyver	set	nosy: + takluyver messages: + msg180844
2012-12-29 05:25:46	meador.inge	set	nosy: + meador.inge
2012-10-15 21:07:13	serhiy.storchaka	set	nosy: + serhiy.storchaka
2011-08-06 00:17:56	gdr@garethrees.org	set	files: + Issue12691.patch messages: + msg141698
2011-08-05 23:04:04	gdr@garethrees.org	set	files: + Issue12691.patch keywords: + patch messages: + msg141694
2011-08-05 20:42:41	gdr@garethrees.org	set	messages: + msg141687
2011-08-05 19:33:13	daniel.urban	set	nosy: + daniel.urban
2011-08-04 23:27:34	eric.snow	set	nosy: + eric.snow
2011-08-04 23:22:52	sandro.tosi	set	messages: + msg141641
2011-08-04 23:01:00	gdr@garethrees.org	set	messages: + msg141638
2011-08-04 22:32:14	sandro.tosi	set	nosy: + sandro.tosi messages: + msg141636
2011-08-04 22:22:27	ezio.melotti	set	nosy: + ezio.melotti stage: test needed
2011-08-04 22:21:28	gdr@garethrees.org	create