Yury, let me see if I can move this issue forward. I clearly haven't
done a good job of explaining these problems, how they are related,
and why it makes sense to solve them together, so let me have a go.

1. tokenize.untokenize() raises AssertionError if you pass it a
   sequence of tokens output from tokenize.tokenize(). This was my
   original problem report, and it's still not fixed in Python 3.4:

      Python 3.4.0b3 (default, Jan 27 2014, 02:26:41) 
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import tokenize, io
      >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
      >>> tokenize.untokenize(t)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/", line 317, in untokenize
          out = ut.untokenize(iterable)
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/", line 246, in untokenize
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/", line 232, in add_whitespace
          assert row <= self.prev_row

   This defeats any attempt to use the sequence:

      input code -> tokenize -> transform -> untokenize -> output code

   to transform Python code. But this ought to be the main use case
   for the untokenize function! That's how I came across the problem
   in the first place, when I was starting to write Minipy.
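
   For concreteness, the pipeline I want is the following minimal
   sketch (the transform here is the identity; the last line assumes
   the assertion bug is fixed):

      import io, tokenize

      def transform(tokens):
          # A real tool would rewrite the token stream here.
          return list(tokens)

      source = 'x = 1\ny = 2\n'.encode('utf8')
      tokens = transform(tokenize.tokenize(io.BytesIO(source).readline))
      result = tokenize.untokenize(tokens)  # bytes, thanks to the ENCODING token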

2. Fixing problem #1 is easy (just swap <= for >=), but it raises the
   question: why wasn't this mistake caught by test_tokenize? There's
   a test function roundtrip() whose docstring says:

      Test roundtrip for `untokenize`. `f` is an open file or a
      string. The source code in f is tokenized, converted back to
      source code via tokenize.untokenize(), and tokenized again from
      the latter. The test fails if the second tokenization doesn't
      match the first.

   If roundtrip() can't detect the bug, then how can I be sure I have
   fixed it? Clearly it's necessary to fix the test case first and
   establish that it provokes the assertion.

   So why doesn't roundtrip() detect the error? Well, it turns out
   that tokenize.untokenize() has two modes of operation and
   roundtrip() only tests one of them.

   The documentation for tokenize.untokenize() is rather cryptic, and
   all it says is:

      Each element returned by the [input] iterable must be a token
      sequence with at least two elements, a token number and token
      value. If only two tokens are passed, the resulting output is
      poor.

   By reverse-engineering the implementation, it seems that it has two
   modes of operation.

   In the first mode (which I have called "compatibility" mode after
   the method Untokenizer.compat() that implements it) you pass it
   tokens in the form of 2-element tuples (type, text). These must
   have exactly 2 elements.

   In the second mode (which I have called "full" mode based on the
   description "full input" in the docstring) you pass it tokens in
   the form of tuples with 5 elements (type, text, start, end, line).
   These are compatible with the namedtuples returned from
   tokenize.tokenize().

   The "full" mode has the buggy assertion, but
   test_tokenize.roundtrip() only tests the "compatibility" mode.
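
   To make the distinction concrete, here is how the two modes are
   selected, in a minimal sketch (the mode is chosen by the number of
   elements in the token sequences):

      import io, tokenize

      toks = list(tokenize.tokenize(io.BytesIO(b'1 + 1\n').readline))

      # "full" mode: 5-tuples (type, text, start, end, line), exactly
      # what tokenize.tokenize() produces; this is the mode with the
      # buggy assertion.
      tokenize.untokenize(toks)

      # "compatibility" mode: 2-tuples (type, text) only; this is the
      # only mode that roundtrip() exercises.
      tokenize.untokenize(tok[:2] for tok in toks)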

   So I must (i) fix roundtrip() so that it tests both modes; (ii)
   improve the documentation for tokenize.untokenize() so that
   programmers have some chance of figuring this out in future!
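
   Something along these lines is what the strengthened roundtrip()
   needs to do (a sketch only; the helper name is mine, and the real
   change is a patch to test_tokenize):

      import io, tokenize

      def roundtrip_both_modes(source):
          t1 = list(tokenize.tokenize(io.BytesIO(source).readline))

          # Compatibility mode, as tested today: 2-tuples in, then
          # re-tokenize and compare the (type, text) pairs.
          code = tokenize.untokenize(tok[:2] for tok in t1)
          if isinstance(code, str):  # the encoding may have been lost
              code = code.encode('utf-8')
          t2 = list(tokenize.tokenize(io.BytesIO(code).readline))
          assert [t[:2] for t in t1] == [t[:2] for t in t2]

          # Full mode, untested today: 5-tuples in, same weak check.
          code = tokenize.untokenize(t1)
          t3 = list(tokenize.tokenize(io.BytesIO(code).readline))
          assert [t[:2] for t in t1] == [t[:2] for t in t3]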

3. As soon as I make roundtrip() test both modes it provokes the
   assertion failure. Good, so I can fix the assertion, and problem
   #1 is solved.

   But now there are test failures in "full" mode:

      $ ./python.exe -m test test_tokenize
      [1/1] test_tokenize
      File "/Users/gdr/", line ?, in test.test_tokenize.__test__.doctests
      Failed example:
          for testfile in testfiles:
              if not roundtrip(open(testfile, 'rb')):
                  print("Roundtrip failed for file %s" % testfile)
          else: True
          Roundtrip failed for file /Users/gdr/
      1 items had failures:
         1 of  73 in test.test_tokenize.__test__.doctests
      ***Test Failed*** 1 failures.
      test test_tokenize failed -- 1 of 78 doctests failed
      1 test failed:
          test_tokenize

   Examination of the failed tokenization shows that if the source
   contains a backslash-newline, then in "full" mode
   Untokenizer.add_whitespace() fails to leave any space between the
   token before the backslash-newline and the token afterwards.

   The trouble is that backslash-newlines don't result in a token from
   tokenize(). But in Untokenizer.add_whitespace() I can deduce that
   there must have been a backslash-newline whenever I see the first
   token on a new line (other than a DEDENT or ENDMARKER) and the
   previous token was not a NEWLINE or NL, as sketched below.
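
   In code, the deduction amounts to something like this sketch of
   add_whitespace() (the corrected assertion from problem #1 is
   included; per the rule above, the caller must not send DEDENT or
   ENDMARKER tokens here):

      def add_whitespace(self, start):
          row, col = start
          assert row >= self.prev_row  # was the buggy: row <= self.prev_row
          row_offset = row - self.prev_row
          if row_offset:
              # No token was emitted for the backslash-newline(s), but
              # a token starting on a later row with no intervening
              # NEWLINE or NL implies they were there.
              self.tokens.append("\\\n" * row_offset)
              self.prev_col = 0
          col_offset = col - self.prev_col
          if col_offset:
              self.tokens.append(" " * col_offset)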

4. OK, with problems 1-3 fixed, shouldn't I now actually test the
   documented round-trip property fully? The property that roundtrip()
   currently tests is:

      tokenize(untokenize(tokenize(code))) == tokenize(code)

   which is all that is guaranteed in "compatibility" mode. But the
   docstring for tokenize.untokenize() says:

      Round-trip invariant for full input:
          Untokenized source will match input source exactly

   So in "full" mode it seems that the following is guaranteed:

      untokenize(tokenize(code)) == code

   But if I add a test for this stronger round-trip property, it
   immediately fails. The docstring is not telling the truth!
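
   The stronger check amounts to this (a sketch; the helper name is
   mine):

      import io, tokenize

      def check_full_roundtrip(source):
          # The exact-match property promised for "full" input.
          tokens = tokenize.tokenize(io.BytesIO(source).readline)
          assert tokenize.untokenize(tokens) == source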

5. Why doesn't the "full" round-trip property hold? It seems that
   there are (at least) two separate problems:

   i. tokenize() doesn't output a token for a backslash-newline, so
      although untokenize() can deduce that there must have been one,
      it can't know what whitespace there was before it.

   ii. In compatibility mode, the first token is always discarded. If
      the tokens came from tokenize.tokenize(), then this is the
      ENCODING token, so the encoding is never applied.

      (Benjamin Peterson had previously discovered this problem but
      just suppressed the failing test case instead of reporting it.)
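
   Both failures are easy to demonstrate (a sketch, assuming the fixes
   for problems 1-3 are applied so that "full" mode runs at all):

      import io, tokenize

      # (i) The whitespace before the backslash-newline is not in the
      # token stream, so it cannot come back:
      src = b'x = 1 + \\\n    2\n'
      toks = list(tokenize.tokenize(io.BytesIO(src).readline))
      print(tokenize.untokenize(toks) == src)  # False: space before \ is gone

      # (ii) 2-tuples make compatibility mode discard the first token,
      # which is ENCODING, so the result is an unencoded str:
      print(type(tokenize.untokenize(tok[:2] for tok in toks)))  # <class 'str'>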

Looking through this sequence of woes again, it's clear that I can
save problems 4 and 5 for another issue. But I am confident that
problems 1-3 must be fixed together. So although Meador Inge said
above that "the current patch fixes way too many issues", I hope that
if he reads the argument above he will see why I did what I did: this
is the minimum sensible change I know how to make.

I have prepared a revised patch that applies to the current head.
Please review it: I'll do my best to fix all the problems you find.