classification
Title: tokenize.untokenize is broken
Type: behavior Stage: resolved
Components: Documentation, Library (Lib), Tests Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: terry.reedy Nosy List: benjamin.peterson, daniel.urban, docs@python, eric.snow, ezio.melotti, gdr@garethrees.org, meador.inge, python-dev, r.david.murray, sandro.tosi, serhiy.storchaka, takluyver, terry.reedy, yselivanov
Priority: normal Keywords: patch

Created on 2011-08-04 22:21 by gdr@garethrees.org, last changed 2016-05-30 19:33 by terry.reedy. This issue is now closed.

Files
File name Uploaded Description Edit
Issue12691.patch gdr@garethrees.org, 2011-08-05 23:04 review
Issue12691.patch gdr@garethrees.org, 2011-08-06 00:17 review
Issue12691.patch gdr@garethrees.org, 2014-02-05 10:59 review
Messages (22)
msg141634 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2011-08-04 22:21
tokenize.untokenize is completely broken.

    Python 3.2.1 (default, Jul 19 2011, 00:09:43) 
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tokenize, io
    >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
    >>> tokenize.untokenize(t)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py", line 250, in untokenize
        out = ut.untokenize(iterable)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py", line 179, in untokenize
        self.add_whitespace(start)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py", line 165, in add_whitespace
        assert row <= self.prev_row
    AssertionError

The assertion is simply bogus: the <= should be >=.

The reason why no-one has spotted this is that the unit tests for the tokenize module only ever call untokenize() in "compatibility" mode, passing in a 2-tuple instead of a 5-tuple.

I propose to fix this, and add unit tests, at the same time as fixing other problems with tokenize.py (issue12675).
msg141636 - (view) Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2011-08-04 22:32
Hi Gareth, would you like to provide a patch to fix the bug you spotted and add the relative case into the testsuite?
msg141638 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2011-08-04 23:01
See my last paragraph: I propose to deliver a single patch that fixes both this bug and issue12675. I hope this is OK. (If you prefer, I'll try to split the patch in two.)

I just noticed another bug in untokenize(): in compatibility mode, if untokenize() is passed an iterator rather than a list, then the first token gets discarded:

    Python 3.3.0a0 (default:c099ba0a278e, Aug  2 2011, 12:35:03) 
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from tokenize import untokenize
    >>> from token import *
    >>> untokenize([(NAME, 'hello')])
    'hello '
    >>> untokenize(iter([(NAME, 'hello')]))
    ''

No-one's noticed this because the unit tests only ever pass lists to untokenize().
msg141641 - (view) Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2011-08-04 23:22
The general rule would be to have separate patches. But in this case, if we have interdipendent changes, then those should be "packed" in a single patch (f.e. if changes to tokenize break untokenize, than those parts should be joined).
msg141687 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2011-08-05 20:42
I think I can make these changes independently and issue two patches, one fixing the problems with untokenize listed here, and another improving tokenize.

I've just noticed a third bug in untokenize: in full mode, it doesn't handle backslash-continued lines correctly.

    Python 3.3.0a0 (default:c099ba0a278e, Aug  2 2011, 12:35:03) 
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from io import BytesIO
    >>> from tokenize import tokenize, untokenize
    >>> untokenize(tokenize(BytesIO('1 and \\\n not 2'.encode('utf8')).readline))
    b'1 andnot 2'
msg141694 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2011-08-05 23:04
Please find attached a patch containing four bug fixes for untokenize():

* untokenize() now always returns a bytes object, defaulting to UTF-8 if no ENCODING token is found (previously it returned a string in this case).
* In compatibility mode, untokenize() successfully processes all tokens from an iterator (previously it discarded the first token).
* In full mode, untokenize() now returns successfully (previously it asserted).
* In full mode, untokenize() successfully processes tokens that were separated by a backslashed newline in the original source (previously it ran these tokens together).

In addition, I've added some unit tests:

* Test case for backslashed newline.
* Test case for missing ENCODING token.
* roundtrip() tests both modes of untokenize() (previously it just tested compatibility mode).

and updated the documentation:

* Update the docstring for untokenize to better describe its actual behaviour, and remove the false claim "Untokenized source will match input source exactly". (We can restore this claim if we ever fix tokenize/untokenize so that it's true.)
* Update the documentation for untokenize in tokenize.rdt to match the docstring.

I welcome review: this is my first proper patch to Python.
msg141698 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2011-08-06 00:17
Thanks Ezio for the review. I've made all the changes you requested, (except for the re-ordering of paragraphs in the documentation, which I don't want to do because that would lead to the "round-trip property" being mentioned before it's defined). Revised patch attached.
msg180844 - (view) Author: Thomas Kluyver (takluyver) * Date: 2013-01-28 10:54
Is there anything I can do to push this forwards? I'm trying to use tokenize and untokenize in IPython, and for now I'm going to have to maintain our own copies of it (for Python 2 and 3), because I keep running into problems with the standard library module.
msg181243 - (view) Author: Meador Inge (meador.inge) * (Python committer) Date: 2013-02-03 05:07
I will take a look.  As it stands the current patch fixes way too many
issues.  Patches should fix *one* issue at a time.  I will look at fixing
the original assert problem and at opening new issues for the others
(assuming there aren't already issues for them).
msg209839 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2014-01-31 22:24
bump?
msg210283 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2014-02-05 01:52
Yury, let me see if I can move this issue forward. I clearly haven't
done a good job of explaining these problems, how they are related,
and why it makes sense to solve them together, so let me have a go
now.

1. tokenize.untokenize() raises AssertionError if you pass it a
   sequence of tokens output from tokenize.tokenize(). This was my
   original problem report, and it's still not fixed in Python 3.4:

      Python 3.4.0b3 (default, Jan 27 2014, 02:26:41) 
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import tokenize, io
      >>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
      >>> tokenize.untokenize(t)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 317, in untokenize
          out = ut.untokenize(iterable)
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 246, in untokenize
          self.add_whitespace(start)
        File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/tokenize.py", line 232, in add_whitespace
          assert row <= self.prev_row
      AssertionError

   This defeats any attempt to use the sequence:

      input code -> tokenize -> transform -> untokenize -> output code

   to transform Python code. But this ought to be the main use case
   for the untokenize function! That's how I came across the problem
   in the first place, when I was starting to write Minipy
   <https://github.com/gareth-rees/minipy>.

2. Fixing problem #1 is easy (just swap <= for >=), but it raises the
   question: why wasn't this mistake caught by test_tokenize? There's
   a test function roundtrip() whose docstring says:

      Test roundtrip for `untokenize`. `f` is an open file or a
      string. The source code in f is tokenized, converted back to
      source code via tokenize.untokenize(), and tokenized again from
      the latter. The test fails if the second tokenization doesn't
      match the first.

   If I don't fix the problem with roundtrip(), then how can I be
   sure I have fixed the problem? Clearly it's necessary to fix the
   test case and establish that it provokes the assertion.

   So why doesn't roundtrip() detect the error? Well, it turns out
   that tokenize.untokenize() has two modes of operation and
   roundtrip() only tests one of them.

   The documentation for tokenize.untokenize() is rather cryptic, and
   all it says is:

      Each element returned by the [input] iterable must be a token
      sequence with at least two elements, a token number and token
      value. If only two tokens are passed, the resulting output is
      poor.

   By reverse-engineering the implementation, it seems that it has two
   modes of operation.

   In the first mode (which I have called "compatibility" mode after
   the method Untokenizer.compat() that implements it) you pass it
   tokens in the form of 2-element tuples (type, text). These must
   have exactly 2 elements.

   In the second mode (which I have called "full" mode based on the
   description "full input" in the docstring) you pass it tokens in
   the form of tuples with 5 elements (type, text, start, end, line).
   These are compatible with the namedtuples returned from
   tokenize.tokenize().

   The "full" mode has the buggy assertion, but
   test_tokenize.roundtrip() only tests the "compatibility" mode.

   So I must (i) fix roundtrip() so that it tests both modes; (ii)
   improve the documentation for tokenize.untokenize() so that
   programmers have some chance of figuring this out in future!

3. As soon as I make roundtrip() test both modes it provokes the
   assertion failure. Good, so I can fix the assertion. Problem #1
   solved.

   But now there are test failures in "full" mode:

      $ ./python.exe -m test test_tokenize
      [1/1] test_tokenize
      **********************************************************************
      File "/Users/gdr/hg.python.org/cpython/Lib/test/test_tokenize.py", line ?, in test.test_tokenize.__test__.doctests
      Failed example:
          for testfile in testfiles:
              if not roundtrip(open(testfile, 'rb')):
                  print("Roundtrip failed for file %s" % testfile)
                  break
          else: True
      Expected:
          True
      Got:
          Roundtrip failed for file /Users/gdr/hg.python.org/cpython/Lib/test/test_platform.py
      **********************************************************************
      1 items had failures:
         1 of  73 in test.test_tokenize.__test__.doctests
      ***Test Failed*** 1 failures.
      test test_tokenize failed -- 1 of 78 doctests failed
      1 test failed:
          test_tokenize

   Examination of the failed tokenization shows that if the source
   contains a backslashed-newline, then in "full" mode
   Untokenize.add_whitespace() fails to leave any space between the
   token before the blackslash-newline and the token afterwards.

   The trouble is that backslash-newlines don't result in a token from
   tokenize(). But in update Untokenize.add_whitespace() I can deduce
   that there must have been a backslash-newline whenever I see the
   first token on a new line (other than a DEDENT or ENDMARKER) and
   the previous token was not a NEWLINE or NL.

4. OK, with problems 1-3 fixed, shouldn't I now actually test the
   documented round-trip property fully? The property that roundtrip()
   currently tests is:

      tokenize(untokenize(tokenize(code))) == tokenize(code)

   which is all that is guaranteed in "compatibility" mode. But the
   docstring for tokenize.untokenize() says:

      Round-trip invariant for full input:
          Untokenized source will match input source exactly

   So in "full" mode it seems that the following is guaranteed:

      untokenize(tokenize(code)) == code

   But if I add a test for this stronger round-trip property, it
   immediately fails. The docstring is not telling the truth!

5. Why doesn't the "full" round-trip property hold? It seems that
   there are (at least) two separate problems:

   i. tokenize() doesn't output a token for a backslash-newline so
      although untokenize() can deduce that there must have been one,
      it can't know what whitespace there was before it.

   ii. In compatibility mode, the first token is always discarded. If
      the tokens came from tokenize.tokenize(), then this is the
      ENCODING token, so the encoding is never applied.

      (Benjamin Peterson had previously discovered this problem but
      just suppressed the failing test case instead of reporting it:
      <http://hg.python.org/cpython/rev/c13abed5d764>)

Looking through this sequence of woes again, it's clear that I can
save problems 4 and 5 for another issue. But I am confident that
problems 1-3 must be fixed together. So although Meador Inge said
above that "the current patch fixes way too many issues", I hope that
if he reads the argument above he will see why I did what I did: this
is the minimum sensible change I know how to make.

I have prepared a revised patch that applies to the current head.
Please review it: I'll do my best to fix all the problems you find.
msg210299 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2014-02-05 10:59
This morning I noticed that I had forgotten to update the library
reference, and I also noticed two more problems to add to the list
above:

6. Although Lib/test/test_tokenize.py looks like it contains tests for
   backslash-newline handling, these tests are ineffective. Here they
   are:

      >>> roundtrip("x=1+\\\\n"
      ...           "1\\n"
      ...           "# This is a comment\\\\n"
      ...           "# This also\\n")
      True
      >>> roundtrip("# Comment \\\\nx = 0")
      True

  There are two problems here: (i) because of the double string
  escaping, these are not backslash-newline, they are backslash-n.
  (ii) the roundtrip() test is too weak to detect this problem:
  tokenize() outputs an ERRORTOKEN for the backslash and untokenize()
  restores it. So the round-trip property is satisfied.

7. Problem 6 shows the difficulty of using doctests for this kind of
   test. It would be easier to ensure the correctness of these tests
   if the docstring was read from a separate file, so that at least
   the tests only need one level of string escaping.

I fixed problem 6 by updating these tests to use dump_tokens() instead
of roundtrip(). I have not fixed problem 7 (like 4 and 5, I can leave
it for another issue). Revised patch attached.
msg210339 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2014-02-05 19:06
Gareth,

Thanks a lot for such a comprehensive writeup and the patch. Please give me a day or two to do the review.
msg210404 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2014-02-06 15:50
I did some research on the cause of this issue. The assertion was
added in this change by Jeremy Hylton in August 2006:
<https://mail.python.org/pipermail/python-checkins/2006-August/055812.html>
(The corresponding Mercurial commit is here:
<http://hg.python.org/cpython/rev/cc992d75d5b3#l217.25>).

At that point I believe the assertion was reasonable. I think it would
have been triggered by backslash-continued lines, but otherwise it
worked.

But in this change <http://hg.python.org/cpython/rev/51e24512e305> in
March 2008 Trent Nelson applied this patch by Michael Foord
<http://bugs.python.org/file9741/tokenize_patch.diff> to implement PEP
263 and fix issue719888. The patch added ENCODING tokens to the output
of tokenize.tokenize(). The ENCODING token is always generated with
row number 0, while the first actual token is generated with row
number 1. So now every token stream from tokenize.tokenize() sets off
the assertion.

The lack of a test case for tokenize.untokenize() in "full" mode meant
that it was (and is) all too easy for someone to accidentally break it
like this.
msg211449 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-17 21:50
New changeset c896d292080a by Terry Jan Reedy in branch '2.7':
Untokenize: An logically incorrect assert tested user input validity.
http://hg.python.org/cpython/rev/c896d292080a

New changeset 51e5a89afb3b by Terry Jan Reedy in branch '3.3':
Untokenize: An logically incorrect assert tested user input validity.
http://hg.python.org/cpython/rev/51e5a89afb3b
msg211470 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-02-18 02:33
The problem of the first iterator pair token being discarded is the subject of #8478. Consider that part of this issue as being closed as a duplicate.

The issue of a string being returned if there is no encoding should have been opened as a separate issue, and it was in #16223. I explain there why I think the behavior should be left as is and the docstring changed, and the doc clarified.
msg211544 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-02-18 20:20
I fixed the assert and dropped first iter compat-mode token bugs one-by-one by writing narrow unittests that fail and code that makes them pass. I am now working on the '\' continuation issue. That is the subject of #9974, which has a nearly identical patch. Where it differs from yours, I will choose on the basis of tests. Any further discussion of this bug should be on that issue.

I appreciate the warning that the full mode is undertested, so I need to be concerned about breaking untested functionality that works. That was not so much a concern with the first two issues.
msg211560 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2014-02-18 21:16
Thanks for your work on this, Terry. I apologise for the complexity of my original report, and will try not to do it again.
msg212042 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-23 23:01
New changeset 8d6dd02a973f by Terry Jan Reedy in branch '3.3':
Issue #20750, Enable roundtrip tests for new 5-tuple untokenize. The
http://hg.python.org/cpython/rev/8d6dd02a973f
msg212059 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-24 04:40
New changeset 0f0e9b7d4f1d by Terry Jan Reedy in branch '2.7':
Issue #9974: When untokenizing, use row info to insert backslash+newline.
http://hg.python.org/cpython/rev/0f0e9b7d4f1d

New changeset 24b4cd5695d9 by Terry Jan Reedy in branch '3.3':
Issue #9974: When untokenizing, use row info to insert backslash+newline.
http://hg.python.org/cpython/rev/24b4cd5695d9
msg266714 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-30 19:09
Is there anything left to do here?
msg266715 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-05-30 19:33
If there are, I can't remember.  This was one or 7 or 8 untokenize issues with about 5 separate bugs between them.  If there are any current untokenize issues not covered by some other open issue, then a new issue should be opened.
History
Date User Action Args
2016-05-30 19:33:24terry.reedysetstatus: open -> closed
resolution: fixed
messages: + msg266715

stage: patch review -> resolved
2016-05-30 19:09:48r.david.murraysetnosy: + r.david.murray
messages: + msg266714
2014-02-24 04:40:41python-devsetmessages: + msg212059
2014-02-23 23:01:28python-devsetmessages: + msg212042
2014-02-18 21:16:23gdr@garethrees.orgsetmessages: + msg211560
2014-02-18 20:20:02terry.reedysetmessages: + msg211544
2014-02-18 02:33:32terry.reedysetmessages: + msg211470
2014-02-17 21:50:20python-devsetnosy: + python-dev
messages: + msg211449
2014-02-17 21:18:03terry.reedysetassignee: docs@python -> terry.reedy
stage: test needed -> patch review

nosy: + terry.reedy
versions: + Python 2.7, Python 3.3
2014-02-06 15:50:03gdr@garethrees.orgsetmessages: + msg210404
2014-02-05 19:06:26yselivanovsetmessages: + msg210339
2014-02-05 18:38:02gdr@garethrees.orgsetfiles: - Issue12691.patch
2014-02-05 11:40:18gdr@garethrees.orgsetassignee: docs@python

components: + Documentation, Tests
nosy: + docs@python
2014-02-05 10:59:04gdr@garethrees.orgsetfiles: + Issue12691.patch

messages: + msg210299
2014-02-05 01:54:40gdr@garethrees.orgsetnosy: + benjamin.peterson
2014-02-05 01:52:52gdr@garethrees.orgsetfiles: + Issue12691.patch

messages: + msg210283
2014-01-31 22:24:27yselivanovsetnosy: + yselivanov

messages: + msg209839
versions: + Python 3.4, - Python 3.2, Python 3.3
2013-02-03 05:07:35meador.ingesetmessages: + msg181243
2013-01-28 10:54:13takluyversetnosy: + takluyver
messages: + msg180844
2012-12-29 05:25:46meador.ingesetnosy: + meador.inge
2012-10-15 21:07:13serhiy.storchakasetnosy: + serhiy.storchaka
2011-08-06 00:17:56gdr@garethrees.orgsetfiles: + Issue12691.patch

messages: + msg141698
2011-08-05 23:04:04gdr@garethrees.orgsetfiles: + Issue12691.patch
keywords: + patch
messages: + msg141694
2011-08-05 20:42:41gdr@garethrees.orgsetmessages: + msg141687
2011-08-05 19:33:13daniel.urbansetnosy: + daniel.urban
2011-08-04 23:27:34eric.snowsetnosy: + eric.snow
2011-08-04 23:22:52sandro.tosisetmessages: + msg141641
2011-08-04 23:01:00gdr@garethrees.orgsetmessages: + msg141638
2011-08-04 22:32:14sandro.tosisetnosy: + sandro.tosi
messages: + msg141636
2011-08-04 22:22:27ezio.melottisetnosy: + ezio.melotti

stage: test needed
2011-08-04 22:21:28gdr@garethrees.orgcreate