
Author gdr@garethrees.org
Recipients gdr@garethrees.org
Date 2011-08-01.12:58:38
Message-id <1312203519.76.0.918580877349.issue12675@psf.upfronthosting.co.za>
In-reply-to
Content
The tokenize module is happy to tokenize Python source code that the real tokenizer would reject. Pretty much any instance where tokenizer.c returns ERRORTOKEN will illustrate this feature. Here are some examples:

    Python 3.3.0a0 (default:2d69900c0820, Aug  1 2011, 13:46:51) 
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from tokenize import generate_tokens
    >>> from io import StringIO
    >>> def tokens(s):
    ...    """Return a string showing the tokens in the string s."""
    ...    return '|'.join(t[1] for t in generate_tokens(StringIO(s).readline))
    ...
    >>> # Bad exponent
    >>> print(tokens('1if 2else 3'))
    1|if|2|else|3|
    >>> 1if 2else 3
      File "<stdin>", line 1
        1if 2else 3
             ^
    SyntaxError: invalid token
    >>> # Bad hexadecimal constant.
    >>> print(tokens('0xfg'))
    0xf|g|
    >>> 0xfg
      File "<stdin>", line 1
        0xfg
           ^
    SyntaxError: invalid syntax
    >>> # Missing newline after continuation character.
    >>> print(tokens('\\pass'))
    \|pass|
    >>> \pass 
      File "<stdin>", line 1
        \pass
            ^
    SyntaxError: unexpected character after line continuation character

It is surprising that the tokenize module does not yield the same tokens as Python itself, but as this limitation only affects incorrect Python code, perhaps it just needs a mention in the tokenize documentation. Something along the lines of, "The tokenize module generates the same tokens as Python's own tokenizer if it is given correct Python code. However, it may incorrectly tokenize Python code containing syntax errors that the real tokenizer would reject."
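For anyone who wants to reproduce the mismatch without retyping the session above, here is a minimal sketch (the check_tokens helper is my own illustration, not part of the stdlib or of this report) that prints the token names tokenize assigns to a string and then asks the real tokenizer/parser, via compile(), whether it accepts the same source:

    import tokenize
    from io import StringIO

    def check_tokens(source):
        """Show tokenize's view of source, then the real parser's verdict."""
        # tokenize.generate_tokens happily splits the string into tokens...
        toks = list(tokenize.generate_tokens(StringIO(source).readline))
        print('tokenize :', '|'.join(tokenize.tok_name[t[0]] for t in toks))
        # ...while compile() runs the real C tokenizer and parser.
        try:
            compile(source, '<test>', 'exec')
            print('compile  : accepted')
        except SyntaxError as err:
            print('compile  : rejected --', err.msg)

    # e.g. tokenize reports NUMBER|NAME|...|ENDMARKER, but compile() raises SyntaxError
    check_tokens('0xfg')

Running this on each of the examples above shows tokenize producing a plausible-looking token stream while compile() raises SyntaxError, which is exactly the discrepancy the proposed documentation note would describe.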
History
Date                 User                Action  Args
2011-08-01 12:58:39  gdr@garethrees.org  set     recipients: + gdr@garethrees.org
2011-08-01 12:58:39  gdr@garethrees.org  set     messageid: <1312203519.76.0.918580877349.issue12675@psf.upfronthosting.co.za>
2011-08-01 12:58:39  gdr@garethrees.org  link    issue12675 messages
2011-08-01 12:58:38  gdr@garethrees.org  create