Issue12675
Created on 2011-08-01 12:58 by Gareth.Rees, last changed 2011-10-22 19:52 by flox.
| Messages (11) | |||
|---|---|---|---|
| msg141503 - (view) | Author: Gareth Rees (Gareth.Rees) | Date: 2011-08-01 12:58 | |
The tokenize module is happy to tokenize Python source code that the real tokenizer would reject. Pretty much any instance where tokenizer.c returns ERRORTOKEN will illustrate this feature. Here are some examples:
Python 3.3.0a0 (default:2d69900c0820, Aug 1 2011, 13:46:51)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import generate_tokens
>>> from io import StringIO
>>> def tokens(s):
... """Return a string showing the tokens in the string s."""
... return '|'.join(t[1] for t in generate_tokens(StringIO(s).readline))
...
>>> # Bad exponent
>>> print(tokens('1if 2else 3'))
1|if|2|else|3|
>>> 1if 2else 3
File "<stdin>", line 1
1if 2else 3
^
SyntaxError: invalid token
>>> # Bad hexadecimal constant.
>>> print(tokens('0xfg'))
0xf|g|
>>> 0xfg
File "<stdin>", line 1
0xfg
^
SyntaxError: invalid syntax
>>> # Missing newline after continuation character.
>>> print(tokens('\\pass'))
\|pass|
>>> \pass
File "<stdin>", line 1
\pass
^
SyntaxError: unexpected character after line continuation character
It is surprising that the tokenize module does not yield the same tokens as Python itself, but as this limitation only affects incorrect Python code, perhaps it just needs a mention in the tokenize documentation. Something along the lines of, "The tokenize module generates the same tokens as Python's own tokenizer if it is given correct Python code. However, it may incorrectly tokenize Python code containing syntax errors that the real tokenizer would reject."
|
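The mismatch reported above can also be checked programmatically. The following is only a minimal sketch and not part of the original report: `compiler_rejects` and `tokenize_reports_error` are hypothetical helper names, and the result of the second helper for invalid input is likely to depend on the Python version, since the tokenize module has changed since this issue was filed.

```python
import io
import token
import tokenize


def compiler_rejects(source):
    """Return True if compile() raises SyntaxError for the source."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return True
    return False


def tokenize_reports_error(source):
    """Return True if tokenize yields an ERRORTOKEN or raises for the source."""
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == token.ERRORTOKEN:
                return True
    except (tokenize.TokenError, SyntaxError):
        # Some Python versions raise instead of yielding ERRORTOKEN.
        return True
    return False


# Print both verdicts side by side; a disagreement on invalid input
# is the behaviour described in this issue.
for sample in ("0xfg", "x = 1\n"):
    print(repr(sample), compiler_rejects(sample), tokenize_reports_error(sample))
```

A source string for which the first helper returns True and the second returns False is one where tokenize.py accepts what the compiler rejects.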
|||
| msg141507 - (view) | Author: R. David Murray (r.david.murray) | Date: 2011-08-01 13:28 | |
I'm not familiar with the parser internals (I'm nosying someone who is), but I suspect what you are seeing at the command line is the errors being caught at a stage later than the tokenizer. |
|||
| msg141513 - (view) | Author: Gareth Rees (Gareth.Rees) | Date: 2011-08-01 13:40 | |
These errors are generated directly by the tokenizer. In tokenizer.c, the tokenizer generates ERRORTOKEN when it encounters something it can't tokenize. This causes parsetok() in parsetok.c to stop tokenizing and return an error. |
|||
| msg141521 - (view) | Author: Benjamin Peterson (benjamin.peterson) | Date: 2011-08-01 14:59 | |
This should probably be fixed (patches welcome). However, note that even with valid Python code, the tokens are not the same. |
|||
| msg141543 - (view) | Author: Vlad Riscutia (vladris) | Date: 2011-08-02 02:45 | |
How come the tokenize module is not based on the actual C tokenizer? Wouldn't that make more sense (and prevent this kind of issue)? |
|||
| msg141545 - (view) | Author: Benjamin Peterson (benjamin.peterson) | Date: 2011-08-02 04:42 | |
tokenize has useful features that the builtin tokenizer does not possess such as the NL token. |
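As an illustration of the NL token mentioned above, here is a small sketch (the sample source string is made up for this example). tokenize reports NL for newlines that do not end a logical line, such as a newline inside parentheses or on a blank line, and NEWLINE for those that do.

```python
import io
import tokenize

# Sample source: a newline inside parentheses, a blank line,
# and two logical lines.
source = "x = (1,\n     2)\n\ny = 3\n"

names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(names)
```

The printed list should contain NL entries alongside the NEWLINE entries that end the two logical lines.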
|||
| msg141624 - (view) | Author: Gareth Rees (Gareth.Rees) | Date: 2011-08-04 11:54 | |
I'm having a look to see if I can make tokenize.py better match the real tokenizer, but I need some feedback on a couple of design decisions.
First, how to handle tokenization errors? There are three possibilities:
1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after the error. This is what tokenize.py currently does in the two cases where it detects an error.
2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.
3. Raise an exception (IndentationError, SyntaxError, or TabError). This is what the user sees when the parser is invoked from pythonrun.c.
Since the documentation for tokenize.py says, "It is designed to match the working of the Python tokenizer exactly", I think that implementing option (2) is best here. (This will mean changing the behaviour of tokenize.py in the two cases where it currently detects an error, so that it stops tokenizing.)
Second, how to record the cause of the error? The real tokenizer records the cause of the error in the 'done' field of the 'tok_state' structure, but tokenize.py loses this information. I propose to add fields to the TokenInfo structure (which is a namedtuple) to record this information. The real tokenizer uses numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, E_DEDENT, etc.), and pythonrun.c converts these to English-language error messages (E_TOODEEP: "too many levels of indentation"). Both of these pieces of information will be useful, so I propose to add two fields "error" (containing a string like "TOODEEP") and "errormessage" (containing the English-language error message).
|
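The three error-handling options can be sketched as wrappers around a token stream. This is an illustration only, not the proposed patch: `stop_at_first_error` and `raise_on_error` are hypothetical names, and the token stream is built by hand so that the example does not depend on how any particular Python version tokenizes invalid input.

```python
import token
from tokenize import TokenInfo


def stop_at_first_error(tokens):
    """Option 2: yield tokens up to and including the first ERRORTOKEN."""
    for tok in tokens:
        yield tok
        if tok.type == token.ERRORTOKEN:
            return


def raise_on_error(tokens):
    """Option 3: raise SyntaxError as soon as an ERRORTOKEN is seen."""
    for tok in tokens:
        if tok.type == token.ERRORTOKEN:
            raise SyntaxError("invalid token %r at %r" % (tok.string, tok.start))
        yield tok


# Hand-built stream standing in for tokenizer output (illustrative only).
line = "1 $ 2\n"
stream = [
    TokenInfo(token.NUMBER, "1", (1, 0), (1, 1), line),
    TokenInfo(token.ERRORTOKEN, "$", (1, 2), (1, 3), line),
    TokenInfo(token.NUMBER, "2", (1, 4), (1, 5), line),
]

# Option 1 corresponds to consuming the stream unchanged.
print([t.string for t in stream])
# Option 2 truncates the stream at the error.
print([t.string for t in stop_at_first_error(stream)])
```

Option 1 keeps every token after the error, which is what consumers that rewrite source text would need; options 2 and 3 discard them.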
|||
| msg141625 - (view) | Author: Gareth Rees (Gareth.Rees) | Date: 2011-08-04 12:21 | |
Having looked at some of the consumers of the tokenize module, I don't think my proposed solutions will work. It seems to be the case that the resynchronization behaviour of tokenize.py is important for consumers that are using it to transform arbitrary Python source code (like 2to3.py). These consumers are relying on the "roundtrip" property that X == untokenize(tokenize(X)). So solution (1) is necessary for the handling of tokenization errors. Also, the fact that TokenInfo is a 5-tuple is relied on in some places (e.g. lib2to3/patcomp.py line 38), so it can't be extended. And there are consumers (though none in the standard library) that are relying on type=ERRORTOKEN being the way to detect errors in a tokenization stream. So I can't overload that field of the structure. Any good ideas for how to record the cause of error without breaking backwards compatibility? |
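The "roundtrip" property can be exercised with a short sketch like the one below. The helper names `roundtrip` and `token_pairs` are made up for this example. Note that the tokenize documentation only guarantees that token types and strings survive the round trip; with full 5-tuples the text of simple inputs is usually reproduced exactly, but that is not assumed here.

```python
import io
import tokenize


def roundtrip(source):
    """Tokenize source and rebuild text from the full 5-tuples."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    return tokenize.untokenize(tokens)


def token_pairs(source):
    """Return (type, string) pairs, the part the documentation guarantees."""
    return [
        (tok.type, tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    ]


source = "x = 1 + 2\nprint(x)\n"
rebuilt = roundtrip(source)
print(rebuilt == source)                            # exact text match, if any
print(token_pairs(rebuilt) == token_pairs(source))  # documented guarantee
```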
|||
| msg141626 - (view) | Author: Gareth Rees (Gareth.Rees) | Date: 2011-08-04 12:26 | |
Ah ... TokenInfo is a *subclass* of namedtuple, so I can add extra properties to it without breaking consumers that expect it to be a 5-tuple. |
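A minimal sketch of that idea follows, using a hypothetical `AnnotatedToken` class rather than the real TokenInfo. The extra `error` attribute and its value "TOODEEP" are illustrative only. Because the attribute lives outside the tuple, code that unpacks or measures the token still sees five items.

```python
import collections
import token


class AnnotatedToken(
    collections.namedtuple("AnnotatedToken", "type string start end line")
):
    """Hypothetical token type: a 5-tuple with an extra 'error' attribute."""

    error = None  # class-level default, used when no error is recorded

    def __new__(cls, type, string, start, end, line, error=None):
        self = super().__new__(cls, type, string, start, end, line)
        self.error = error  # stored in the instance __dict__, not in the tuple
        return self


tok = AnnotatedToken(
    token.ERRORTOKEN, "$", (1, 2), (1, 3), "1 $ 2\n", error="TOODEEP"
)
type_, string, start, end, line = tok  # still unpacks as a 5-tuple
print(len(tok), tok.error)
```

One caveat of this approach is that tuple-level operations such as `_replace` or `_make` build a new object without the extra attribute, so the annotation falls back to the class-level default.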
|||
| msg141680 - (view) | Author: Terry J. Reedy (terry.reedy) | Date: 2011-08-05 19:53 | |
I have not used tokenize, but if it is *not* intended to exactly reproduce the internal tokenizer behavior, the claim that it is should be amended. |
|||
| msg141690 - (view) | Author: Gareth Rees (Gareth.Rees) | Date: 2011-08-05 21:11 | |
Terry: agreed. Does anyone actually use this module? Does anyone know what the design goals are for tokenize? If someone can tell me, I'll do my best to make it meet them.
Meanwhile, here's another bug. Each character of trailing whitespace is tokenized as an ERRORTOKEN.
Python 3.3.0a0 (default:c099ba0a278e, Aug 2 2011, 12:35:03)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import tokenize,untokenize
>>> from io import BytesIO
>>> list(tokenize(BytesIO('1 '.encode('utf8')).readline))
[TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''), TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), line='1 '), TokenInfo(type=54 (ERRORTOKEN), string=' ', start=(1, 1), end=(1, 2), line='1 '), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2011-10-22 19:52:56 | flox | set | nosy: + flox |
| 2011-10-22 19:15:57 | meador.inge | set | nosy: + meador.inge |
| 2011-08-05 21:25:12 | sandro.tosi | set | nosy: + sandro.tosi |
| 2011-08-05 21:11:50 | Gareth.Rees | set | messages: + msg141690 |
| 2011-08-05 19:53:34 | terry.reedy | set | nosy: + terry.reedy; messages: + msg141680 |
| 2011-08-05 19:33:28 | daniel.urban | set | nosy: + daniel.urban |
| 2011-08-04 12:26:06 | Gareth.Rees | set | messages: + msg141626 |
| 2011-08-04 12:21:05 | Gareth.Rees | set | messages: + msg141625 |
| 2011-08-04 11:54:04 | Gareth.Rees | set | messages: + msg141624 |
| 2011-08-02 04:42:17 | benjamin.peterson | set | messages: + msg141545 |
| 2011-08-02 02:45:32 | vladris | set | nosy: + vladris; messages: + msg141543 |
| 2011-08-01 16:30:17 | eric.snow | set | nosy: + eric.snow |
| 2011-08-01 15:11:46 | ezio.melotti | set | stage: test needed |
| 2011-08-01 15:11:32 | ezio.melotti | set | nosy: + ezio.melotti; versions: + Python 2.7, Python 3.2 |
| 2011-08-01 14:59:02 | benjamin.peterson | set | messages: + msg141521 |
| 2011-08-01 13:40:41 | Gareth.Rees | set | messages: + msg141513 |
| 2011-08-01 13:28:08 | r.david.murray | set | nosy: + r.david.murray, benjamin.peterson; messages: + msg141507 |
| 2011-08-01 12:58:39 | Gareth.Rees | create | |
