Date 2011-08-04.11:54:03
I'm having a look to see if I can make tokenize.py better match the real tokenizer, but I need some feedback on a couple of design decisions.

First, how to handle tokenization errors? There are three possibilities:

1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after the error. This is what tokenize.py currently does in the two cases where it detects an error.

2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.

3. Raise an exception (IndentationError, SyntaxError, or TabError). This is what the user sees when the parser is invoked from pythonrun.c.

Since the documentation for tokenize.py says, "It is designed to match the working of the Python tokenizer exactly", I think that implementing option (2) is best here. (This will mean changing the behaviour of tokenize.py in the two cases where it currently detects an error, so that it stops tokenizing.)
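To make the difference between options (1) and (2) concrete, here is a minimal sketch of the stop-at-first-error behaviour, written as a wrapper around any stream of token tuples. The name stop_at_first_error is hypothetical and only for illustration; the actual change would be made inside the tokenizing loop itself rather than in a wrapper.

```python
import token


def stop_at_first_error(tokens):
    """Yield tokens up to and including the first ERRORTOKEN, then stop.

    `tokens` is any iterable of tuples whose first element is the token
    type. This hypothetical helper mimics option (2): the error token is
    still reported, but nothing after it is tokenized.
    """
    for tok in tokens:
        yield tok
        if tok[0] == token.ERRORTOKEN:
            return
```

With option (1) the consumer would keep receiving tokens after the ERRORTOKEN; with this sketch the stream simply ends there.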

Second, how to record the cause of the error? The real tokenizer records the cause of the error in the 'done' field of the 'tok_state' structure, but tokenize.py loses this information. I propose to add fields to the TokenInfo structure (which is a namedtuple) to record this information. The real tokenizer uses numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, E_DEDENT etc.), and pythonrun.c converts these to English-language error messages (E_TOODEEP: "too many levels of indentation"). Both of these pieces of information will be useful, so I propose to add two fields: "error" (containing a string like "TOODEEP") and "errormessage" (containing the English-language error message).
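A rough sketch of what the extended structure might look like. This is only an illustration of the proposal: the names TokenInfoWithError, ERROR_MESSAGES and make_error_token are hypothetical, and the message table lists only the three error codes mentioned above, with messages as I understand pythonrun.c to produce them.

```python
import token
from collections import namedtuple

# Hypothetical extension of TokenInfo: the five existing fields plus the
# two proposed ones, "error" and "errormessage".
TokenInfoWithError = namedtuple(
    "TokenInfoWithError",
    ["type", "string", "start", "end", "line", "error", "errormessage"],
)

# Error code (errcode.h name without the "E_" prefix) -> English message.
# Only the codes named in this message are listed; the table is illustrative.
ERROR_MESSAGES = {
    "TOODEEP": "too many levels of indentation",
    "TABSPACE": "inconsistent use of tabs and spaces in indentation",
    "DEDENT": "unindent does not match any outer indentation level",
}


def make_error_token(code, string, start, end, line):
    """Build an ERRORTOKEN carrying both the error code and its message."""
    return TokenInfoWithError(
        token.ERRORTOKEN, string, start, end, line,
        code, ERROR_MESSAGES[code],
    )
```

For ordinary (non-error) tokens, both extra fields would presumably be None.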
