
Recipients benjamin.peterson, eric.snow, ezio.melotti,, r.david.murray, vladris
Date 2011-08-04.11:54:03
I'm having a look to see if I can make tokenize.py better match the real tokenizer, but I need some feedback on a couple of design decisions.

First, how to handle tokenization errors? There are three possibilities:

1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after the error. This is what tokenize.py currently does in the two cases where it detects an error.

2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.

3. Raise an exception (IndentationError, SyntaxError, or TabError). This is what the user sees when the parser is invoked from pythonrun.c.

Since the documentation for tokenize.py says, "It is designed to match the working of the Python tokenizer exactly", I think that implementing option (2) is best here. (This will mean changing the behaviour of tokenize.py in the two cases where it currently detects an error, so that it stops tokenizing.)
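
To make option (2) concrete, here is a minimal sketch of the "emit ERRORTOKEN, then stop" behaviour, written as a wrapper around any token stream. The name stop_at_first_error and the hand-built stream are hypothetical and only for illustration; this is not how tokenize.py is implemented.

```python
import token
import tokenize


def stop_at_first_error(tokens):
    """Yield tokens up to and including the first ERRORTOKEN, then stop.

    Hypothetical helper: it mimics option (2) on top of an existing
    token stream rather than changing the tokenizer itself.
    """
    for tok in tokens:
        yield tok
        if tok.type == token.ERRORTOKEN:
            return


# Hand-built stream standing in for tokenizer output on the line "x $ y".
# Under option (1) all three tokens would be produced; under option (2)
# the stream ends at the ERRORTOKEN.
line = "x $ y\n"
stream = [
    tokenize.TokenInfo(token.NAME, "x", (1, 0), (1, 1), line),
    tokenize.TokenInfo(token.ERRORTOKEN, "$", (1, 2), (1, 3), line),
    tokenize.TokenInfo(token.NAME, "y", (1, 4), (1, 5), line),
]

result = list(stop_at_first_error(stream))
```

The trailing NAME token for "y" is never yielded, which is the visible difference from the resynchronizing behaviour of option (1).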

Second, how to record the cause of the error? The real tokenizer records the cause of the error in the 'done' field of the 'tok_state' structure, but tokenize.py loses this information. I propose to add fields to the TokenInfo structure (which is a namedtuple) to record it. The real tokenizer uses numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, E_DEDENT, etc.), and pythonrun.c converts these to English-language error messages (E_TOODEEP: "too many levels of indentation"). Both of these pieces of information will be useful, so I propose to add two fields: "error" (containing a string like "TOODEEP") and "errormessage" (containing the English-language error message).
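
A rough sketch of what the two extra fields might look like follows. ErrorTokenInfo, ERROR_MESSAGES and make_error_token are hypothetical names for illustration. Only the TOODEEP message is taken from the text above; the other two are what I believe pythonrun.c uses and should be checked against the source.

```python
import collections
import token

# Hypothetical extension of TokenInfo: the usual five fields plus the two
# proposed ones, which default to None for tokens that are not errors.
ErrorTokenInfo = collections.namedtuple(
    "ErrorTokenInfo",
    ["type", "string", "start", "end", "line", "error", "errormessage"],
    defaults=(None, None),
)

# Error names are the errcode.h constants with the "E_" prefix dropped.
# Messages other than TOODEEP are assumptions, to be verified.
ERROR_MESSAGES = {
    "TOODEEP": "too many levels of indentation",
    "TABSPACE": "inconsistent use of tabs and spaces in indentation",
    "DEDENT": "unindent does not match any outer indentation level",
}


def make_error_token(string, start, end, line, error):
    """Build an ERRORTOKEN carrying both the error name and its message."""
    return ErrorTokenInfo(
        token.ERRORTOKEN, string, start, end, line,
        error, ERROR_MESSAGES[error],
    )


bad = make_error_token("", (3, 0), (3, 0), "  y\n", "DEDENT")
good = ErrorTokenInfo(token.NAME, "x", (1, 0), (1, 1), "x\n")
```

Appending the new fields with defaults keeps positional unpacking of the first five fields working, while callers that care about errors can read tok.error and tok.errormessage by name.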
Date                 User                Action  Args
2011-08-04 11:54:05  gdr@garethrees.org  set     recipients: +, benjamin.peterson, ezio.melotti, r.david.murray, eric.snow, vladris
2011-08-04 11:54:05  gdr@garethrees.org  set     messageid: <>
2011-08-04 11:54:04  gdr@garethrees.org  link    issue12675 messages
2011-08-04 11:54:03  gdr@garethrees.org  create