Message79102
The function decode_str() (Parser/tokenizer.c) is responsible for
detecting the encoding from the BOM or the cookie ("coding: xxx").
decode_str() also re-encodes the text to utf-8 if the encoding is
not utf-8. I think that we can just skip this function if
the input text is already unicode (utf-8). The attached patch implements
this idea.
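
For illustration only, here is a minimal Python-level sketch of the behaviour being discussed. It is not the patch itself: it uses tokenize.detect_encoding() from the standard library to show BOM/cookie detection, and it assumes a current Python 3 interpreter, where compile() honours the cookie for bytes input and ignores it for str input. The helper names detect() and run() are made up for this example.

```python
import io
import tokenize


def detect(source_bytes):
    """Return the encoding that tokenize detects from a BOM or coding cookie."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


def run(source):
    """Compile and execute source (str or bytes), return the value bound to x."""
    namespace = {}
    exec(compile(source, "<test>", "exec"), namespace)
    return namespace["x"]


# Source text carrying a latin-1 cookie and one non-ASCII character.
text = "# coding: latin-1\nx = '\u00e9'\n"

# Bytes input: the cookie tells the tokenizer how to decode the bytes.
cookie_encoding = detect(text.encode("latin-1"))
from_bytes = run(text.encode("latin-1"))

# str input: the text is already decoded, so the cookie has nothing to say.
from_str = run(text)

# A UTF-8 BOM with no cookie is also recognised.
bom_encoding = detect(b"\xef\xbb\xbfx = 1\n")
```

Both run() calls bind the same one-character string to x, which is the point of skipping the decoding step when the input is already unicode.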
The patch introduces a new compiler flag (PyCF_IGNORE_COOKIE) and a
new parser flag (PyPARSE_IGNORE_COOKIE). The new compiler flag is set
by source_as_string() when the input is a PyUnicode object. "Ignore
cookie" is maybe not the best name for this flag.
With my patch, Brett's first example displays:
$ ./python com2.py
Traceback (most recent call last):
File "com2.py", line 3, in <module>
compile(source, '<test>', 'exec')
File "<test>", line 2
” = '”'
^
SyntaxError: invalid character in identifier
The error cursor is not at the right column (is this bug related to
issue 2382, or introduced by my patch?).
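
To check the reported column without reading the caret by eye, the SyntaxError attributes can be inspected directly. This is a small sketch, not Brett's com2.py (which is not reproduced here); the source string is an assumed stand-in that puts the same invalid character on line 2.

```python
# Assumed stand-in for the test source: U+201D used as an identifier on line 2.
source = "# sample\n\u201d = '\u201d'\n"

try:
    compile(source, "<test>", "exec")
except SyntaxError as exc:
    error = exc
else:
    error = None

if error is not None:
    # lineno is the 1-based line; offset is the column the caret is drawn at.
    print("line:", error.lineno)
    print("offset:", error.offset)
    print("text:", repr(error.text))
```

The offset value is the part to compare between a patched and an unpatched build; it is left unasserted here because it can differ across versions.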
The patch changes the public API: the PyTokenizer_FromString() prototype
now takes a new argument. I don't like changing the public API. The
new argument should be a bit vector (flags) instead of a single bit
(ignore_cookie). We can avoid changing the public API by creating a
new function (e.g. "PyTokenizer_FromUnicode" ;-)).
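
The design choice can be sketched in Python, purely as an analogy to the C API. Every name below (IGNORE_COOKIE, _tokenizer_from, tokenizer_from_string, tokenizer_from_unicode) is hypothetical and the "tokenizer" is just a dict; the point is only that a new entry point plus a shared helper taking a flags bit vector leaves the existing signature untouched.

```python
IGNORE_COOKIE = 0x01  # hypothetical flag bit, not the value used by the patch


def _tokenizer_from(source, flags):
    """Shared helper: flags is a bit vector, so options can be added later."""
    return {
        "source": source,
        "ignore_cookie": bool(flags & IGNORE_COOKIE),
    }


def tokenizer_from_string(source):
    """Existing entry point: same signature as before, cookie is honoured."""
    return _tokenizer_from(source, 0)


def tokenizer_from_unicode(source):
    """New entry point for already-decoded text: cookie is ignored."""
    return _tokenizer_from(source, IGNORE_COOKIE)


old_style = tokenizer_from_string("# coding: latin-1\n")
new_style = tokenizer_from_unicode("# coding: latin-1\n")
```

Existing callers keep calling tokenizer_from_string() with one argument, and only code that knows its input is unicode opts in to the new function.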
There are some old PyPARSE_xxx constants in Include/parsetok.h that
might be removed. The value of PyPARSE_WITH_IS_KEYWORD is 3, which is
strange since flags is a bit vector (set with | and tested with &). But
PyPARSE_WITH_IS_KEYWORD is a dead constant (it sits inside #if
0...#endif).
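
A short sketch of why a value of 3 is odd in a bit vector: 3 is 0b11, so it shares bits with whichever flags hold 1 and 2. FLAG_A and FLAG_B below are placeholder names for those two flags (the actual constant names are not taken from parsetok.h); only the value 3 comes from the message.

```python
FLAG_A = 0x1           # placeholder for the flag whose value is 1
FLAG_B = 0x2           # placeholder for the flag whose value is 2
WITH_IS_KEYWORD = 0x3  # 0b11: overlaps both bits above


def is_set(flags, flag):
    """Usual bit-vector test: non-zero result of & means 'set'."""
    return bool(flags & flag)


# Only FLAG_B is requested...
flags = 0 | FLAG_B

# ...yet the & test also reports WITH_IS_KEYWORD as set.
false_positive = is_set(flags, WITH_IS_KEYWORD)

# Setting WITH_IS_KEYWORD turns on both other flags as a side effect.
side_effect = is_set(0 | WITH_IS_KEYWORD, FLAG_A) and is_set(0 | WITH_IS_KEYWORD, FLAG_B)
```

A well-formed flag would use a single bit that no other flag uses, such as 0x4 or 0x8.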
Date | User | Action | Args
2009-01-05 01:56:19 | vstinner | set | recipients: + vstinner, brett.cannon, sjmachin, amaury.forgeotdarc
2009-01-05 01:56:19 | vstinner | set | messageid: <1231120579.19.0.681538707816.issue4626@psf.upfronthosting.co.za>
2009-01-05 01:56:18 | vstinner | link | issue4626 messages
2009-01-05 01:56:17 | vstinner | create |