Message 79102 - Python tracker


Author	vstinner
Recipients	amaury.forgeotdarc, brett.cannon, sjmachin, vstinner
Date	2009-01-05.01:56:16
Message-id	<1231120579.19.0.681538707816.issue4626@psf.upfronthosting.co.za>
In-reply-to

Content

The function decode_str() (Parser/tokenizer.c) is responsible for 
detecting the encoding using the BOM or the cookie ("coding: xxx"). 
decode_str() also re-encodes the text to UTF-8 if the encoding is 
not UTF-8. I think that we can just skip this function if the input 
text is already Unicode (UTF-8). The attached patch implements this 
idea.
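
For readers unfamiliar with the BOM/cookie detection described above, the 
same rules are exposed at the Python level by tokenize.detect_encoding(). 
The sketch below is only an illustration of those rules using the standard 
library; it is not the C code of decode_str(), and the helper name detect() 
is made up for this example.

```python
import io
import tokenize


def detect(source_bytes):
    """Return the encoding name reported by tokenize.detect_encoding().

    detect() is a hypothetical helper for this illustration only.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _consumed_lines = tokenize.detect_encoding(readline)
    return encoding


# A "coding:" cookie on the first line selects the encoding.
with_cookie = detect(b"# -*- coding: latin-1 -*-\nx = 1\n")
# A UTF-8 BOM is reported as "utf-8-sig".
with_bom = detect(b"\xef\xbb\xbfx = 1\n")
# Neither a BOM nor a cookie: the default is UTF-8.
plain = detect(b"x = 1\n")

print(with_cookie, with_bom, plain)
```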

The patch introduces a new compiler flag (PyCF_IGNORE_COOKIE) and a 
new parser flag (PyPARSE_IGNORE_COOKIE). The new compiler flag is set 
by source_as_string() when the input is a PyUnicode object. "Ignore 
cookie" may not be the best name for this flag.
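
The intended effect (the cookie is ignored when the source is already 
unicode) can be observed from Python in current CPython 3 releases. The 
sketch below runs on today's interpreter and is assumed, not confirmed by 
this message, to match what the patch does; run() is a hypothetical helper.

```python
# Source text with a misleading cookie: it claims latin-1, but the text is
# a str (and is encoded as UTF-8 below for the bytes case).
SOURCE = "# coding: latin-1\nx = '\u00e9'\n"


def run(source):
    """Compile and execute source, then return the value bound to x."""
    namespace = {}
    exec(compile(source, "<test>", "exec"), namespace)
    return namespace["x"]


# str input: the cookie is ignored, the literal keeps its single character.
from_str = run(SOURCE)

# bytes input: the cookie is honoured, so the two UTF-8 bytes of the
# character are decoded as two latin-1 characters.
from_bytes = run(SOURCE.encode("utf-8"))

print(ascii(from_str), ascii(from_bytes))
```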

With my patch, Brett's first example displays:
   $ ./python com2.py
   Traceback (most recent call last):
     File "com2.py", line 3, in <module>
       compile(source, '<test>', 'exec')
     File "<test>", line 2
       ” = '”'
         ^
   SyntaxError: invalid character in identifier

The error cursor is not at the right column (is this a bug related to 
issue 2382, or was it introduced by my patch?).

The patch changes the public API: the prototype of 
PyTokenizer_FromString() now takes a new argument. I don't like 
changing the public API. The new argument should be a bit vector 
(flags) instead of a single bit (ignore_cookie). We can avoid changing 
the public API by creating a new function (e.g. 
"PyTokenizer_FromUnicode" ;-)).

There are some old PyPARSE_xxx constants in Include/parsetok.h that 
could be removed. The value of PyPARSE_WITH_IS_KEYWORD is 3, which is 
strange since flags is a bit vector (set with | and tested with &). 
But PyPARSE_WITH_IS_KEYWORD is a dead constant (defined inside #if 
0...#endif).
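
To show why a value of 3 is odd for a flag in a bit vector, here is a small 
sketch. The names FLAG_A, FLAG_B and FLAG_OVERLAP and their values are 
hypothetical and are not the real PyPARSE_xxx definitions; only the value 3 
for the odd flag is taken from the remark above.

```python
# Hypothetical flags: well-behaved flags each own a single bit.
FLAG_A = 0x1
FLAG_B = 0x2
# A flag whose value is 3 shares both bits with FLAG_A and FLAG_B.
FLAG_OVERLAP = 0x3


def is_set(flags, flag):
    """Test a flag the usual way: set with |, tested with &."""
    return bool(flags & flag)


flags = 0
flags |= FLAG_A  # only FLAG_A is ever set

# The test for FLAG_OVERLAP succeeds although it was never set, because
# it shares bit 0 with FLAG_A.
overlap_reported = is_set(flags, FLAG_OVERLAP)
print(overlap_reported)
```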

History
Date	User	Action	Args
2009-01-05 01:56:19	vstinner	set	recipients: + vstinner, brett.cannon, sjmachin, amaury.forgeotdarc
2009-01-05 01:56:19	vstinner	set	messageid: <1231120579.19.0.681538707816.issue4626@psf.upfronthosting.co.za>
2009-01-05 01:56:18	vstinner	link	issue4626 messages
2009-01-05 01:56:17	vstinner	create