Message 71852 - Python tracker


Author	loewis
Recipients	benjamin.peterson, brett.cannon, loewis
Date	2008-08-24.19:19:01
Message-id	<48B1B424.4040406@v.loewis.de>
In-reply-to	<1219601527.4.0.616660160577.issue3574@psf.upfronthosting.co.za>

Content

> As for treating Latin-1 as a raw encoding, how can that be theoretically
> okay if the parser assumes UTF-8 and UTF-8 is not a superset of Latin-1?

The parser doesn't assume UTF-8, but "ascii+": it passes all
non-ASCII bytes on to the AST, which then needs to deal with them.
The AST code could then (but apparently doesn't) take into account
whether the internal representation was UTF-8 or Latin-1; see
ast.c:decode_unicode for some remnants of that.
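
To illustrate why only the non-ASCII bytes are at issue, here is a
minimal Python sketch (not the parser's code; the helper name
decodes_as is made up for this example). ASCII bytes decode the same
way under both encodings, while a byte > 127 is read differently
depending on which encoding is assumed.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is well-formed in the given encoding.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII source text is identical under UTF-8 and Latin-1.
ascii_bytes = b"spam = 1"
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("latin-1")

# The same character is one byte in Latin-1 and two bytes in UTF-8.
latin1_bytes = "\u00e9".encode("latin-1")  # the single byte 0xE9
utf8_bytes = "\u00e9".encode("utf-8")      # the two bytes 0xC3 0xA9

# A lone 0xE9 is not well-formed UTF-8, but any byte string is
# acceptable as Latin-1, so misreading UTF-8 as Latin-1 never
# raises an error; it silently yields different characters.
latin1_is_valid_utf8 = decodes_as(latin1_bytes, "utf-8")
utf8_read_as_latin1 = utf8_bytes.decode("latin-1")
```

So whoever consumes the bytes > 127 has to know which of the two
encodings the internal representation uses.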

The other case (besides string literals) where bytes > 127 matter is
tokenizer.c:verify_identifier; this indeed assumes UTF-8 only (but
could be easily extended to support Latin-1 as well).
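
A rough sketch of what an encoding-aware identifier check might look
like, written in Python rather than C. This is not the code in
tokenizer.c; the function name and the encoding parameter are
assumptions made for illustration, and str.isidentifier() stands in
for whatever rules the tokenizer actually applies.

```python
def looks_like_identifier(raw: bytes, encoding: str = "utf-8") -> bool:
    """Decode raw with the given encoding and test the result.

    Hypothetical helper: returns False if the bytes are not
    well-formed in that encoding or do not form an identifier.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return text.isidentifier()


name_utf8 = "caf\u00e9".encode("utf-8")      # ends in 0xC3 0xA9
name_latin1 = "caf\u00e9".encode("latin-1")  # ends in 0xE9

# Assuming UTF-8 only, the Latin-1 spelling is rejected.
utf8_only_accepts_utf8 = looks_like_identifier(name_utf8)
utf8_only_accepts_latin1 = looks_like_identifier(name_latin1)

# Passing the encoding explicitly accepts it.
latin1_aware = looks_like_identifier(name_latin1, "latin-1")
```

The only change needed for the Latin-1 case in this sketch is telling
the check which encoding the bytes are in.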

The third case where non-ASCII bytes are allowed is comments; there
they are entirely ignored (i.e. it is not even verified that the
comment is well-formed UTF-8).

Removal of the special case should simplify the code; I would agree
that any speedup gained by not going through a codec is irrelevant.
I'm still puzzled why test_imp fails if the special case is removed.

History
Date	User	Action	Args
2008-08-24 19:19:04	loewis	set	recipients: + loewis, brett.cannon, benjamin.peterson
2008-08-24 19:19:03	loewis	link	issue3574 messages
2008-08-24 19:19:01	loewis	create