Message 169385 - Python tracker

Author	loewis
Recipients	alex.hartwig, asvetlov, ezio.melotti, loewis
Date	2012-08-29.14:19:10
Message-id	<1346249952.38.0.762248489122.issue15809@psf.upfronthosting.co.za>

Content
The problem is that IDLE passes a UTF-8 encoded source string to compile, and compile, in the absence of a source encoding declaration, uses the PEP 263 default source encoding, i.e. Latin-1.

As a consequence, the variable s has the value

u'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'
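
To illustrate the mechanism, here is a minimal sketch, written for Python 3 so that it runs today. It starts from the value quoted above and shows that it is exactly what results from decoding UTF-8 bytes as Latin-1. The variable names are illustrative only.

```python
# The value of s quoted in the message: every character here is really one
# byte of UTF-8 data that was decoded as Latin-1.
garbled = ('\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 '
           '\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82')

# Latin-1 maps code points 0-255 to the identical byte values, so encoding
# with it recovers the original bytes exactly.
raw_bytes = garbled.encode('latin-1')

# Decoding those bytes as UTF-8 gives the text that was intended.
intended = raw_bytes.decode('utf-8')

# Going the other way reproduces the reported value.
reproduced = intended.encode('utf-8').decode('latin-1')

print(len(garbled), len(intended))
print(reproduced == garbled)
```

The garbled string is longer than the intended one because each two-byte UTF-8 sequence turned into two separate characters.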

IDLE's "Default Source Encoding" is irrelevant here: it applies only to editor windows.

One possible fix is the attached patch. However, the patch isn't right, since it causes all source to be interpreted as UTF-8. That is wrong when sys.stdin.encoding is not UTF-8 and byte string objects are created in interactive mode.
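
A rough sketch of why the byte string case matters, again in Python 3. It assumes a hypothetical stdin encoding of cp1251, which the message does not mention, and it only compares the bytes that the same text would produce under each encoding; it does not reproduce the Python 2 compiler.

```python
# Sample non-ASCII text (an arbitrary Cyrillic word used for illustration).
text = '\u0442\u0435\u043a\u0441\u0442'

# Bytes a Python 2 byte string literal would hold if the source is taken
# as-is from a terminal using cp1251 (assumed stdin encoding).
terminal_bytes = text.encode('cp1251')

# Bytes the same literal would hold if all source is treated as UTF-8.
utf8_bytes = text.encode('utf-8')

print(terminal_bytes)
print(utf8_bytes)
print(terminal_bytes == utf8_bytes)
```

The two byte strings differ in both content and length, so code that writes such a literal back to the terminal or compares it with terminal input would see different data.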

Interactive mode manages to get it right by looking up sys.stdin.encoding during compilation, but it does so only when in interactive mode (i.e. when tok->prompt != NULL).

I don't see any way to fix this problem in Python 2. It is fixed in Python 3, basically by always assuming that the source encoding is UTF-8, by making all string objects Unicode objects, and by disallowing non-ASCII characters in bytes literals.
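
The Python 3 behaviour described above can be checked with a short sketch. The file name '<example>' and the variable names are placeholders chosen for this illustration.

```python
# Byte source with no encoding declaration: Python 3 assumes UTF-8.
expected = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439'
source = ("s = '" + expected + "'\n").encode('utf-8')

code = compile(source, '<example>', 'exec')
namespace = {}
exec(code, namespace)

# The string literal is a Unicode object holding the intended text.
decoded_correctly = namespace['s'] == expected

# A bytes literal containing a non-ASCII character is rejected at compile time.
try:
    compile("b = b'\u0420'\n", '<example>', 'exec')
except SyntaxError:
    bytes_literal_rejected = True
else:
    bytes_literal_rejected = False

print(decoded_correctly, bytes_literal_rejected)
```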

History
Date	User	Action	Args
2012-08-29 14:19:12	loewis	set	recipients: + loewis, ezio.melotti, asvetlov, alex.hartwig
2012-08-29 14:19:12	loewis	set	messageid: <1346249952.38.0.762248489122.issue15809@psf.upfronthosting.co.za>
2012-08-29 14:19:11	loewis	link	issue15809 messages
2012-08-29 14:19:11	loewis	create