Message 202331 - Python tracker

Author	vstinner
Recipients	benjamin.peterson, serhiy.storchaka, vstinner
Date	2013-11-07.12:40:54
Message-id	<1383828055.35.0.303135915018.issue19519@psf.upfronthosting.co.za>
In-reply-to

Content
The Python parser (Parser/tokenizer.c) has a translate_into_utf8() function that decodes a string from the input encoding and re-encodes it to UTF-8.

This function is unnecessary if the input string is already encoded in UTF-8, which is common nowadays: Linux, Mac OS X and many other operating systems now use UTF-8 as the default locale encoding, UTF-8 is the default encoding for Python scripts, etc. The compile(), eval() and exec() functions pass UTF-8 encoded strings to the parser.

The attached patch adds an input_is_utf8 flag to the tokenizer to skip translate_into_utf8() if the input string is already encoded in UTF-8.
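
To illustrate the idea, here is a rough Python sketch of the transcoding step and of the proposed shortcut. It is not the code from Parser/tokenizer.c or from the patch: the to_utf8() helper and its parameters are hypothetical, and only the input_is_utf8 name is borrowed from the description above.

```python
import codecs


def to_utf8(data: bytes, input_encoding: str, input_is_utf8: bool = False) -> bytes:
    """Return `data` encoded as UTF-8 (hypothetical helper, for illustration only).

    When the caller already knows the bytes are UTF-8, or the declared
    encoding normalizes to UTF-8, the decode/encode round trip is skipped
    and the input is returned unchanged.
    """
    # codecs.lookup() normalizes aliases such as "utf8" or "UTF-8" to "utf-8".
    if input_is_utf8 or codecs.lookup(input_encoding).name == "utf-8":
        # Fast path: no copy, and no validation of the bytes either.
        return data
    # Slow path: decode from the input encoding, then re-encode to UTF-8.
    return data.decode(input_encoding).encode("utf-8")


latin1_source = "x = 'é'".encode("latin-1")
utf8_source = "x = 'é'".encode("utf-8")

# A Latin-1 input has to be transcoded.
converted = to_utf8(latin1_source, "latin-1")

# A UTF-8 input is passed through as the same object.
untouched = to_utf8(utf8_source, "utf-8")
flagged = to_utf8(utf8_source, "latin-1", input_is_utf8=True)
```

In this sketch the fast path trusts the caller: a flag set on bytes that are not valid UTF-8 would let them through unchecked, so the flag only makes sense for callers that build the UTF-8 string themselves.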

History
Date	User	Action	Args
2013-11-07 12:40:55	vstinner	set	recipients: + vstinner, benjamin.peterson, serhiy.storchaka
2013-11-07 12:40:55	vstinner	set	messageid: <1383828055.35.0.303135915018.issue19519@psf.upfronthosting.co.za>
2013-11-07 12:40:55	vstinner	link	issue19519 messages
2013-11-07 12:40:55	vstinner	create