
Author vstinner
Recipients benjamin.peterson, serhiy.storchaka, vstinner
Date 2013-11-07.12:40:54
Message-id <1383828055.35.0.303135915018.issue19519@psf.upfronthosting.co.za>
In-reply-to
Content
The Python parser (Parser/tokenizer.c) has a translate_into_utf8() function that decodes a string from the input encoding and re-encodes it as UTF-8.

This function is unnecessary when the input string is already encoded in UTF-8, which is common nowadays: Linux, Mac OS X and many other operating systems now use UTF-8 as the default locale encoding, and UTF-8 is the default encoding for Python scripts. The compile(), eval() and exec() functions also pass UTF-8 encoded strings to the parser.

The attached patch adds an input_is_utf8 flag to the tokenizer so that translate_into_utf8() is skipped when the input string is already encoded in UTF-8.
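
To illustrate the idea, here is a minimal Python sketch of the fast path. It is not the patch itself: the real code is C in Parser/tokenizer.c, and prepare_source() is a hypothetical helper invented for this example. Only the names translate_into_utf8 and input_is_utf8 come from the description above, and their Python signatures here are assumptions.

```python
def translate_into_utf8(data: bytes, encoding: str) -> bytes:
    """Decode from the input encoding, then re-encode as UTF-8.

    Python stand-in for the C function of the same name; the
    signature is an assumption made for this sketch.
    """
    return data.decode(encoding).encode("utf-8")


def prepare_source(data: bytes, encoding: str, input_is_utf8: bool = False) -> bytes:
    """Hypothetical helper: return source bytes ready for the parser."""
    if input_is_utf8:
        # Fast path: the caller promises the bytes are already UTF-8,
        # so the decode/encode round trip is skipped entirely.
        return data
    return translate_into_utf8(data, encoding)


latin1_source = "s = 'é'".encode("latin-1")
utf8_source = "s = 'é'".encode("utf-8")

# Slow path: Latin-1 input has to be converted.
converted = prepare_source(latin1_source, "latin-1")

# Fast path: UTF-8 input is returned untouched.
skipped = prepare_source(utf8_source, "utf-8", input_is_utf8=True)
```

In this sketch the fast path returns the very same bytes object, so it costs nothing beyond the flag check. It also performs no validation, so the flag is only safe when the caller is certain the input is valid UTF-8.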
History
Date                 User      Action  Args
2013-11-07 12:40:55  vstinner  set     recipients: + vstinner, benjamin.peterson, serhiy.storchaka
2013-11-07 12:40:55  vstinner  set     messageid: <1383828055.35.0.303135915018.issue19519@psf.upfronthosting.co.za>
2013-11-07 12:40:55  vstinner  link    issue19519 messages
2013-11-07 12:40:55  vstinner  create