
Classification
Title: Parser: don't transcode input string to UTF-8 if it is already encoded to UTF-8
Type: performance
Stage:
Components:
Versions: Python 3.4

Process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, loewis, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2013-11-07 12:40 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: input_is_utf8.patch
Uploaded: vstinner, 2013-11-07 12:56
Description:
Messages (6)
msg202331 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-07 12:40
The Python parser (Parser/tokenizer.c) has a translate_into_utf8() function that decodes a string from the input encoding and re-encodes it as UTF-8.

This step is unnecessary if the input string is already encoded as UTF-8, which is common nowadays: Linux, Mac OS X and many other operating systems use UTF-8 as the default locale encoding, and UTF-8 is the default encoding for Python scripts. The compile(), eval() and exec() functions also pass UTF-8 encoded strings to the parser.

The attached patch adds an input_is_utf8 flag to the tokenizer to skip translate_into_utf8() if the input string is already encoded as UTF-8.
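
The idea can be pictured with a small Python sketch. This is only an analogy for the C code in Parser/tokenizer.c, not the patch itself: the to_utf8() wrapper and the Python version of translate_into_utf8() are hypothetical, and only the input_is_utf8 flag name comes from the patch description.

```python
def translate_into_utf8(data: bytes, encoding: str) -> bytes:
    """Decode `data` from `encoding`, then re-encode the text as UTF-8."""
    return data.decode(encoding).encode("utf-8")


def to_utf8(data: bytes, encoding: str, input_is_utf8: bool = False) -> bytes:
    """Return UTF-8 bytes, skipping the transcoding step when flagged."""
    if input_is_utf8:
        # Hypothetical fast path: the caller promises the bytes are UTF-8,
        # so the decode/encode round trip is skipped entirely.
        return data
    return translate_into_utf8(data, encoding)


source = "x = 'é'\n"

# Latin-1 input has to be transcoded.
latin1_result = to_utf8(source.encode("latin-1"), "latin-1")

# UTF-8 input can be passed through unchanged.
utf8_result = to_utf8(source.encode("utf-8"), "utf-8", input_is_utf8=True)
```

Both calls should yield the same UTF-8 bytes; the second one simply avoids the round trip. Note that the fast path in this sketch trusts the flag and performs no validation.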
msg202334 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-07 12:56
The patch has an issue: importing test.bad_coding2 (UTF-8 with a BOM) no longer raises a SyntaxError.
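
For context, here is a small sketch of the kind of check that must keep working. It does not reproduce the contents of test.bad_coding2; it assumes a related case, a UTF-8 BOM followed by a coding cookie that names a different encoding, which compile() is expected to reject. The compiles() helper is hypothetical.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark


def compiles(source: bytes) -> bool:
    """Return True if `source` compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# A BOM with a matching coding cookie is accepted.
bom_with_utf8_cookie = compiles(BOM + b"# coding: utf-8\nx = 1\n")

# A BOM with a conflicting coding cookie should be rejected.
bom_with_other_cookie = compiles(BOM + b"# coding: iso-8859-15\nx = 1\n")
```

Any fast path that bypasses the encoding handling would need to preserve this kind of BOM and cookie consistency check.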
msg202339 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-11-07 13:48
The parser should check that the input is actually valid UTF-8 data.
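
As a rough Python illustration of that requirement (the tokenizer itself is C, and the is_valid_utf8() helper here is hypothetical), a strict decode is one way to reject byte strings that are not well-formed UTF-8:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # the default "strict" handler raises on bad bytes
    except UnicodeDecodeError:
        return False
    return True


valid = is_valid_utf8("x = 'é'\n".encode("utf-8"))

# The same text encoded as Latin-1 contains a lone 0xE9 byte,
# which is not a complete UTF-8 sequence.
invalid = is_valid_utf8("x = 'é'\n".encode("latin-1"))
```

Skipping the transcoding step is only safe if a check of this kind still runs somewhere on the input.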
msg202340 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-07 14:03
> The parser should check that the input is actually valid UTF-8 data.

Ah yes, correct. It looks like input data is still checked for valid
UTF-8 data. I suppose that the byte strings should be decoded from
UTF-8 because Python 3 manipulates Unicode strings, not byte strings.

The patch only skips calls to translate_into_utf8(str, tok->encoding);
calls to translate_into_utf8(str, tok->enc) are unchanged (note:
encoding != enc :-)).

But it looks like translate_into_utf8(str, tok->enc) is not called if
tok->enc is NULL.

If tok->encoding is "utf-8" and tok->enc is NULL, maybe the input
string is not decoded from UTF-8. But it sounds strange, because
Python uses Unicode strings.

Don't trust me on this: I would prefer an explanation from Benjamin, who
knows the parser internals better than I do :-)
msg202346 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2013-11-07 15:44
tok->enc and tok->encoding should always have the same value, except that tok->enc gets set earlier.

tok->enc is used when parsing from strings, to remember what codec to use. For file based parsing, the codec object created knows what encoding to use; for string-based parsing, tok->enc stores the encoding.

If the code is to be simplified, unifying the cases of string-based parsing and file-based parsing might be a worthwhile goal.
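
The distinction can be loosely mirrored in Python. This is an analogy only, not the tokenizer's actual structure: both function names are hypothetical, and the `enc` parameter merely stands in for the role of tok->enc. In the file-based path a codec reader object carries the encoding; in the string-based path the encoding name has to be stored and passed along explicitly.

```python
import codecs
import io
from typing import BinaryIO, Optional


def utf8_from_file(stream: BinaryIO, encoding: str) -> bytes:
    """File-based path: the reader object itself knows the encoding."""
    reader = codecs.getreader(encoding)(stream)
    return "".join(reader).encode("utf-8")


def utf8_from_string(data: bytes, enc: Optional[str]) -> bytes:
    """String-based path: `enc` remembers which codec to use."""
    if enc is None:
        # No encoding recorded: the bytes are passed through untouched.
        return data
    return data.decode(enc).encode("utf-8")


text = "a = 'é'\nb = 2\n"
from_file = utf8_from_file(io.BytesIO(text.encode("latin-1")), "latin-1")
from_string = utf8_from_string(text.encode("latin-1"), "latin-1")
```

Both paths should produce identical UTF-8 output for the same input, which is what makes unifying them look plausible.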
msg202700 - Author: STINNER Victor (vstinner) (Python committer) Date: 2013-11-12 15:47
> If the code is to be simplified, unifying the cases of string-based parsing and file-based parsing might be a worthwhile goal.

Ah yes, if the enc and encoding attributes are almost the same, it would be nice to merge them! But I'm not sure I understand: do you prefer to merge them in this issue or in a new issue?
History
Date                 User              Action  Args
2022-04-11 14:57:53  admin             set     github: 63718
2015-10-02 21:09:19  vstinner          set     status: open -> closed; resolution: out of date
2013-11-12 15:47:18  vstinner          set     messages: + msg202700
2013-11-07 15:44:39  loewis            set     nosy: + loewis; messages: + msg202346
2013-11-07 14:03:11  vstinner          set     messages: + msg202340
2013-11-07 13:48:31  serhiy.storchaka  set     messages: + msg202339
2013-11-07 12:56:52  vstinner          set     files: + input_is_utf8.patch; messages: + msg202334
2013-11-07 12:43:31  vstinner          set     files: - input_is_utf8.patch
2013-11-07 12:40:55  vstinner          create