classification
Title: Improve encoding shortcuts in PyUnicode_AsEncodedString()
Type: performance
Components: Unicode
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Nosy List: lemburg, pitrou, vstinner
Priority: normal
Keywords: patch

Created on 2010-06-06 18:23 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: unicode_shortcuts.patch (uploaded by vstinner, 2010-06-06 18:23)
Messages (8)
msg107203 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-06 18:23
PyUnicode_Decode() and PyUnicode_AsEncodedString() call the built-in decoders/encoders directly for some known encodings (e.g. "utf-8") instead of taking the slow path (calling PyCodec_Decode() / PyCodec_Encode()).

PyUnicode_Decode() normalizes the encoding name: it converts the name to lowercase and replaces "_" with "-", as normalizestring() does. PyUnicode_AsEncodedString(), however, doesn't normalize the encoding name; it just uses strcmp(). PyUnicode_Decode() also has a shortcut for ISO-8859-1, whereas PyUnicode_AsEncodedString() doesn't (only for "latin-1").

The attached patch creates a static helper function, normalize_encoding(), uses it in PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for ISO-8859-1 to PyUnicode_AsEncodedString().
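
For illustration only (this is not the attached patch, which is C code): a minimal Python model of the "normalize, then check for a shortcut" idea described above. The table FAST_ENCODERS, the function encode_with_shortcuts() and the exact set of shortcut names are assumptions made for this sketch; only the normalization rule (lowercase, "_" replaced by "-") comes from the report.

```python
import codecs


def normalize_encoding(name):
    """Lowercase the name and replace "_" with "-", as described in the report."""
    return name.lower().replace("_", "-")


# Hypothetical fast-path table; the set of names is an assumption for this sketch.
# Each entry calls a built-in encoder directly instead of doing a codec lookup.
FAST_ENCODERS = {
    "utf-8": codecs.utf_8_encode,
    "latin-1": codecs.latin_1_encode,
    "iso-8859-1": codecs.latin_1_encode,
    "ascii": codecs.ascii_encode,
}


def encode_with_shortcuts(text, encoding, errors="strict"):
    """Return (encoded bytes, path taken), where the path is "fast" or "slow"."""
    encoder = FAST_ENCODERS.get(normalize_encoding(encoding))
    if encoder is not None:
        # The built-in encoders return (bytes, characters consumed).
        return encoder(text, errors)[0], "fast"
    # Slow path: go through the full codec registry.
    return codecs.encode(text, encoding, errors), "slow"


print(encode_with_shortcuts("é", "ISO_8859_1"))
print(encode_with_shortcuts("é", "cp1252"))
```

With the normalization step in front of the table, spelling variants such as "ISO_8859_1" or "UTF_8" reach the same fast path as the canonical names, while any other encoding falls through to the registry.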
msg107280 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-07 21:25
The normalization in PyUnicode_Decode() must have been added to
Python3 only. It is not present in Python2.

I'm not sure whether it's a good idea to extend this further:
the shortcuts were meant for Python internal use only. Python
itself and its stdlib should only use the shortcut names
for the respective special encodings and no variants.

Dealing with variants and normalization is left to the encodings
package and its alias machinery.

Since the Python stdlib and the core already mostly use
the shortcut names, adding normalization won't buy us much.

Note that your change has also made it impossible for the
compiler to do loop unrolling - there's no upper limit
on the size of lower anymore.

In terms of coding style, "static" should go on a separate line.
msg107285 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-07 22:00
> the shortcuts were meant for Python internal use only

str.encode() calls PyUnicode_AsEncodedString() and bytes.decode() calls PyUnicode_Decode(), so these functions are not for internal use only. E.g. "text".encode("ASCII") doesn't use the fast path.
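
A small sketch of the point above, seen from the Python side. Which internal path a given spelling takes is not observable from Python; what can be shown is that all spellings resolve to the same codec and give the same bytes, so the only possible difference is speed. The list of spellings is chosen for illustration.

```python
import codecs

# Several spellings of the same two encodings (illustrative selection).
SPELLINGS = ("ascii", "ASCII", "latin-1", "ISO-8859-1", "iso_8859_1")

results = {}
for name in SPELLINGS:
    info = codecs.lookup(name)           # registry lookup, with alias handling
    results[name] = (info.name, "text".encode(name))
    print(f"{name!r}: codec {info.name!r}, encoded {results[name][1]!r}")
```

To compare the cost of the different spellings on a given interpreter, something like timeit.timeit('"text".encode("ASCII")') against the lowercase name would be one way to measure it; no timings are claimed here.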
msg107286 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-07 22:05
STINNER Victor wrote:
>> the shortcuts were meant for Python internal use only
>
> str.encode() calls PyUnicode_AsEncodedString() and bytes.decode() calls PyUnicode_Decode(), so these functions are not for internal use only. E.g. "text".encode("ASCII") doesn't use the fast path.

Right. As I said: the *shortcuts* are meant for internal use
only. External code should not rely on them, but can, of course,
use those canonical names as well.

Note that these shortcuts bypass the codec registry logic. Codec
search functions cannot redirect these shortcuts to their
own implementations, so we have to be careful about adding more
such shortcuts.
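
To make "codec search function" concrete, here is a minimal sketch using codecs.register(). The codec name "demo-upper" and the function demo_search() are made up for this example. The search function is consulted for a name nobody else knows; it is never asked about "utf-8", although from Python alone one cannot tell whether that is due to a C-level shortcut or simply because the standard encodings search function is registered first.

```python
import codecs

calls = []  # names this search function has been asked about


def demo_search(name):
    """Hypothetical search function that only serves the made-up name 'demo-upper'."""
    calls.append(name)
    # Depending on the Python version, the registry may pass hyphens as underscores.
    if name.replace("_", "-") != "demo-upper":
        return None
    utf8 = codecs.lookup("utf-8")

    def encode(text, errors="strict"):
        data, _ = utf8.encode(text.upper(), errors)
        return data, len(text)

    return codecs.CodecInfo(name="demo-upper", encode=encode, decode=utf8.decode)


codecs.register(demo_search)

print("abc".encode("demo-upper"))   # handled by demo_search
print("abc".encode("utf-8"))        # never reaches demo_search
print(calls)
```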
msg107287 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-07 22:32
> Note that these shortcuts bypass the codec registry logic.

Yes, but that's already the case without my patch. I don't think it's really useful to override latin1, utf-8, utf-16, utf-32 or mbcs. I prefer a faster Python :-)

> we have to be careful about adding more such shortcuts.

I just want to add a shortcut for ISO-8859-1.
msg107452 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-10 10:34
STINNER Victor wrote:
>> Note that these shortcuts bypass the codec registry logic.
>
> Yes, but that's already the case without my patch. I don't think it's really useful to override latin1, utf-8, utf-16, utf-32 or mbcs. I prefer a faster Python :-)

Depends on your use case. E.g. utf-32 is hardly ever used in practice,
and utf-16 is only common on Windows, and then only as utf-16-le.
I'm not sure about mbcs, since that's a meta-codec; in reality, it
will likely be the same as cp1252 most of the time.

I'm OK with ascii, latin1, utf-8 and mbcs (including the additional
normalization, aliasing and case mapping), but not with the others.

>> we have to be careful about adding more such shortcuts.
> 
> I just want to add a shortcut for ISO-8859-1.

Fine, even though that name is really not used much in Python code.
msg107456 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-10 12:02
Committed in 3.2 (r81869), blocked in 3.1 (r81870).

--

Oops, I don't know why I wrote utf-16 and utf-32. I don't want to add them to the shortcuts.
msg107459 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-10 13:48
On Thursday 10 June 2010 at 14:02:34, you wrote:
> Committed in 3.2 (r81869), blocked in 3.1 (r81870).

This commit introduced a regression: ISO-8859-15 was treated as an alias of
ISO-8859-1 because the normalized string was truncated. Fixed in r81871
(blocked in 3.1: r81872).
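
A Python model of how such a truncation can cause a false match. The buffer size BUF and both helper functions are assumptions made for illustration; the actual C code and buffer size are not shown in this thread. The point is only that silently cutting a normalized name to fit a fixed-size buffer can turn "iso-8859-15" into "iso-8859-1", whereas reporting "does not fit" lets the caller fall back to the slow path.

```python
# Assumed buffer size (including the terminating NUL of a C string),
# chosen so that the effect is visible with these two names.
BUF = len("iso-8859-1") + 1


def normalize_truncating(name, size=BUF):
    """Model of a fixed-size buffer that silently drops what does not fit."""
    return name.lower().replace("_", "-")[: size - 1]


def normalize_checked(name, size=BUF):
    """Return None when the name does not fit, so the caller can skip the shortcuts."""
    norm = name.lower().replace("_", "-")
    if len(norm) > size - 1:
        return None
    return norm


print(normalize_truncating("ISO-8859-15"))  # collides with the ISO-8859-1 shortcut
print(normalize_checked("ISO-8859-15"))     # no shortcut; use the registry instead
print(normalize_checked("ISO-8859-1"))
```

The collision matters because the two encodings differ: for example, the euro sign can be encoded with ISO-8859-15 but not with ISO-8859-1.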
History
Date | User | Action | Args
2022-04-11 14:57:01 | admin | set | github: 53168
2010-06-10 13:48:20 | vstinner | set | messages: + msg107459
2010-06-10 12:02:32 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg107456
2010-06-10 10:35:00 | lemburg | set | messages: + msg107452
2010-06-07 22:32:59 | vstinner | set | messages: + msg107287
2010-06-07 22:05:28 | lemburg | set | messages: + msg107286
2010-06-07 22:00:10 | vstinner | set | messages: + msg107285
2010-06-07 21:25:03 | lemburg | set | nosy: + lemburg; messages: + msg107280
2010-06-06 18:23:56 | vstinner | create |