Issue 8922: Improve encoding shortcuts in PyUnicode_AsEncodedString()



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53168

classification

Title:	Improve encoding shortcuts in PyUnicode_AsEncodedString()
Type:	performance	Stage:
Components:	Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	lemburg, pitrou, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-06-06 18:23 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
unicode_shortcuts.patch	vstinner, 2010-06-06 18:23

Messages (8)
msg107203	Author: STINNER Victor (vstinner) *	Date: 2010-06-06 18:23
PyUnicode_Decode() and PyUnicode_AsEncodedString() directly call the built-in decoders/encoders for some known encodings (e.g. "utf-8") instead of taking the slow path (calling PyCodec_Decode() / PyCodec_Encode()).
PyUnicode_Decode() normalizes the encoding name: it converts the name to lower case and replaces "_" with "-", as normalizestring() does. But PyUnicode_AsEncodedString() doesn't normalize the encoding name; it just uses strcmp(). PyUnicode_Decode() has a shortcut for ISO-8859-1, whereas PyUnicode_AsEncodedString() doesn't (only for "latin-1").
The attached patch creates a static helper function, normalize_encoding(), uses it in PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for ISO-8859-1 to PyUnicode_AsEncodedString().
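The difference between the two lookups can be modelled in a few lines of Python. This is only a sketch of the idea, not the C code from the patch: normalize_encoding() here is a Python stand-in for the C helper named above, and SHORTCUT_NAMES and takes_fast_path() are hypothetical names introduced for illustration, with the set of shortcut names assumed from the encodings mentioned in this thread.

```python
def normalize_encoding(name):
    """Lower-case the name and replace "_" with "-", as described above."""
    return name.lower().replace("_", "-")


# Illustrative set only (an assumption): names this thread mentions as
# shortcut candidates, not the actual list hard-coded in CPython.
SHORTCUT_NAMES = {"utf-8", "ascii", "latin-1", "iso-8859-1"}


def takes_fast_path(encoding, normalize):
    """Return True if `encoding` would match one of the shortcut names.

    normalize=False models a bare strcmp() against the canonical names;
    normalize=True models normalizing the name first, as the patch proposes.
    """
    key = normalize_encoding(encoding) if normalize else encoding
    return key in SHORTCUT_NAMES


# Compare the two strategies for a few spellings of the same encodings.
for name in ("utf-8", "UTF_8", "ASCII", "ISO-8859-1"):
    print(name,
          takes_fast_path(name, normalize=False),
          takes_fast_path(name, normalize=True))
```

With exact comparison, only the canonical spelling matches; with normalization, case and "_"/"-" variants match as well.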
msg107280	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-07 21:25
STINNER Victor wrote:
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
> PyUnicode_Decode() and PyUnicode_AsEncodedString() directly call the built-in decoders/encoders for some known encodings (e.g. "utf-8") instead of taking the slow path (calling PyCodec_Decode() / PyCodec_Encode()).
> PyUnicode_Decode() normalizes the encoding name: it converts the name to lower case and replaces "_" with "-", as normalizestring() does. But PyUnicode_AsEncodedString() doesn't normalize the encoding name; it just uses strcmp(). PyUnicode_Decode() has a shortcut for ISO-8859-1, whereas PyUnicode_AsEncodedString() doesn't (only for "latin-1").
> The attached patch creates a static helper function, normalize_encoding(), uses it in PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for ISO-8859-1 to PyUnicode_AsEncodedString().
The normalization in PyUnicode_Decode() must have been added to Python3 only. It is not present in Python2.
I'm not sure whether it's a good idea to extend this further: the shortcuts were meant for Python internal use only. Python itself and its stdlib should only use the shortcut names for the respective special encodings and no variants. Dealing with variants and normalization is left to the encodings package and its alias machinery.
Since the Python stdlib and the core already mostly use the shortcut names, adding normalization won't buy us much.
Note that your change has also made it impossible for the compiler to do loop unrolling - there's no upper limit on the size of lower anymore.
In terms of coding style, "static" should go on a separate line.
msg107285	Author: STINNER Victor (vstinner) *	Date: 2010-06-07 22:00
> the shortcuts were meant for Python internal use only
str.encode() calls PyUnicode_AsEncodedString() and bytes.decode() calls PyUnicode_Decode(), so it is not for internal use only. E.g. "text".encode("ASCII") doesn't use the fast path.
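One rough way to see whether a given spelling is noticeably slower is a micro-benchmark with timeit. This is only a sketch: time_encode() is a hypothetical helper, the timings depend heavily on the Python version and machine, and a small difference proves nothing about which code path was taken.

```python
import timeit


def time_encode(encoding, number=50_000):
    """Time `number` calls of "text".encode(encoding), in seconds."""
    return timeit.timeit(lambda: "text".encode(encoding), number=number)


# Same encodings, canonical and non-canonical spellings.
for name in ("ascii", "ASCII", "utf-8", "UTF_8"):
    print(f"{name!r}: {time_encode(name):.4f} s")
```

All spellings produce the same bytes; only the time spent resolving the encoding name can differ.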
msg107286	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-07 22:05
STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>> the shortcuts were meant for Python internal use only
> str.encode() calls PyUnicode_AsEncodedString() and bytes.decode() calls PyUnicode_Decode(), so it is not for internal use only. E.g. "text".encode("ASCII") doesn't use the fast path.
Right. As I said: the shortcuts are meant for internal use only. External code should not rely on them, but can, of course, use those canonical names as well.
Note that these shortcuts bypass the codec registry logic. Codec search functions cannot redirect these shortcuts to their own implementations, so we have to be careful about adding more such shortcuts.
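For readers unfamiliar with the registry, the following sketch shows what a codec search function looks like and when it is consulted. The names demo_search and "issue8922demo" are hypothetical and exist only for this example. Note the limits of the demonstration: the search function is never asked about "utf-8", but that is consistent both with the shortcut and with the standard encodings package answering first, so it illustrates the mechanism rather than proving the bypass.

```python
import codecs

seen = []  # names the hypothetical search function was asked to resolve


def demo_search(name):
    """Hypothetical search function: resolves one made-up codec name."""
    seen.append(name)
    if name == "issue8922demo":
        utf8 = codecs.lookup("utf-8")
        # Redirect the made-up name to the UTF-8 implementation.
        return codecs.CodecInfo(name="issue8922demo",
                                encode=utf8.encode,
                                decode=utf8.decode)
    return None  # let other search functions handle everything else


codecs.register(demo_search)

a = "text".encode("utf-8")          # never reaches demo_search
b = "text".encode("issue8922demo")  # resolved through demo_search
print(a, b, seen)
```

A name that is handled by a shortcut never gets this far, which is why a search function cannot substitute its own implementation for it.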
msg107287	Author: STINNER Victor (vstinner) *	Date: 2010-06-07 22:32
> Note that these shortcuts bypass the codec registry logic.
Yes, but it's already the case without my patch. I don't think that it's really useful to override latin1, utf-8, utf-16, utf-32 or mbcs. I prefer a faster Python :-)
> we have to be careful about adding more such shortcuts.
I just want to add a shortcut for ISO-8859-1.
msg107452	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-10 10:34
STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>> Note that these shortcuts bypass the codec registry logic.
> Yes, but it's already the case without my patch. I don't think that it's really useful to override latin1, utf-8, utf-16, utf-32 or mbcs. I prefer a faster Python :-)
Depends on your use case. E.g. utf-32 is hardly ever used in practice, and utf-16 is only common on Windows and then only as utf-16-le. I'm not sure about mbcs since that's a meta-codec. In reality, this will likely be the same as cp1252 most of the time.
I'm OK with ascii, latin1, utf-8 and mbcs (including the additional normalization, aliasing and case mapping), but not with the others.
>> we have to be careful about adding more such shortcuts.
> I just want to add a shortcut for ISO-8859-1.
Fine, even though that name is really not used much in Python code.
msg107456	Author: STINNER Victor (vstinner) *	Date: 2010-06-10 12:02
Committed in 3.2 (r81869), blocked in 3.1 (r81870).
--
Oops, I don't know why I wrote utf-16 and utf-32. I don't want to add them to the shortcuts.
msg107459	Author: STINNER Victor (vstinner) *	Date: 2010-06-10 13:48
On Thursday 10 June 2010 at 14:02:34, you wrote:
> Committed in 3.2 (r81869), blocked in 3.1 (r81870).
This commit introduced a regression: ISO-8859-15 was treated as an alias of ISO-8859-1 because the normalized string was truncated. Fixed in r81871 (blocked in 3.1: r81872).
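The truncation problem is easy to reproduce in a small Python model. This is a sketch under stated assumptions, not the code from r81869 or r81871: BUF_CHARS, normalize_truncating() and normalize_checked() are hypothetical, the buffer capacity of 10 characters is assumed for illustration, and rejecting over-long names is just one plausible way to avoid the bug.

```python
BUF_CHARS = 10  # assumed capacity of the normalization buffer (hypothetical)


def normalize_truncating(name):
    """Buggy model: silently cut the normalized name to the buffer size."""
    return name.lower().replace("_", "-")[:BUF_CHARS]


def normalize_checked(name):
    """Safer model: refuse names that do not fit instead of truncating."""
    norm = name.lower().replace("_", "-")
    if len(norm) > BUF_CHARS:
        return None  # caller should fall back to the slow path
    return norm


# "iso-8859-15" has 11 characters; cutting it to 10 yields "iso-8859-1".
print(normalize_truncating("ISO-8859-15"))
print(normalize_checked("ISO-8859-15"))
print(normalize_checked("ISO_8859_1"))
```

The confusion matters because the two encodings differ: for example, the euro sign can be encoded in ISO-8859-15 but not in ISO-8859-1.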

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53168
2010-06-10 13:48:20	vstinner	set	messages: + msg107459
2010-06-10 12:02:32	vstinner	set	status: open -> closed resolution: fixed messages: + msg107456
2010-06-10 10:35:00	lemburg	set	messages: + msg107452
2010-06-07 22:32:59	vstinner	set	messages: + msg107287
2010-06-07 22:05:28	lemburg	set	messages: + msg107286
2010-06-07 22:00:10	vstinner	set	messages: + msg107285
2010-06-07 21:25:03	lemburg	set	nosy: + lemburg messages: + msg107280
2010-06-06 18:23:56	vstinner	create