Issue 11313: Speed up default encode()/decode()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/55522

classification

Title:	Speed up default encode()/decode()
Type:	performance	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.3, Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	akuchling, belopolsky, ezio.melotti, lemburg
Priority:	normal	Keywords:	patch

Created on 2011-02-24 22:29 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
default-encode.diff	belopolsky, 2011-02-24 22:29		review
issue11313.diff	ezio.melotti, 2011-02-25 00:11	Alexander's patch + tests	review

Messages (10)
msg129318 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-02-24 22:29
In Python 3.x default encoding is always utf-8, but encode()/decode() still try to look it up. Attached patch eliminates a call to normalize_encoding and several strcmp() calls.
msg129324 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-02-25 00:11
Patch looks good. I checked the tests and couldn't fine any test for .encode()/.decode() without encoding, so I added them in the attached patch.
msg129325 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-02-25 00:25
Thanks for the review and the tests. I have found one more place that can be easily optimized. (See patch below.) The decode() methods in bytes and bytearray are not so easy unfortunately because for some reason they are written to accept any object as self, not only byte/bytearray. As a result, it is not that easy to short-circuit default case in these instances. With respect to the patch below, I'll make sure that there is a test for it. =================================================================== --- Python/getargs.c (revision 88545) +++ Python/getargs.c (working copy) @@ -1010,8 +1010,6 @@ /* Get 'e' parameter: the encoding name / encoding = (const char )va_arg(p_va, const char ); - if (encoding == NULL) - encoding = PyUnicode_GetDefaultEncoding(); /* Get output buffer parameter: 's' (recode all objects via Unicode) or @@ -1051,9 +1049,12 @@ arg, msgbuf, bufsize); /* Encode object; use default error handling */ - s = PyUnicode_AsEncodedString(u, - encoding, - NULL); + if (encoding == NULL) + s = PyUnicode_AsUTF8String(u); + else + s = PyUnicode_AsEncodedString(u, + encoding, + NULL); Py_DECREF(u); if (s == NULL) return converterr("(encoding failed)",
msg129327 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-02-25 01:21
Committed issue11313.diff in revision 88553. On the second thought, the getargs optimization is not worth the trouble because in existing sources 'e' code is used with constant encodings and one is unlikely to pass NULL as an encoding because that is equivalent to eliding the 'e' code altogether. I will keep this issue open to consider whether remaining if (encoding == NULL) encoding = PyUnicode_GetDefaultEncoding(); clauses can be simply removed because the lower level code accepts NULL in encoding.
msg129338 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2011-02-25 08:36
Alexander Belopolsky wrote: > > New submission from Alexander Belopolsky <belopolsky@users.sourceforge.net>: > > In Python 3.x default encoding is always utf-8, but encode()/decode() still try to look it up. Attached patch eliminates a call to normalize_encoding and several strcmp() calls. +1. Please add a comment explaining why this can be done.
msg133573 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-04-12 11:45
> Committed issue11313.diff in revision 88553. The revision number seems to be wrong -- maybe the commit got lost during the mercurial migration. All the changes of the patch seem to be there though, even if they went in with other commits: unicodeobjects.c: 7ab569b18c98 together with changes from #11303. test_bytes.py: be6c38d1817b together with the normalization of encoding names.
msg171351 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-09-26 17:32
Can this be closed?
msg174729 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2012-11-04 01:06
MAL suggested adding "a comment explaining why this can be done". If Alexander or someone wants to do that, great! Otherwise, there seems no other reason to leave this issue open.
msg174740 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2012-11-04 02:13
I don't think a comment explaining that default encoding is utf-8 in python 3.x will improve readability of this code. if (encoding == NULL) return PyUnicode_DecodeUTF8(s, size, errors); seems quite self-explanatory.
msg175218 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-11-09 09:47
Agreed, closing.

History
Date	User	Action	Args
2022-04-11 14:57:13	admin	set	github: 55522
2012-11-09 09:47:57	ezio.melotti	set	status: open -> closed versions: + Python 3.4 messages: + msg175218 resolution: fixed stage: commit review -> resolved
2012-11-04 02:13:41	belopolsky	set	messages: + msg174740
2012-11-04 01:06:21	akuchling	set	nosy: + akuchling messages: + msg174729
2012-09-26 17:32:35	ezio.melotti	set	messages: + msg171351
2011-04-12 11:45:24	ezio.melotti	set	messages: + msg133573
2011-02-25 08:36:02	lemburg	set	nosy: + lemburg messages: + msg129338
2011-02-25 01:21:19	belopolsky	set	nosy: belopolsky, ezio.melotti messages: + msg129327
2011-02-25 00:25:59	belopolsky	set	nosy: belopolsky, ezio.melotti messages: + msg129325
2011-02-25 00:11:48	ezio.melotti	set	files: + issue11313.diff nosy: belopolsky, ezio.melotti messages: + msg129324 components: + Interpreter Core stage: commit review
2011-02-24 22:30:30	ezio.melotti	set	nosy: + ezio.melotti
2011-02-24 22:29:46	belopolsky	create