Issue 11322: encoding package's normalize_encoding() function is too slow

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/55531

classification

Title:	encoding package's normalize_encoding() function is too slow
Type:	performance	Stage:
Components:	Unicode	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, ezio.melotti, gregory.p.smith, jcea, lemburg, sdaoden, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2011-02-25 15:55 by lemburg, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
encoding_normalize_optimize.patch	methane, 2016-12-15 09:24		review

Messages (10)
msg129386 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2011-02-25 15:55
I don't know who changed the encoding's package normalize_encoding() function (wasn't me), but it's a really slow implementation.

The original version used the .translate() method which is a lot faster and can be adapted to work with the Unicode variant of the .translate() method just as well.

_norm_encoding_map = ('                                              . '
                      '0123456789       ABCDEFGHIJKLMNOPQRSTUVWXYZ     '
                      ' abcdefghijklmnopqrstuvwxyz                     '
                      '                                                '
                      '                                                '
                      '                ')

def normalize_encoding(encoding):

    """ Normalize an encoding name.

        Normalization works as follows: all non-alphanumeric
        characters except the dot used for Python package names are
        collapsed and replaced with a single underscore, e.g. '  -;#'
        becomes '_'. Leading and trailing underscores are removed.

        Note that encoding names should be ASCII only; if they do use
        non-ASCII characters, these must be Latin-1 compatible.

    """
    # Make sure we have an 8-bit string, because .translate() works
    # differently for Unicode strings.
    if hasattr(__builtin__, "unicode") and isinstance(encoding, unicode):
        # Note that .encode('latin-1') does not use the codec
        # registry, so this call doesn't recurse. (See unicodeobject.c
        # PyUnicode_AsEncodedString() for details)
        encoding = encoding.encode('latin-1')
    return '_'.join(encoding.translate(_norm_encoding_map).split())
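For readers on Python 3, the .translate() idea above could be adapted roughly as follows. This is only a sketch under stated assumptions: the names `_KEEP`, `_NORM_TABLE` and `normalize_encoding_translate` are made up for illustration and are not part of the encodings package or of the attached patch. Like the 256-entry table in the message, it maps every non-alphanumeric Latin-1 code point except the dot to a space; code points above Latin-1 pass through unchanged.

```python
import string

# Characters that survive normalization: ASCII letters, digits and the dot.
# (Hypothetical helper names, chosen for this sketch only.)
_KEEP = frozenset(string.ascii_letters + string.digits + ".")

# str.translate() accepts a dict keyed by code point. Every Latin-1 code
# point that is not kept maps to a space, mirroring the 256-entry table.
_NORM_TABLE = {cp: " " for cp in range(256) if chr(cp) not in _KEEP}


def normalize_encoding_translate(encoding):
    """Collapse runs of non-alphanumeric characters into one underscore."""
    if isinstance(encoding, bytes):
        # Decoding as Latin-1 is an assumption of this sketch; it maps each
        # byte to the code point with the same value.
        encoding = encoding.decode("latin-1")
    # split() drops leading/trailing spaces and collapses runs of spaces.
    return "_".join(encoding.translate(_NORM_TABLE).split())


if __name__ == "__main__":
    for name in (" utf 8 ", "UTF-8", "iso.8859-1", "a -;#b"):
        print(repr(name), "->", repr(normalize_encoding_translate(name)))
```

Note that this variant does not change case, in line with the stated goal of keeping the semantics and changing only the performance.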
msg129389 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2011-02-25 16:34
I don't think the normalize_encoding() function was the culprit for issue11303: I measured timings with timeit, which averages multiple runs, while normalize_encoding() is called only once per encoding spelling due to caching.
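The caching mentioned above can be observed with a small experiment. The sketch below registers a counting search function for a made-up codec name (`demo_codec_11322` is hypothetical and exists only for this demonstration) and looks it up twice. On CPython the second lookup is expected to be served from the codec registry's cache, so the search function runs once.

```python
import codecs

# Records every time our search function actually produces a codec.
calls = []


def demo_search(name):
    # "demo_codec_11322" is a hypothetical codec name used only here.
    if name != "demo_codec_11322":
        return None
    calls.append(name)
    utf8 = codecs.lookup("utf-8")
    # Reuse the UTF-8 encoder/decoder under the made-up name.
    return codecs.CodecInfo(encode=utf8.encode, decode=utf8.decode, name=name)


codecs.register(demo_search)

first = codecs.lookup("demo_codec_11322")
second = codecs.lookup("demo_codec_11322")

if __name__ == "__main__":
    print("search function calls:", len(calls))
```

A timeit loop over a single spelling therefore mostly measures the cached path, not the normalization code.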
msg129460 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-02-25 23:03
We should first implement the same algorithm in the 3 normalization functions and add tests for them (at least for the function in normalization):

- normalize_encoding() in encodings: it doesn't convert to lowercase and keeps non-ASCII letters
- normalize_encoding() in unicodeobject.c
- normalizestring() in codecs.c

normalize_encoding() in encodings is more lax than the two other functions: it normalizes " utf 8 " to 'utf_8'. But it doesn't convert to lowercase and keeps non-ASCII letters: "UTF-8é" is normalized to "UTF_8é".

I don't know if the normalization functions have to be more or less strict, but I think that they should all give the same result.
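The behaviour described above can be reproduced with a character loop like the one below. This is a sketch written to match the examples in the message, not a copy of the stdlib code in any particular version. `normalize_strict` is a purely hypothetical variant that shows what lowercasing and dropping non-ASCII characters would change; it does not claim to match either C function.

```python
def normalize_lax(encoding):
    """Keep alphanumerics (including non-ASCII) and dots; keep case."""
    chars = []
    pending_sep = False
    for c in encoding:
        if c.isalnum() or c == ".":
            # Emit one underscore for any run of skipped characters,
            # but never at the very start of the result.
            if pending_sep and chars:
                chars.append("_")
            chars.append(c)
            pending_sep = False
        else:
            pending_sep = True
    return "".join(chars)


def normalize_strict(encoding):
    """Hypothetical stricter variant: ASCII only, lowercased."""
    # Treat every non-ASCII character as a separator (str.isascii: 3.7+).
    ascii_only = "".join(c if c.isascii() else " " for c in encoding)
    return normalize_lax(ascii_only).lower()


if __name__ == "__main__":
    for name in (" utf 8 ", "UTF-8\xe9"):
        print(repr(name), repr(normalize_lax(name)), repr(normalize_strict(name)))
```

With these definitions, " utf 8 " gives 'utf_8' in both variants, while "UTF-8é" gives "UTF_8é" in the lax one and 'utf_8' in the strict one.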
msg129463 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2011-02-25 23:06
STINNER Victor wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> We should first implement the same algorithm of the 3 normalization functions and add tests for them (at least for the function in normalization):
>
> - normalize_encoding() in encodings: it doesn't convert to lowercase and keep non-ASCII letters
> - normalize_encoding() in unicodeobject.c
> - normalizestring() in codecs.c
>
> normalize_encoding() in encodings is more laxist than the two other functions: it normalizes " utf 8 " to 'utf_8'. But it doesn't convert to lowercase and keeps non-ASCII letters: "UTF-8é" is normalized "UTF_8é".
>
> I don't know if the normalization functions have to be more or less strict, but I think that they should all give the same result.

Please see this message for an explanation of why we have those three functions, why they are different and what their application space is:

http://bugs.python.org/issue5902#msg129257

This ticket is just about the encoding package's codec search function, not the other two, and I don't want to change semantics, just its performance.
msg165517 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-07-15 11:16
> I don't know who changed the encoding's package normalize_encoding() function (wasn't me), but it's a really slow implementation.

See changeset 54ef645d08e4.
msg220630 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2014-06-15 13:02
What's the status of this issue, as we've lived with this really slow implementation for well over three years?
msg220633 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2014-06-15 13:19
On 15.06.2014 15:02, Mark Lawrence wrote:
>
> What's the status of this issue, as we've lived with this really slow implementation for well over three years?

I guess it just needs someone to write a patch.

Note that encoding lookups are cached, so the slowness only becomes an issue if you look up lots of different encodings.
msg283266 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2016-12-15 09:34
Thanks for the patch. Victor has implemented the function in C, AFAIK, so an even better approach would be to expose that function at the Python level and use it in the encodings package.
msg283271 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-12-15 09:53
It seems like encodings.normalize_encoding() currently has no unit test! Before modifying it, I would prefer to see a few unit tests:

* " utf 8 "
* "UtF 8"
* "utf8\xE9"
* etc.

Since we are talking about an optimization, I would like to see a benchmark result before/after.

I also would like to test Marc-Andre's idea of exposing the C function _Py_normalize_encoding().

_Py_normalize_encoding() works on a byte string encoded to Latin1. To implement encodings.normalize_encoding(), we might rewrite the function to work on Py_UCS4 characters, or have a fast version on char*, and a more generic version for UCS2 and UCS4?
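A before/after measurement of the kind requested above could be set up along these lines. This is a minimal sketch: `candidate` is a hypothetical regex-based alternative invented for illustration (it is not the attached patch), and `bench` and `SAMPLES` are likewise made-up names. The samples passed to the stdlib function are ASCII only, because its handling of non-ASCII input may differ between Python versions.

```python
import encodings
import re
import timeit

# Runs of characters to keep: ASCII letters, digits and the dot.
_TOKEN = re.compile(r"[A-Za-z0-9.]+")


def candidate(encoding):
    """Hypothetical alternative implementation, for comparison only."""
    return "_".join(_TOKEN.findall(encoding))


# ASCII-only spellings, including two of the inputs suggested above.
SAMPLES = [" utf 8 ", "UtF 8", "ISO-8859-1", "latin.1"]


def bench(func, samples=SAMPLES, number=2000):
    """Return the best observed time per call, in seconds."""
    def run():
        for s in samples:
            func(s)
    best = min(timeit.repeat(run, number=number, repeat=3))
    return best / (number * len(samples))


if __name__ == "__main__":
    for label, func in (("stdlib", encodings.normalize_encoding),
                        ("candidate", candidate)):
        print(f"{label:10s} {bench(func) * 1e9:8.1f} ns per call")
```

The absolute numbers depend on the machine and Python version, so only the ratio between the two rows is meaningful.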
msg283333 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-12-15 16:11
Oh, while reading Mercurial history, I found a note that I wrote: "It's not exactly the same than encodings.normalize_encoding(): the C function also converts to lowercase."

IMHO it's fine to modify encodings.normalize_encoding() to also convert to lower-case.

History
Date	User	Action	Args
2022-04-11 14:57:13	admin	set	github: 55531
2022-01-24 23:38:06	gregory.p.smith	set	nosy: + gregory.p.smith
2016-12-15 16:11:31	vstinner	set	messages: + msg283333
2016-12-15 09:53:01	vstinner	set	messages: + msg283271
2016-12-15 09:34:52	lemburg	set	messages: + msg283266 versions: + Python 3.7, - Python 3.4, Python 3.5
2016-12-15 09:27:10	BreamoreBoy	set	nosy: - BreamoreBoy
2016-12-15 09:24:30	methane	set	files: + encoding_normalize_optimize.patch keywords: + patch
2014-06-15 13:19:56	lemburg	set	messages: + msg220633
2014-06-15 13:02:58	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg220630 versions: + Python 3.4, Python 3.5, - Python 3.3
2012-07-15 11:16:20	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg165517
2011-03-01 16:55:06	jcea	set	nosy: + jcea
2011-02-26 10:09:45	sdaoden	set	nosy: + sdaoden
2011-02-25 23:06:49	lemburg	set	nosy: lemburg, belopolsky, vstinner, ezio.melotti messages: + msg129463 title: encoding package's normalize_encoding() function is too slow -> encoding package's normalize_encoding() function is too slow
2011-02-25 23:03:06	vstinner	set	nosy: + vstinner messages: + msg129460
2011-02-25 19:33:57	belopolsky	link	issue11303 superseder
2011-02-25 16:34:58	belopolsky	set	nosy: lemburg, belopolsky, ezio.melotti messages: + msg129389
2011-02-25 16:12:54	ezio.melotti	set	nosy: + ezio.melotti, belopolsky
2011-02-25 15:55:31	lemburg	create