classification
Title: encoding package's normalize_encoding() function is too slow
Type: performance
Stage:
Components: Unicode
Versions: Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: belopolsky, ezio.melotti, gregory.p.smith, jcea, lemburg, sdaoden, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2011-02-25 15:55 by lemburg, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded
encoding_normalize_optimize.patch methane, 2016-12-15 09:24
Messages (10)
msg129386 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-25 15:55
I don't know who changed the encodings package's normalize_encoding() function (it wasn't me), but the current implementation is really slow.

The original version used the .translate() method, which is a lot faster and can just as well be adapted to work with the Unicode variant of .translate().

import __builtin__  # Python 2 only; needed by the unicode check in normalize_encoding()

_norm_encoding_map = ('                                              . '
                      '0123456789       ABCDEFGHIJKLMNOPQRSTUVWXYZ     '
                      ' abcdefghijklmnopqrstuvwxyz                     '
                      '                                                '
                      '                                                '
                      '                ')

def normalize_encoding(encoding):

    """ Normalize an encoding name.

        Normalization works as follows: all non-alphanumeric
        characters except the dot used for Python package names are
        collapsed and replaced with a single underscore, e.g. '  -;#'
        becomes '_'. Leading and trailing underscores are removed.

        Note that encoding names should be ASCII only; if they do use
        non-ASCII characters, these must be Latin-1 compatible.

    """
    # Make sure we have an 8-bit string, because .translate() works
    # differently for Unicode strings.
    if hasattr(__builtin__, "unicode") and isinstance(encoding, unicode):
        # Note that .encode('latin-1') does *not* use the codec
        # registry, so this call doesn't recurse. (See unicodeobject.c
        # PyUnicode_AsEncodedString() for details)
        encoding = encoding.encode('latin-1')
    return '_'.join(encoding.translate(_norm_encoding_map).split())
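For Python 3 readers, the same idea can be sketched with str.translate(). This is only an illustration under assumptions: the names _KEEP, _NORM_TABLE and normalize_encoding_translate are hypothetical and not part of the encodings package. Like the table above, the sketch treats every Latin-1 character other than ASCII letters, digits and the dot as a separator. Characters above U+00FF are not in the table and pass through unchanged.

```python
import string

# Characters that survive normalization: ASCII letters, digits and the dot.
_KEEP = set(string.ascii_letters + string.digits + ".")

# str.translate() accepts a dict keyed by code point. Every other Latin-1
# code point is mapped to a space so that str.split() can collapse runs.
_NORM_TABLE = {cp: " " for cp in range(256) if chr(cp) not in _KEEP}


def normalize_encoding_translate(encoding):
    """Hypothetical Python 3 port of the translate-based approach."""
    if isinstance(encoding, bytes):
        # Latin-1 maps bytes 0-255 one-to-one onto code points 0-255.
        encoding = encoding.decode("latin-1")
    # split() drops leading/trailing separators and collapses runs of them.
    return "_".join(encoding.translate(_NORM_TABLE).split())


print(normalize_encoding_translate("  utf   8  "))  # utf_8
print(normalize_encoding_translate("ISO-8859-1"))   # ISO_8859_1
```

Note that this variant drops non-ASCII Latin-1 letters such as "é", which matches the posted table but differs from an implementation based on str.isalnum().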
msg129389 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-25 16:34
I don't think normalize_encoding() was the culprit in issue11303: I measured the timings with timeit, which averages over multiple runs, while normalize_encoding() is called only once per encoding spelling because of caching.
msg129460 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-02-25 23:03
We should first implement the same algorithm in all 3 normalization functions and add tests for them (at least for the function in normalization):

 - normalize_encoding() in encodings: it doesn't convert to lowercase and keeps non-ASCII letters
 - normalize_encoding() in unicodeobject.c
 - normalizestring() in codecs.c

normalize_encoding() in encodings is more lenient than the other two functions: it normalizes "  utf   8  " to 'utf_8'. But it doesn't convert to lowercase and keeps non-ASCII letters: "UTF-8é" is normalized to "UTF_8é".

I don't know if the normalization functions have to be more or less strict, but I think that they should all give the same result.
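
The behaviour described for the encodings version can be illustrated with a small character loop. This is a sketch written from the description above, not a copy of the standard library code, and the name normalize_encoding_loop is hypothetical.

```python
def normalize_encoding_loop(encoding):
    """Collapse runs of non-alphanumeric characters (except '.') into '_'.

    Case is preserved, and non-ASCII letters are kept because
    str.isalnum() is Unicode-aware.
    """
    chars = []
    pending_separator = False
    for c in encoding:
        if c.isalnum() or c == ".":
            # Emit one underscore for the whole run, but never a leading one.
            if pending_separator and chars:
                chars.append("_")
            chars.append(c)
            pending_separator = False
        else:
            pending_separator = True
    return "".join(chars)


print(normalize_encoding_loop("  utf   8  "))  # utf_8
print(normalize_encoding_loop("UTF-8é"))       # UTF_8é
```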
msg129463 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-25 23:06
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> We should first implement the same algorithm in all 3 normalization functions and add tests for them (at least for the function in normalization):
> 
>  - normalize_encoding() in encodings: it doesn't convert to lowercase and keeps non-ASCII letters
>  - normalize_encoding() in unicodeobject.c
>  - normalizestring() in codecs.c
> 
> normalize_encoding() in encodings is more lenient than the other two functions: it normalizes "  utf   8  " to 'utf_8'. But it doesn't convert to lowercase and keeps non-ASCII letters: "UTF-8é" is normalized to "UTF_8é".
> 
> I don't know if the normalization functions have to be more or less strict, but I think that they should all give the same result.

Please see this message for an explanation of why we have those
three functions, why they are different and what their application
space is:

http://bugs.python.org/issue5902#msg129257

This ticket is just about the encoding package's codec search
function, not the other two, and I don't want to change
semantics, just its performance.
msg165517 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-15 11:16
> I don't know who changed the encoding's package normalize_encoding() function (wasn't me), but it's a really slow implementation.

See changeset 54ef645d08e4.
msg220630 - Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-15 13:02
What's the status of this issue? We've lived with this really slow implementation for well over three years.
msg220633 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2014-06-15 13:19
On 15.06.2014 15:02, Mark Lawrence wrote:
> 
> What's the status of this issue? We've lived with this really slow implementation for well over three years.

I guess it just needs someone to write a patch.

Note that encoding lookups are cached, so the slowness only
becomes an issue if you look up lots of different encodings.
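
The effect of the cache can be observed from Python with codecs.lookup(). The sketch below only uses the public codecs and timeit APIs; the spellings and the repeat count are arbitrary choices, and the timing it prints depends on the machine.

```python
import codecs
import timeit

# Different spellings of the same encoding all resolve to one codec.
spellings = ["utf-8", "UTF-8", "utf_8", "UTF8"]
names = {codecs.lookup(s).name for s in spellings}
print(names)  # {'utf-8'}

# After the first lookup of a spelling, later lookups are served from the
# codec registry's cache, so the per-call cost stays small.
elapsed = timeit.timeit(lambda: codecs.lookup("UTF-8"), number=100_000)
print(f"100000 cached lookups: {elapsed:.3f}s")
```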
msg283266 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-12-15 09:34
Thanks for the patch.

Victor has implemented the function in C, AFAIK, so an even better approach would be to expose that function at the Python level and use it in the encodings package.
msg283271 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 09:53
It seems like encodings.normalize_encoding() currently has no unit test! Before modifying it, I would prefer to see a few unit tests:

* " utf 8 "
* "UtF 8"
* "utf8\xE9"
* etc.
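
A table-driven check along these lines could serve as a starting point. It is a sketch: CASES and check_normalizer are hypothetical names, and the expected values assume the semantics described earlier in this thread (separators collapsed, case preserved). The non-ASCII input is left out of the table because its expected result depends on what is decided about non-ASCII letters.

```python
import encodings

# (input, expected) pairs, assuming case is preserved and runs of
# separators collapse to a single underscore.
CASES = [
    (" utf 8 ", "utf_8"),
    ("UtF 8", "UtF_8"),
]


def check_normalizer(func, cases=CASES):
    """Return a list of (input, expected, actual) tuples for failing cases."""
    failures = []
    for raw, expected in cases:
        actual = func(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


if __name__ == "__main__":
    # An empty list means every case matched.
    print(check_normalizer(encodings.normalize_encoding))
```

Passing the function under test as an argument lets the same table be run against both the current implementation and a candidate replacement.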

Since we are talking about an optimization, I would like to see a benchmark result before/after. I would also like to test Marc-Andre's idea of exposing the C function _Py_normalize_encoding().
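
One way to produce such numbers is timeit, run once on an unpatched build and once on a patched one. In this sketch the helper name bench, the sample names and the repeat counts are arbitrary assumptions, and no result is shown because timings vary by machine and Python version.

```python
import encodings
import timeit

# Arbitrary sample of spellings; real workloads may differ.
SAMPLES = ["utf-8", "ISO-8859-1", "  utf   8  ", "windows_1252"]


def bench(func, samples=SAMPLES, number=20_000, repeat=5):
    """Return the best total time in seconds over `repeat` timing runs."""
    def run():
        # One pass: normalize every sample name once.
        for name in samples:
            func(name)
    # Each timing run executes `run` `number` times; taking the minimum
    # of several runs reduces noise from other processes.
    return min(timeit.repeat(run, number=number, repeat=repeat))


if __name__ == "__main__":
    best = bench(encodings.normalize_encoding)
    print(f"best run: {best:.3f}s")
```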

_Py_normalize_encoding() works on a byte string encoded to Latin-1. To implement encodings.normalize_encoding(), we might rewrite the function to work on Py_UCS4 characters, or have a fast version for char* and a more generic version for UCS2 and UCS4.
msg283333 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-15 16:11
Oh, while reading Mercurial history, I found a note that I wrote:

"It's not exactly the same than encodings.normalize_encoding(): the C function also converts to lowercase."

IMHO it's fine to modify encodings.normalize_encoding() to also convert to lowercase.
History
Date User Action Args
2022-04-11 14:57:13 admin set github: 55531
2022-01-24 23:38:06 gregory.p.smith set nosy: + gregory.p.smith
2016-12-15 16:11:31 vstinner set messages: + msg283333
2016-12-15 09:53:01 vstinner set messages: + msg283271
2016-12-15 09:34:52 lemburg set messages: + msg283266
versions: + Python 3.7, - Python 3.4, Python 3.5
2016-12-15 09:27:10 BreamoreBoy set nosy: - BreamoreBoy
2016-12-15 09:24:30 methane set files: + encoding_normalize_optimize.patch
keywords: + patch
2014-06-15 13:19:56 lemburg set messages: + msg220633
2014-06-15 13:02:58 BreamoreBoy set nosy: + BreamoreBoy
messages: + msg220630
versions: + Python 3.4, Python 3.5, - Python 3.3
2012-07-15 11:16:20 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg165517
2011-03-01 16:55:06 jcea set nosy: + jcea
2011-02-26 10:09:45 sdaoden set nosy: + sdaoden
2011-02-25 23:06:49 lemburg set nosy: lemburg, belopolsky, vstinner, ezio.melotti
messages: + msg129463
title: encoding package's normalize_encoding() function is too slow -> encoding package's normalize_encoding() function is too slow
2011-02-25 23:03:06 vstinner set nosy: + vstinner
messages: + msg129460
2011-02-25 19:33:57 belopolsky link issue11303 superseder
2011-02-25 16:34:58 belopolsky set nosy: lemburg, belopolsky, ezio.melotti
messages: + msg129389
2011-02-25 16:12:54 ezio.melotti set nosy: + ezio.melotti, belopolsky
2011-02-25 15:55:31 lemburg create