classification
Title: encoding package's normalize_encoding() function is too slow
Type: performance Stage:
Components: Unicode Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ezio.melotti, haypo, jcea, lemburg, sdaoden, serhiy.storchaka
Priority: normal Keywords:

Created on 2011-02-25 15:55 by lemburg, last changed 2012-07-15 11:16 by serhiy.storchaka.

Messages (5)
msg129386 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-25 15:55
I don't know who changed the encodings package's normalize_encoding() function (wasn't me), but it's a really slow implementation.

The original version used the .translate() method which is a lot faster and can be adapted to work with the Unicode variant of the .translate() method just as well.

_norm_encoding_map = ('                                              . '
                      '0123456789       ABCDEFGHIJKLMNOPQRSTUVWXYZ     '
                      ' abcdefghijklmnopqrstuvwxyz                     '
                      '                                                '
                      '                                                '
                      '                ')

def normalize_encoding(encoding):

    """ Normalize an encoding name.

        Normalization works as follows: all non-alphanumeric
        characters except the dot used for Python package names are
        collapsed and replaced with a single underscore, e.g. '  -;#'
        becomes '_'. Leading and trailing underscores are removed.

        Note that encoding names should be ASCII only; if they do use
        non-ASCII characters, these must be Latin-1 compatible.

    """
    # Make sure we have an 8-bit string, because .translate() works
    # differently for Unicode strings.
    if hasattr(__builtin__, "unicode") and isinstance(encoding, unicode):
        # Note that .encode('latin-1') does *not* use the codec
        # registry, so this call doesn't recurse. (See unicodeobject.c
        # PyUnicode_AsEncodedString() for details)
        encoding = encoding.encode('latin-1')
    return '_'.join(encoding.translate(_norm_encoding_map).split())
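
For reference, here is a minimal sketch of how the same idea could look with the Unicode variant of .translate() on Python 3. The names _KEEP, _NormMap and normalize_encoding_sketch are made up for this illustration and are not part of the encodings package. The sketch assumes that every character other than ASCII letters, digits and the dot should act as a separator, which is what the 256-character table above does.

```python
import string

# Characters that survive normalization: ASCII letters, digits and the dot.
_KEEP = string.ascii_letters + string.digits + '.'


class _NormMap(dict):
    """Translation map that turns every unlisted code point into a space."""

    def __missing__(self, codepoint):
        return ' '


# str.translate() looks characters up by code point (an int).
_NORM_MAP = _NormMap({ord(c): c for c in _KEEP})


def normalize_encoding_sketch(encoding):
    """Collapse runs of separator characters into a single underscore."""
    if isinstance(encoding, bytes):
        # Assumption: byte strings are Latin-1 compatible, as the
        # docstring of the original function requires.
        encoding = encoding.decode('latin-1')
    return '_'.join(encoding.translate(_NORM_MAP).split())


print(normalize_encoding_sketch('  utf   8  '))   # utf_8
print(normalize_encoding_sketch('ISO-8859-1'))    # ISO_8859_1
```

The dict subclass with __missing__ is only one way to cover code points above 255; a full table is not practical for Unicode strings.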
msg129389 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2011-02-25 16:34
I don't think the normalize_encoding() function was the culprit for issue11303: I measured timings with timeit, which averages multiple runs, while normalize_encoding() is called only once per encoding spelling due to caching.
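
To illustrate the caching argument, here is a small sketch with a hypothetical dict cache. It is not the actual cache used by the codec registry; _normalize, lookup and the call counter are invented for this example. It only shows why repeated lookups of the same spelling would not pay the normalization cost again.

```python
calls = 0      # counts how often the (slow) normalization really runs
_cache = {}    # hypothetical cache keyed by the exact spelling


def _normalize(name):
    """Stand-in for an expensive normalization step."""
    global calls
    calls += 1
    return '_'.join(name.replace('-', ' ').split()).lower()


def lookup(name):
    """Return the normalized name, normalizing only on a cache miss."""
    try:
        return _cache[name]
    except KeyError:
        result = _cache[name] = _normalize(name)
        return result


for _ in range(1000):
    lookup('UTF-8')     # normalized once, then served from the cache
lookup('utf-8')         # a different spelling, so one more normalization

print(calls)            # 2
```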
msg129460 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-02-25 23:03
We should first implement the same algorithm in all 3 normalization functions and add tests for them (at least for the function in normalization):

 - normalize_encoding() in encodings: it doesn't convert to lowercase and keeps non-ASCII letters
 - normalize_encoding() in unicodeobject.c
 - normalizestring() in codecs.c

normalize_encoding() in encodings is more lax than the other two functions: it normalizes "  utf   8  " to 'utf_8'. But it doesn't convert to lowercase and it keeps non-ASCII letters: "UTF-8é" is normalized to "UTF_8é".

I don't know if the normalization functions have to be more or less strict, but I think that they should all give the same result.
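
The two results quoted above can be reproduced with a small pure-Python sketch. It is written from the description in this message, not copied from the stdlib, so treat the function names as assumptions. normalize_strict is a purely hypothetical stricter variant (lowercase, ASCII only), added to show where the results would diverge; it does not claim to match the C functions.

```python
def normalize_lax(encoding):
    """Keep alphanumerics (including non-ASCII ones) and dots; collapse
    every other run of characters into a single underscore."""
    chars = []
    punct = False
    for c in encoding:
        if c.isalnum() or c == '.':
            if punct and chars:
                chars.append('_')   # one underscore per separator run
            chars.append(c)
            punct = False
        else:
            punct = True
    return ''.join(chars)


def normalize_strict(encoding):
    """Hypothetical stricter variant: drop non-ASCII, then lowercase."""
    ascii_only = encoding.encode('ascii', 'ignore').decode('ascii')
    return normalize_lax(ascii_only).lower()


name = 'UTF-8\xe9'                    # "UTF-8é"
print(normalize_lax('  utf   8  '))   # utf_8
print(normalize_lax(name))            # UTF_8é
print(normalize_strict(name))         # utf_8
```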
msg129463 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-02-25 23:06
Please see this message for an explanation of why we have those
three functions, why they are different and what their application
space is:

http://bugs.python.org/issue5902#msg129257

This ticket is just about the encodings package's codec search
function, not the other two, and I don't want to change its
semantics, just its performance.
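
One way to check "same semantics, better performance" is to run both variants over the same inputs and time them. The sketch below does that under stated assumptions: normalize_loop and normalize_translate are illustrative rewrites (not the stdlib code), the sample names are arbitrary, and only ASCII inputs are compared, since the two variants treat non-ASCII letters differently. No timing figures are claimed here; they depend on the interpreter and machine.

```python
import string
import timeit


def normalize_loop(encoding):
    """Character-by-character variant (written for this sketch)."""
    chars = []
    punct = False
    for c in encoding:
        if c.isalnum() or c == '.':
            if punct and chars:
                chars.append('_')
            chars.append(c)
            punct = False
        else:
            punct = True
    return ''.join(chars)


class _SpaceDefault(dict):
    """Map every unlisted code point to a space."""

    def __missing__(self, codepoint):
        return ' '


_TABLE = _SpaceDefault(
    {ord(c): c for c in string.ascii_letters + string.digits + '.'})


def normalize_translate(encoding):
    """translate()-based variant (written for this sketch)."""
    return '_'.join(encoding.translate(_TABLE).split())


SAMPLES = ['utf-8', 'ISO_8859-1', '  utf   8  ', 'latin 1', 'cp1252',
           '-;#x.y#;-']

# Semantics check: both variants must agree on every ASCII sample.
for name in SAMPLES:
    assert normalize_loop(name) == normalize_translate(name), name

# Timing: results vary by machine, so none are quoted here.
for func in (normalize_loop, normalize_translate):
    elapsed = timeit.timeit(lambda: [func(s) for s in SAMPLES], number=10000)
    print(f'{func.__name__}: {elapsed:.3f}s')
```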
msg165517 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-15 11:16
> I don't know who changed the encodings package's normalize_encoding() function (wasn't me), but it's a really slow implementation.

See changeset 54ef645d08e4.
History
Date                 User              Action  Args
2012-07-15 11:16:20  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg165517
2011-03-01 16:55:06  jcea              set     nosy: + jcea
2011-02-26 10:09:45  sdaoden           set     nosy: + sdaoden
2011-02-25 23:06:49  lemburg           set     nosy: lemburg, belopolsky, haypo, ezio.melotti; messages: + msg129463; title: encoding package's normalize_encoding() function is too slow -> encoding package's normalize_encoding() function is too slow
2011-02-25 23:03:06  haypo             set     nosy: + haypo; messages: + msg129460
2011-02-25 19:33:57  belopolsky        link    issue11303 superseder
2011-02-25 16:34:58  belopolsky        set     nosy: lemburg, belopolsky, ezio.melotti; messages: + msg129389
2011-02-25 16:12:54  ezio.melotti      set     nosy: + ezio.melotti, belopolsky
2011-02-25 15:55:31  lemburg           create