Message 129257 - Python tracker


Author	lemburg
Recipients	belopolsky, ezio.melotti, georg.brandl, lemburg, mrabarnett, pitrou
Date	2011-02-24.09:29:13
Message-id	<4D6624E8.9020508@egenix.com>
In-reply-to	<1298520054.8.0.0591201159241.issue5902@psf.upfronthosting.co.za>

Content
Alexander Belopolsky wrote:
> 
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
> 
> Ezio and I discussed on IRC the implementation of alias lookup and neither of us was able to point out to the function that strips non-alphanumeric characters from encoding names.

I think you are misunderstanding the way the codec registry works.

You register codec search functions with it, and each of them then has
to try to map a given encoding name to a codec module.

The stdlib ships with one such function (defined in encodings/__init__.py).
This is registered with the codec registry by default.

The codec search function takes care of any normalization and conversion
to the module name used by the codecs from that codec package.
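
As a rough illustration of the registration mechanism described above, the
following sketch registers a second search function next to the stdlib one.
The alias name `demo_utf8_alias` and the function `demo_search` are made up
for this example; only `codecs.register()` and `codecs.lookup()` are real APIs.

```python
import codecs

# Names the registry passed to our search function (for inspection only).
seen_names = []

def demo_search(name):
    """Hypothetical search function: maps one made-up alias to UTF-8."""
    seen_names.append(name)
    if name == "demo_utf8_alias":
        # Reuse the stdlib UTF-8 codec rather than writing a new one.
        return codecs.lookup("utf-8")
    # Returning None tells the registry to try the next search function.
    return None

codecs.register(demo_search)

# The registry lowercases the name before calling the search functions.
encoded = "h\u00e9llo".encode("Demo_UTF8_Alias")
print(encoded)
print(seen_names)
```

The stdlib search function is asked first and does not know the made-up
name, so the registry falls through to `demo_search`, which receives the
name already lowercased.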

> It turns out that there are three "normalize" functions that are successively applied to the encoding name during evaluation of str.encode/str.decode.
> 
> 1. normalize_encoding() in unicodeobject.c

This was added so that the few shortcuts we have in the C code
for commonly used codecs match more encoding aliases.

The shortcuts completely bypass the codec registry and also
bypass the function call overhead incurred by codecs
run via the codec registry.

> 2. normalizestring() in codecs.c

This is the normalization applied by the codec registry. See PEP 100
for details:

"""
    Search functions are expected to take one argument, the encoding
    name in all lower case letters and with hyphens and spaces
    converted to underscores, ...
"""
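
Read literally, the quoted rule amounts to something like the sketch below.
This is a pure-Python paraphrase of the PEP text, not the C code in codecs.c,
and the helper name `registry_normalize` is invented for illustration; the
real implementation may differ in detail between Python versions.

```python
def registry_normalize(name):
    """Paraphrase of the PEP 100 rule: lower case, hyphens/spaces -> underscores."""
    return name.lower().replace("-", "_").replace(" ", "_")

print(registry_normalize("ISO 8859-1"))
print(registry_normalize("UTF-8"))
```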

> 3. normalize_encoding() in encodings/__init__.py

This is part of the stdlib encodings package's codec search
function.
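
The stdlib helper can be called directly to see what it does to a name. The
sketch below assumes a CPython where `encodings.normalize_encoding()` and the
`encodings.aliases.aliases` table are importable; normalization details have
shifted slightly between versions, so only simple ASCII inputs are used.

```python
import encodings
from encodings import aliases

# Runs of non-alphanumeric characters collapse to a single underscore;
# this helper does not change the case of the name.
normalized = encodings.normalize_encoding("ISO--8859 1")
print(normalized)

# The registry has normally lowercased the name already, so we do it by
# hand here before consulting the alias table for the codec module name.
module_name = aliases.aliases.get(normalized.lower())
print(module_name)
```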

> Each performs a slightly different transformation and only the last one strips non-alphanumeric characters.
> 
> The complexity of codec lookup is comparable with that of the import mechanism!

It's flexible, but not really complex.

I hope the above clarifies the reasons for the three normalization
functions.

History
Date	User	Action	Args
2011-02-24 09:29:14	lemburg	set	recipients: + lemburg, georg.brandl, belopolsky, pitrou, ezio.melotti, mrabarnett
2011-02-24 09:29:13	lemburg	link	issue5902 messages
2011-02-24 09:29:13	lemburg	create