
Author sdaoden
Recipients belopolsky, eric.araujo, ezio.melotti, jcea, lemburg, pitrou, rhettinger, sdaoden, vstinner
Date 2011-02-26.12:42:13
Message-id <20110226124204.GB38708@sherwood.local>
In-reply-to <1298648586.43.0.590678188608.issue11303@psf.upfronthosting.co.za>
Content
On Fri, Feb 25, 2011 at 03:43:06PM +0000, Marc-Andre Lemburg wrote:
> 
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
> 'latin-1' and 'utf-8' in the stdlib.

Even though - or maybe exactly because - I'm a newbie, I really 
want to add another message now that all this biting is over. 
I've just read PEP 100 and msg129257 (on Issue 5902), and I feel 
a bit confused.

> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> It turns out that there are three "normalize" functions that are 
> successively applied to the encoding name during evaluation of 
> str.encode/str.decode.
> 
> 1. normalize_encoding() in unicodeobject.c
>
> This was added to have the few shortcuts we have in the C code
> for commonly used codecs match more encoding aliases.
>
> The shortcuts completely bypass the codec registry and also
> bypass the function call overhead incurred by codecs
> run via the codec registry.

The thing I understand least is that names which are illegal 
(according to the IANA standards) are good on the one hand 
(latin-1, utf-16-be), but bad on the other, i.e. in my 
group-preserving code or in haypo's very fast name-joining patch 
(the first one): a *local* change in unicodeobject.c, whose result 
is *only* used by its two callers, PyUnicode_Decode() and 
PyUnicode_AsEncodedString().  However:

> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> Programmers who don't use the encoding names triggering those
> optimizations will still have a running program, it'll only be
> a bit slower and that's perfectly fine.
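
That is easy to see from a prompt, in a hedged way; the numbers 
depend on the build, and my assumption is that a spelling with a 
space in it, like "UTF 8", can never match the C shortcuts:

    import timeit
    # Both spellings work; the second simply goes through the
    # codec registry instead of the C shortcut and is thus slower.
    print(timeit.timeit("'abc'.encode('utf-8')"))
    print(timeit.timeit("'abc'.encode('UTF 8')"))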

> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> think rather than removing any hyphens, spaces, etc. the
> function should additionally:
>
>  * add hyphens whenever (they are missing and) there's switch
>     from [a-z] to [0-9]
>
> That way you end up with the correct names for the given set 
> of optimized encoding names.
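
A minimal sketch of how I read that combined rule (my 
interpretation only, not the actual patch):

    import re

    def normalize_encoding(name):
        # Lowercase, map '_' and ' ' to '-', then additionally
        # insert a hyphen at every letter-to-digit switch:
        # 'latin1' -> 'latin-1', 'UTF8' -> 'utf-8',
        # 'ISO_8859-1' -> 'iso-8859-1'.
        name = name.lower().replace("_", "-").replace(" ", "-")
        return re.sub(r"([a-z])([0-9])", r"\1-\2", name)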

haypo's patch can easily be adjusted to reflect this, resulting in 
much cleaner code in the two callers mentioned above, because 
normalize_encoding() then does the job it was meant for. 
(Hmmm, and my own code could also be adjusted to match Python's 
semantics (using a hyphen instead of a space as the group 
separator), so that an end user can choose between *all* the IANA 
standard names (e.g. "ISO-8859-1", "ISO8859-1", "ISO_8859-1", 
"LATIN1") and still gain the full optimization benefit of using 
latin-1, which seems to be pretty useful for Lemburg.)
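
All of those spellings already resolve to one and the same codec 
through the registry; what differs is only whether the fast path 
in unicodeobject.c can be taken (which exact names it matches is 
my assumption):

    import codecs
    names = ("ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1",
             "latin-1")
    # One codec behind all of the spellings; presumably only
    # 'latin-1' and 'latin1' can also hit the C shortcut.
    assert len({codecs.lookup(n).name for n in names}) == 1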

> Ezio Melotti wrote:
> Marc-Andre Lemburg wrote:
>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.
>
> That already works in Python (thanks to encodings.normalize_encoding)
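
And indeed, that is easy to verify (on the current py3k, at 
least):

    import encodings
    # Runs of punctuation collapse to a single underscore, which
    # is why even this spelling reaches the utf_8 codec.
    assert encodings.normalize_encoding("utf(=)-8") == "utf_8"
    assert "abc".encode("utf(=)-8") == b"abc"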

*However*: with PEP 100, Python decided to go its own way 
a decade ago.

> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 2. normalizestring() in codecs.c
>
> This is the normalization applied by the codec registry. See PEP 100
> for details:
>
> """
>    Search functions are expected to take one argument, 
>    the encoding name in all lower case letters and with hyphens 
>    and spaces converted to underscores, ...
> """

> 3. normalize_encoding() in encodings/__init__.py
>
> This is part of the stdlib encodings package's codec search function.

First: *I* go with haypo:

> It's funny: we have 3 functions to normalize an encoding name, and
> each function does something else :-)

(that's Issue 11322:)
> We should first implement the same algorithm of the 3 normalization
> functions and add tests for them
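
A sketch of what such tests could look like; only the 
Python-level function is importable today, so the two C-level 
functions would first have to be exposed somehow (that exposure 
is hypothetical):

    import unittest
    import encodings

    class NormalizeEncodingTest(unittest.TestCase):
        def test_punctuation_collapses(self):
            self.assertEqual(
                encodings.normalize_encoding("utf(=)-8"), "utf_8")

        def test_spaces_but_no_lowercasing(self):
            # Note that this function does not lowercase.
            self.assertEqual(
                encodings.normalize_encoding("ISO 8859-1"),
                "ISO_8859_1")

    if __name__ == "__main__":
        unittest.main()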

And *I* don't understand anything beyond that (*I* do have *my* 
own, now further optimized, thanks, s_textcodec_normalize_name()). 
However, two different functions (a very fast one, which is enough 
for unicodeobject.c, and a global one for everything else) may 
also do.  Isn't anything else a maintenance mess?  Where is that 
database, and are there any known dependencies that are exposed 
to end users?  Or the like.

I'm being much too loud; have a nice weekend.