Message129537
On Fri, Feb 25, 2011 at 03:43:06PM +0000, Marc-Andre Lemburg wrote:
>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> r88586: Normalized the encoding names for Latin-1 and UTF-8 to
> 'latin-1' and 'utf-8' in the stdlib.
Even though - or maybe exactly because - i'm a newbie, i really
want to add another message after all this biting is over.
I've just read PEP 100 and msg129257 (on Issue 5902), and i feel
a bit confused.
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> It turns out that there are three "normalize" functions that are
> successively applied to the encoding name during evaluation of
> str.encode/str.decode.
>
> 1. normalize_encoding() in unicodeobject.c
>
> This was added to have the few shortcuts we have in the C code
> for commonly used codecs match more encoding aliases.
>
> The shortcuts completely bypass the codec registry and also
> bypass the function call overhead incurred by codecs
> run via the codec registry.
The thing i understand least is that names which are illegal
(according to IANA standards) are good on the one hand
(latin-1, utf-16-be), but bad on the other, i.e. in my
group-preserving code or in haypo's very fast but name-joining patch
(the first one): a *local* change in unicodeobject.c, whose result is
*only* used by its two users, PyUnicode_Decode() and
PyUnicode_AsEncodedString(). However:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> Programmers who don't use the encoding names triggering those
> optimizations will still have a running program, it'll only be
> a bit slower and that's perfectly fine.
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> think rather than removing any hyphens, spaces, etc. the
> function should additionally:
>
> * add hyphens whenever (they are missing and) there's switch
> from [a-z] to [0-9]
>
> That way you end up with the correct names for the given set
> of optimized encoding names.
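
The rule quoted above could be sketched in Python roughly as follows. This is only an illustration: the name add_missing_hyphens() is made up here, and the real normalize_encoding() in unicodeobject.c is C code whose details may differ.

```python
import re


def add_missing_hyphens(name):
    """Hypothetical sketch: lower-case the name, then insert a hyphen
    wherever a letter [a-z] is directly followed by a digit [0-9]."""
    lowered = name.lower()
    # Zero-width match between a letter and a digit; a name that already
    # has a hyphen (or any other separator) at that spot is left alone.
    return re.sub(r"(?<=[a-z])(?=[0-9])", "-", lowered)


for raw in ("latin1", "UTF8", "utf-8", "utf16"):
    print(raw, "->", add_missing_hyphens(raw))
```

With such a step, spellings like "latin1" and "utf8" would end up as the hyphenated names the shortcuts compare against, while names that are already hyphenated pass through unchanged.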
haypo's patch can easily be adjusted to reflect this, resulting in
much cleaner code in the two mentioned users, because
normalize_encoding() would then do the job it was meant for.
(Hmmm, and my own code could also be adjusted to match Python
semantics (using a hyphen instead of a space as the group separator),
so that an end-user has the choice between *all* IANA standard
names (e.g. "ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1"),
and would gain the full optimization benefit of using latin-1,
which seems to be pretty useful for limburger.)
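
As a small check from the Python side, the spellings listed above can be passed to codecs.lookup() to see whether they resolve to one and the same codec. This is only a sketch: it goes through the codec registry and says nothing about whether a given spelling triggers the C shortcut in unicodeobject.c.

```python
import codecs

# Spellings taken from the paragraph above, plus Python's own "latin-1".
aliases = ["ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1", "latin-1"]

# CodecInfo.name is the canonical name reported by the codec itself.
canonical = {codecs.lookup(alias).name for alias in aliases}
print(canonical)
```

If all spellings map to the same codec, the set contains exactly one name.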
> Ezio Melotti wrote:
> Marc-Andre Lemburg wrote:
>> That won't work, Victor, since it makes invalid encoding
>> names valid, e.g. 'utf(=)-8'.
>
> That already works in Python (thanks to encodings.normalize_encoding)
*However*: with PEP 100, Python decided to go its own way
a decade ago.
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 2. normalizestring() in codecs.c
>
> This is the normalization applied by the codec registry. See PEP 100
> for details:
>
> """
> Search functions are expected to take one argument,
> the encoding name in all lower case letters and with hyphens
> and spaces converted to underscores, ...
> """
> 3. normalize_encoding() in encodings/__init__.py
>
> This is part of the stdlib encodings package's codec search function.
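
To make the difference between the last two steps a bit more concrete, here is a rough Python sketch. registry_normalize() is a made-up name that only follows the sentence quoted from PEP 100 (the real normalizestring() is C code in codecs.c), while encodings.normalize_encoding() is the actual stdlib function, whose exact behaviour may vary between Python versions.

```python
import encodings


def registry_normalize(name):
    """Hypothetical sketch of the rule quoted from PEP 100: lower-case,
    with hyphens and spaces converted to underscores."""
    return name.lower().replace("-", "_").replace(" ", "_")


for raw in ("UTF-8", "ISO 8859-1", "utf(=)-8"):
    # The sketch only touches case, hyphens and spaces; the stdlib
    # function decides on its own what to do with other punctuation.
    print(raw, "->", registry_normalize(raw), "|",
          encodings.normalize_encoding(raw))
```

The sketch leaves the parentheses and the equals sign of 'utf(=)-8' in place, whereas the stdlib function treats runs of punctuation as a single separator, which is why such a name can still be found.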
First: *i* go with haypo:
> It's funny: we have 3 functions to normalize an encoding name, and
> each function does something else :-)
(that's Issue 11322:)
> We should first implement the same algorithm of the 3 normalization
> functions and add tests for them
And *i* don't understand any other approach (*i* do have *my* - now
further optimized, thanks - s_textcodec_normalize_name()).
However, two different functions (a very fast one which is just enough
to serve unicodeobject.c, and a global one for everything else) may also do.
Isn't anything else a maintenance mess? Where is that database, and
are there any known dependencies which are exposed to end-users?
Or the like.
I'm much too loud, and have a nice weekend.
Date                | User    | Action | Args
2011-02-26 12:42:15 | sdaoden | set    | recipients: + sdaoden, lemburg, rhettinger, jcea, belopolsky, pitrou, vstinner, ezio.melotti, eric.araujo
2011-02-26 12:42:13 | sdaoden | link   | issue11303 messages
2011-02-26 12:42:13 | sdaoden | create |