Message107280
STINNER Victor wrote:
>
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
>
> PyUnicode_Decode() and PyUnicode_AsEncodedString() call builtin decoders/encoders directly for some known encodings (e.g. "utf-8"), instead of using the slow path (calling PyCodec_Decode() / PyCodec_Encode()).
>
> PyUnicode_Decode() does normalize the encoding name: it converts the name to lower case and replaces "_" with "-", as normalizestring() does. But PyUnicode_AsEncodedString() doesn't normalize the encoding name, it just uses strcmp(). PyUnicode_Decode() has a shortcut for ISO-8859-1, whereas PyUnicode_AsEncodedString() doesn't (only for "latin-1").
>
> The attached patch creates a static helper function normalize_encoding(), uses it in PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for ISO-8859-1 to PyUnicode_AsEncodedString().
The normalization in PyUnicode_Decode() must have been added to
Python3 only. It is not present in Python2.

I'm not sure whether it's a good idea to extend this further:
the shortcuts were meant for Python internal use only. Python
itself and its stdlib should only use the shortcut names
for the respective special encodings and no variants.

Dealing with variants and normalization is left to the encodings
package and its alias machinery.

Since the Python stdlib and the core already mostly use
the shortcut names, adding normalization won't buy us much.

Note that your change has also made it impossible for the
compiler to do loop unrolling - there's no upper limit
on the size of lower anymore.
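
To illustrate the loop-unrolling point: a minimal sketch, not the attached patch, of a helper that keeps an upper bound on the loop by taking the buffer size as a parameter. The name normalize_encoding_sketch() and its signature are hypothetical, and it uses the C library's tolower() where CPython would use its own locale-independent macros. Whether a given compiler actually unrolls the loop is not guaranteed; the sketch only shows that the trip count stays bounded by the size of lower.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration, not the patch):
   lower-case `encoding` and replace '_' with '-', writing at most
   lower_len - 1 characters plus a terminating NUL into `lower`.
   Returns 1 on success, 0 if the name does not fit, in which case a
   caller would fall back to the slow codec-registry path. */
static int
normalize_encoding_sketch(const char *encoding, char *lower, size_t lower_len)
{
    size_t i;

    if (lower_len == 0)
        return 0;

    /* The trip count is bounded by lower_len, which is a compile-time
       constant when the caller passes sizeof of a fixed-size buffer. */
    for (i = 0; i < lower_len - 1 && encoding[i] != '\0'; i++) {
        unsigned char c = (unsigned char)encoding[i];
        lower[i] = (c == '_') ? '-' : (char)tolower(c);
    }
    lower[i] = '\0';

    /* Success only if the whole input name was consumed. */
    return encoding[i] == '\0';
}
```

A caller would declare a fixed-size buffer, e.g. char lower[20], pass sizeof lower, and only try the strcmp() shortcuts when the helper returns 1.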
In terms of coding style, "static" should go on a separate line.

Date | User | Action | Args
2010-06-07 21:25:04 | lemburg | set | recipients: + lemburg, pitrou, vstinner
2010-06-07 21:25:03 | lemburg | link | issue8922 messages
2010-06-07 21:25:02 | lemburg | create |