Message 107280 - Python tracker

Author	lemburg
Recipients	lemburg, pitrou, vstinner
Date	2010-06-07.21:25:02
Message-id	<4C0D63AC.4060705@egenix.com>
In-reply-to	<1275848637.97.0.606582513318.issue8922@psf.upfronthosting.co.za>

Content

STINNER Victor wrote:
> 
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
> 
> PyUnicode_Decode() and PyUnicode_AsEncodedString() call the builtin decoders/encoders directly for some known encodings (e.g. "utf-8"), instead of using the slow path (calling PyCodec_Decode() / PyCodec_Encode()).
> 
> PyUnicode_Decode() does normalize the encoding name: it converts the name to lower case and replaces "_" with "-", as normalizestring() does. But PyUnicode_AsEncodedString() doesn't normalize the encoding name; it just uses strcmp(). PyUnicode_Decode() has a shortcut for ISO-8859-1, whereas PyUnicode_AsEncodedString() doesn't (only for "latin-1").
> 
> The attached patch creates a static helper function, normalize_encoding(), uses it in PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for ISO-8859-1 to PyUnicode_AsEncodedString().

The normalization in PyUnicode_Decode() must have been added in
Python 3 only. It is not present in Python 2.

I'm not sure whether it's a good idea to extend this further:
the shortcuts were meant for Python internal use only. Python
itself and its stdlib should only use the shortcut names
for the respective special encodings and no variants.

Dealing with variants and normalization is left to the encodings
package and its alias machinery.
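
The division of labour described above can be sketched roughly as follows. This is only an illustration, not the actual unicodeobject.c code: the enum, the function name choose_path() and the choice of the two shortcut names shown are assumptions made for the example. Exact shortcut names take the fast path; every variant spelling falls through to the generic codec lookup, which is where aliases get resolved.

```c
#include <string.h>

/* Hypothetical labels for the code path an encoding name would take. */
typedef enum {
    PATH_UTF8,            /* builtin UTF-8 codec, called directly */
    PATH_LATIN1,          /* builtin Latin-1 codec, called directly */
    PATH_CODEC_REGISTRY   /* slow path: generic codec lookup with aliases */
} codec_path;

/* Compare the name against the exact shortcut spellings only.
   No normalization is done here: variants such as "UTF_8" or
   "iso-8859-1" are left to the slow path. */
static codec_path
choose_path(const char *encoding)
{
    if (strcmp(encoding, "utf-8") == 0)
        return PATH_UTF8;
    if (strcmp(encoding, "latin-1") == 0)
        return PATH_LATIN1;
    return PATH_CODEC_REGISTRY;
}
```

With this layout, callers that already pass "utf-8" or "latin-1" never pay for normalization, and anything else still works, just via the slower lookup.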

Since the Python stdlib and the core already mostly use
the shortcut names, adding normalization won't buy us much.

Note that your change has also made it impossible for the
compiler to do loop unrolling - there's no upper limit
on the size of the lower buffer anymore.
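
To illustrate the point about the loop bound, here is a minimal sketch of a normalization loop whose trip count is limited by a compile-time constant. It is not the code from the patch or from unicodeobject.c: the names normalize_name(), is_utf8() and LOWER_SIZE, as well as the buffer size of 20, are assumptions chosen for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed buffer size; enough for short names like "iso-8859-1". */
#define LOWER_SIZE 20

/* Copy `encoding` into `lower`, converting ASCII upper case to lower case
   and replacing '_' with '-'.  At most LOWER_SIZE - 1 characters are
   copied, so the loop bound is known at compile time.
   Returns 1 if the whole name fit, 0 if it was truncated. */
static int
normalize_name(const char *encoding, char lower[LOWER_SIZE])
{
    size_t i;

    for (i = 0; i < LOWER_SIZE - 1 && encoding[i] != '\0'; i++) {
        char c = encoding[i];
        if (c >= 'A' && c <= 'Z')
            c = (char)(c - 'A' + 'a');
        else if (c == '_')
            c = '-';
        lower[i] = c;
    }
    lower[i] = '\0';
    return encoding[i] == '\0';
}

/* Return 1 if the normalized name is exactly the shortcut name "utf-8". */
static int
is_utf8(const char *encoding)
{
    char lower[LOWER_SIZE];

    return normalize_name(encoding, lower) && strcmp(lower, "utf-8") == 0;
}
```

If the buffer length is instead passed in as a runtime argument, the compiler no longer sees a constant bound for the loop, which is the concern raised above.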

In terms of coding style, "static" should go on a separate line.

History
Date	User	Action	Args
2010-06-07 21:25:04	lemburg	set	recipients: + lemburg, pitrou, vstinner
2010-06-07 21:25:03	lemburg	link	issue8922 messages
2010-06-07 21:25:02	lemburg	create