Message 203955
Just noting the exact list of codecs that currently bypass the full codec machinery and go directly to the C implementation by normalising the codec name (which includes forcing it to lowercase) and then using strcmp to check it against a specific set of known encodings.
In PyUnicode_Decode (and hence bytes.decode and bytearray.decode):
utf-8
utf8
latin-1
latin1
iso-8859-1
iso8859-1
mbcs (Windows only)
ascii
utf-16
utf-32
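
For illustration, here is a rough Python-level sketch of the decode check described above. The helper name and the set-based lookup are mine; the real code is C in Objects/unicodeobject.c using strcmp, and the real normalisation may do a little more than just lowercasing.

import codecs

# Rough approximation only: the real fast path is C code, and the set below
# just mirrors the decode list given earlier in this message.
_DECODE_FAST_PATH = {
    "utf-8", "utf8",
    "latin-1", "latin1", "iso-8859-1", "iso8859-1",
    "ascii",
    "utf-16", "utf-32",
    # "mbcs" is also on the list, but only on Windows builds
}

def decode_with_fast_path(data, encoding, errors="strict"):
    normalized = encoding.lower()          # normalise the codec name
    if normalized in _DECODE_FAST_PATH:    # strcmp-style check against known names
        # In the C code this dispatches straight to the relevant C decoder
        # (e.g. the UTF-8 decoder) without consulting the codec registry.
        return data.decode(normalized, errors)
    # Everything else goes through the full codec machinery.
    return codecs.decode(data, encoding, errors)
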
In PyUnicode_AsEncodedString (and hence str.encode), the list is mostly the same, but utf-16 and utf-32 are not accelerated (i.e. they're currently still looked up through the codec machinery).
It may be worth opening a separate issue to restore consistency between the two lists by adding utf-16 and utf-32 to the fast path for encoding as well.
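
Whichever path a given spelling takes, the observable behaviour is identical; the fast path is purely an optimisation. A quick plain-Python sanity check (nothing below depends on the fast path itself):

data = "héllo".encode("utf-8")

# Different spellings of the same codec give identical results, whether or
# not they happen to hit the C fast path:
assert data.decode("utf-8") == data.decode("UTF-8") == data.decode("utf8")
assert "héllo".encode("latin-1") == "héllo".encode("iso8859-1")

# utf-16 decoding is on the fast path while utf-16 encoding currently isn't,
# but a round trip behaves the same either way:
assert "héllo".encode("utf-16").decode("utf-16") == "héllo"
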
As far as the wrapping mechanism from issue #17828 itself goes:
- it only triggers if PyEval_CallObject on the encoder or decoder returns NULL
- stateful exceptions (which include UnicodeEncodeError and UnicodeDecodeError) and those with custom __init__ or __new__ implementations don't get wrapped
- the actual wrapping process is just the C equivalent of "raise type(exc)(new_msg) from exc", plus the initial checks to determine whether the current exception can be wrapped safely (see the sketch after this list)
- it applies to the *general purpose* codec machinery, not just to the text-model-related convenience methods
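
To make the wrapping process concrete, here is a hypothetical Python rendering of it. The helper names and the message format are made up, and the real safety checks are performed at the C level against the exception type's slots and instance layout, so treat this purely as a sketch of the behaviour described above:

def _can_wrap_safely(exc):
    exc_type = type(exc)
    # Stateful exceptions can't be rebuilt from a single message string
    # (UnicodeEncodeError/UnicodeDecodeError need object/start/end/reason).
    if isinstance(exc, (UnicodeEncodeError, UnicodeDecodeError)):
        return False
    # Refuse to wrap anything with a custom __init__ or __new__ below the
    # builtin exception layer, since type(exc)(new_msg) may not be a valid
    # constructor call for such types.
    for klass in exc_type.__mro__:
        if klass.__module__ == "builtins":
            break
        if "__init__" in vars(klass) or "__new__" in vars(klass):
            return False
    return True

def wrap_codec_error(exc, extra_context):
    # The C equivalent of: raise type(exc)(new_msg) from exc
    if not _can_wrap_safely(exc):
        raise exc                      # leave the original exception as-is
    new_msg = "{} ({}: {})".format(extra_context, type(exc).__name__, exc)
    raise type(exc)(new_msg) from exc

With that in place, a caught ValueError from an internal codec call could be re-raised via wrap_codec_error(exc, "decoding with 'hex' codec failed"), producing a new ValueError whose __cause__ is the original exception, while a UnicodeDecodeError or an exception type with a custom constructor would simply propagate unchanged.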